
[bash]
for(n=0; n<kN; n++){
    xx=current_atom.x-atoms[n].x;
    yy=current_atom.y-atoms[n].y;
    zz=current_atom.z-atoms[n].z;
    rr=xx*xx+yy*yy+zz*zz; // squared distance
    if (rr <= cutx && i!=n){
        force=1.0f-rr/cutx;
        cg=current_atom.c*atoms[n].c*332.0716f/sqrt(rr);
        e=cg*force*force;
        energy+=e;
        tf=-e/rr-4.0f*cg*force/cutx;
        dxi+=tf*xx; dyi+=tf*yy; dzi+=tf*zz;
    }
}
[/bash]

With xx, yy, zz and rr I calculate the distance between two atoms (each atom's coordinates and charge are saved in an array of structs). If the distance is within the cutoff, I calculate the force and energy. "kN" is the number of atoms. This kernel needs 18 seconds to calculate all forces and energies between 100,000 atoms.

Then I rewrote the kernel using the float4 data type, which should reduce the calculation time. Here is the whole kernel using float4:

[bash]
__kernel __attribute__((vec_type_hint(float4)))
void calculate_forces(__global float4* atoms, __global float4* forces, const int kN){
    int i=get_global_id(0);
    float cutoff=10.0f;
    float cutx=cutoff*cutoff;
    float4 distance;
    float distance2, force, cg, e, energy, tf, dxi, dyi, dzi;
    float charge_i=atoms[i].w;
    int n;
    dxi=0.0f; dyi=0.0f; dzi=0.0f; energy=0.0f;
    float4 i_atom_distance=(float4)(atoms[i].x, atoms[i].y, atoms[i].z, 0.0f);
    for(n=0; n<kN; n++){
        float4 n_atom_distance=(float4)(atoms[n].x, atoms[n].y, atoms[n].z, 0.0f);
        distance=i_atom_distance-n_atom_distance;
        distance2=distance.x*distance.x+distance.y*distance.y+distance.z*distance.z;
        if (distance2<=cutx && i!=n){
            force=1.0f-distance2/cutx;
            cg=charge_i*atoms[n].w*332.0716f/sqrt(distance2);
            e=cg*force*force;
            energy+=e;
            tf=-e/distance2-4.0f*cg*force/cutx;
            dxi=mad(tf, distance.x, dxi);
            dyi=mad(tf, distance.y, dyi);
            dzi=mad(tf, distance.z, dzi);
        }
    }
    forces[i].x=dxi; forces[i].y=dyi; forces[i].z=dzi; forces[i].w=energy;
}
[/bash]

The atoms' coordinates and charge are now saved in an array of float4 (like this: (float4)(x, y, z, charge)), and the distance calculation is now vectorized. I don't understand why there is no speedup: for 100,000 atoms the float4 version needs 21 seconds, 3 seconds slower than without float4. I'm using Mac OS X Lion, a MacBook Pro 2011 with a Sandy Bridge CPU, and Xcode. Any ideas?


2 Replies


BTW, why use a hard-cast float4? Why not use a template-style function mapping to myriad field types (ints, floats, doubles, etc.)?


With either compiler, you face the standard array-of-structures obstacle to vectorization. You may also need to organize your source code to explicitly minimize the part that is calculated conditionally.

Sandy Bridge AVX doesn't accelerate divide or sqrt significantly, so it would be worthwhile to try several target architectures.
