glodko

Beginner

10-06-2011
06:55 AM

No speed bump after vectorization (OpenCL)

[bash]
for(n=0; n<kN; n++){
    xx=current_atom.x-atoms[n].x;
    yy=current_atom.y-atoms[n].y;
    zz=current_atom.z-atoms[n].z;
    rr=xx*xx+yy*yy+zz*zz; // squared distance
    if (rr <= cutx && i!=n){
        force=1.0f-rr/cutx;
        cg=current_atom.c*atoms[n].c*332.0716f/sqrt(rr);
        e=cg*force*force;
        energy+=e;
        tf=-e/rr-4.0f*cg*force/cutx;
        dxi+=tf*xx; dyi+=tf*yy; dzi+=tf*zz;
    }
}
[/bash]

With xx, yy, zz and rr I calculate the (squared) distance between two atoms (each atom's coordinates and charge are stored in an array of structs). If the atoms are within the cutoff, I calculate the force and energy. "kN" is the number of atoms. This kernel needs 18 seconds to calculate all forces and energies between 100,000 atoms.
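For readers following the arithmetic, the per-pair quantities can be read straight off the loop body. The block below is only a transcription of that code into math, with assumed symbol names: $r^2$ stands for `rr`, $r_c^2$ for `cutx`, $q_i$ and $q_n$ for the two charges, $f$ for `force`, $c_g$ for `cg`, and $t_f$ for `tf`.

```latex
\begin{aligned}
f   &= 1 - \frac{r^2}{r_c^2}, \\
c_g &= \frac{332.0716\, q_i q_n}{r}, \\
e   &= c_g f^2, \\
t_f &= -\frac{e}{r^2} - \frac{4\, c_g f}{r_c^2} = \frac{1}{r}\frac{\partial e}{\partial r}.
\end{aligned}
```

The last equality follows from differentiating $e = 332.0716\, q_i q_n\, f^2 / r$ with respect to $r$, so `tf*xx`, `tf*yy` and `tf*zz` are the Cartesian components of the gradient of `e`. Note that each pair costs one square root and four divisions as written.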

Then I rewrote the kernel using the float4 data type, which I expected to reduce the calculation time. Here is the whole kernel using float4:

[bash]
__kernel __attribute__((vec_type_hint(float4)))
void calculate_forces(__global float4* atoms, __global float4* forces, const int kN){
    int i=get_global_id(0);
    float cutoff=10.0f;
    float cutx=cutoff*cutoff; // squared cutoff
    float4 distance;
    float distance2, force, cg, e, energy, tf, dxi, dyi, dzi;
    float charge_i=atoms[i].w; // charge is stored in .w
    int n=0;
    dxi=0.0f; dyi=0.0f; dzi=0.0f; energy=0.0f;
    float4 i_atom_distance=(float4)(atoms[i].x, atoms[i].y, atoms[i].z, 0.0f);
    for(n=0; n<kN; n++){
        float4 n_atom_distance=(float4)(atoms[n].x, atoms[n].y, atoms[n].z, 0.0f);
        distance=i_atom_distance-n_atom_distance;
        distance2=distance.x*distance.x+distance.y*distance.y+distance.z*distance.z;
        if (distance2<=cutx && i!=n){
            force=1.0f-distance2/cutx;
            cg=charge_i*atoms[n].w*332.0716f/sqrt(distance2);
            e=cg*force*force;
            energy+=e;
            tf=-e/distance2-4.0f*cg*force/cutx;
            dxi=mad(tf, distance.x, dxi);
            dyi=mad(tf, distance.y, dyi);
            dzi=mad(tf, distance.z, dzi);
        }
    }
    forces[i].x=dxi; forces[i].y=dyi; forces[i].z=dzi; forces[i].w=energy;
}
[/bash]

The atoms' coordinates and charges are now stored in an array of float4, as (float4)(x, y, z, charge), and the distance calculation is vectorized. I don't understand why there is no speedup. With float4, 100,000 atoms take 21 seconds, which is 3 seconds slower than without float4. I'm using a 2011 MacBook Pro with a Sandy Bridge CPU, Mac OS X Lion, and Xcode. Any ideas?

2 Replies

kalloyd

Beginner

10-06-2011
09:28 AM

BTW, why use a hard-cast floatx? Why not use a template function that maps to many field types (ints, floats, doubles, etc.)?

TimP

Black Belt

10-07-2011
06:09 AM

With either compiler, you face the standard array-of-structures obstacle to vectorization. You may also need to organize your source code explicitly to minimize the part that is calculated conditionally.

Sandy Bridge AVX doesn't accelerate divide or sqrt significantly, so it would be worth while trying several target architectures.
