Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

No speed bump after vectorization (OpenCL)

glodko
Beginner
301 Views
Hello. I'm working on the N-body problem in OpenCL: I want to calculate the forces and energies between atoms. Here is the part of the kernel source code where most of the computation happens:

[bash]for(n=0; n<kN; n++){
    xx=current_atom.x-atoms[n].x;
    yy=current_atom.y-atoms[n].y;
    zz=current_atom.z-atoms[n].z;

    rr=xx*xx+yy*yy+zz*zz; // squared distance

    if (rr <= cutx && i!=n){

	      force=1.0f-rr/cutx;
	      cg=current_atom.c*atoms[n].c*332.0716f/sqrt(rr);
	      e=cg*force*force;
	      energy+=e;

	      tf =-e/rr-4.0f*cg*force/cutx;
	      dxi+=tf*xx;
	      dyi+=tf*yy;
	      dzi+=tf*zz;
    }
  }
[/bash]
With xx, yy, zz and rr I calculate the squared distance between two atoms (each atom's coordinates and charge are saved in an array of structs). If the distance is within the cutoff, I calculate the force and energy. kN is the number of atoms. This kernel needs 18 seconds to calculate all forces and energies between 100,000 atoms.
Then I rewrote the kernel using the float4 data type, which should reduce the calculation time. Here is the whole kernel using float4:
[bash]
__kernel __attribute__((vec_type_hint(float4)))
void calculate_forces(__global float4* atoms, __global float4* forces, const int kN){

    int i=get_global_id(0);
    float cutoff=10.0f; 
    float cutx=cutoff*cutoff;
    float4 distance;
    float distance2, force, cg, e, energy, tf, dxi, dyi, dzi;
    float charge_i=atoms[i].w;
    int n;

    dxi=0.0f; dyi=0.0f; dzi=0.0f; energy=0.0f;

    float4 i_atom_distance=(float4)(atoms[i].x, atoms[i].y, atoms[i].z, 0.0f);

    for(n=0; n<kN; n++){
        float4 n_atom_distance=(float4)(atoms[n].x, atoms[n].y, atoms[n].z, 0.0f);

        distance=i_atom_distance-n_atom_distance;

        distance2=distance.x*distance.x+distance.y*distance.y+distance.z*distance.z;

        if (distance2<=cutx && i!=n){

            force=1.0f-distance2/cutx;
            cg=charge_i*atoms[n].w*332.0716f/sqrt(distance2);
            e=cg*force*force;
            energy+=e;

            tf =-e/distance2-4.0f*cg*force/cutx;
            dxi=mad(tf, distance.x, dxi);
            dyi=mad(tf, distance.y, dyi);
            dzi=mad(tf, distance.z, dzi);
        }
    }
    forces[i].x=dxi;  forces[i].y=dyi;  forces[i].z=dzi;  forces[i].w=energy;
}
[/bash]

The atom's coordinates and charge are now saved in an array of float4 (like this: (float4)(x, y, z, charge)), and the distance calculation is now vectorized. I don't understand why there is no speedup: for 100,000 atoms the float4 version needs 21 seconds, 3 seconds slower than the version without float4. I'm using Mac OS X Lion on a 2011 MacBook Pro with a Sandy Bridge CPU, and Xcode. Any ideas?
kalloyd
Beginner
See NVIDIA's whitepapers on N-body and particle simulations, especially the parts on partitioning data for parallel processing, which is probably where you're not getting any "bump". There are many approaches to this problem space, from discrete methods (like particle-in-cell) to hybrid continuous/discrete solutions.

BTW, why use a hard-coded float4? Why not use a template-like function applicable to myriad field types (ints, floats, doubles, etc.)?
TimP
Honored Contributor III
What does the compiler vectorization report say? Is there any reason, when running on Sandy Bridge, not to work with a C++ compiler?
With either compiler, you face the standard array-of-structures obstacle to vectorization. You may also need to reorganize your source code to explicitly minimize the part that is calculated conditionally.
Sandy Bridge AVX doesn't accelerate divide or sqrt significantly, so it would be worthwhile to try several target architectures.