Hello, for my thesis I ran a simple code that studies a Lennard-Jones system on a Xeon Phi coprocessor; I tried to vectorize it and to study the changes in execution time.
The machine I used has 61 cores, each with 32 kB of L1 cache and 512 kB of L2 cache, and 512-bit vector registers.
I implemented the code with and without the cell-list method and used different numbers of particles, from 512 up to 16384, doubling each time.
Positions and forces are stored in three separate arrays each (rx, ry, rz and fx, fy, fz).
I get good results in the case without the cell list, but with the cell list I see some strange results.
With the cell-list method the execution time should scale linearly with the number of particles, and indeed plotting time against the number of particles I get a straight line, except that for N = 8192 and N = 16384 the execution time is much higher.
I also ran the code with values of N near those two, and the scaling is correct for every other number; only those two show the problem.
To make it clear, here are some values:
N Time
512 6.14995
1024 11.1381
2048 23.1964
4096 51.9393
6144 78.1251
8192 389.724
10240 144.173
12288 167.772
14336 209.669
16384 822.131
I think it is a technical problem, but I really don't know exactly why it happens.
I also observed a rather small gain from vectorization: without the cell list the speedup is roughly 4x, but with the cell list it is only around 1.5x.
Questions:
Does anybody have an idea of what the problem could be? Why are those particular values anomalous, and why is the vectorization gain so low?
My professor told me that it can happen that some values show strange behaviour at run time; has anybody observed something like this?
Thank you very much.
Below is the main loop in which the forces are evaluated (in short, the main part of the execution), implemented with the cell list.
for(vcy=0; vcy<ncell; vcy++){
  for(vcx=0; vcx<ncell; vcx++){
    previouspartc=0;
    // Central cell index
    c=vcx*ncell+vcy;
    // Offset of the central cell's first particle
    for(p=1; p<=c; p++) previouspartc=previouspartc+npart[p-1];
    // Loop over pairs inside the central cell
    for(i=0; i<npart[c]-1; i++){
      for(j=i+1; j<npart[c]; j++){
        ftempx=0.; ftempy=0.;
        dx = rx1[previouspartc+i]-rx1[previouspartc+j];
        dy = ry1[previouspartc+i]-ry1[previouspartc+j];
        dx = (dx + 0.5*dy)*L;
        dy = dy*halfsq3*L;
        r2 = dx*dx + dy*dy;
        if(r2<r2cut) {
          rr2 = 1./r2;
          rr6 = rr2*rr2*rr2;
          enk += (c12*rr6 - c6)*rr6 - ecut;
          vir = (cf12*rr6 - cf6)*rr6*rr2;
          ftempx = vir*dx;
          ftempy = vir*dy;
        }
        fx1[previouspartc+i] += ftempx;
        fy1[previouspartc+i] += ftempy;
        fx1[previouspartc+j] -= ftempx;
        fy1[previouspartc+j] -= ftempy;
      }
    }
    // Indices vcx1, vcy1 of the neighbour cells (the one on the right and the three below)
    vcx1[0]=vcx+1; vcy1[0]=vcy;
    for(k=1; k<4; k++){
      vcx1[k]=vcx-1+(k-1);
      vcy1[k]=vcy-1;
    }
    // Loop over neighbour cells
    for(k=0; k<4; k++){
      previouspartc1=0;
      // Periodic boundary conditions
      shiftx=0.; shifty=0.;
      if(vcx1[k] < 0){ shiftx = -1; vcx1[k]=ncell-1;}
      else if(vcx1[k] >= ncell){ shiftx = 1; vcx1[k]=0;}
      if(vcy1[k] < 0){ shifty = -1; vcy1[k]=ncell-1;}
      else if(vcy1[k] >= ncell){ shifty = 1; vcy1[k]=0;}
      // Scalar index of the neighbour cell
      c1=vcx1[k]*ncell+vcy1[k];
      // Offset of the neighbour cell's first particle
      for(p=1; p<=c1; p++) previouspartc1=previouspartc1+npart[p-1];
      for(i=0; i<npart[c]; i++){
        for(j=0; j<npart[c1]; j++){
          ftempx=0.; ftempy=0.;
          dx = rx1[previouspartc+i]-(rx1[previouspartc1+j]+shiftx);
          dy = ry1[previouspartc+i]-(ry1[previouspartc1+j]+shifty);
          dx = (dx + 0.5*dy)*L;
          dy = dy*halfsq3*L;
          r2 = dx*dx + dy*dy;
          if(r2<r2cut) {
            rr2 = 1./r2;
            rr6 = rr2*rr2*rr2;
            enk += (c12*rr6 - c6)*rr6 - ecut;
            vir = (cf12*rr6 - cf6)*rr6*rr2;
            ftempx = vir*dx;
            ftempy = vir*dy;
          }
          fx1[previouspartc+i] += ftempx;
          fy1[previouspartc+i] += ftempy;
          fx1[previouspartc1+j] -= ftempx;
          fy1[previouspartc1+j] -= ftempy;
        }
      }
    }
  }
}
You are likely seeing the effects of cache line evictions due to false sharing. Try Googling:
cache "false sharing" MIC site:intel.com
Type the above line as-is into the Google search.
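To illustrate what false sharing looks like (a toy sketch, not your code; it assumes 64-byte cache lines and OpenMP, compiled with -qopenmp):
/* Toy illustration of false sharing. Two threads update adjacent doubles
   that live in the same 64-byte cache line, so every store invalidates the
   other core's copy of that line. Padding pushes each accumulator onto its
   own line. */
#include <omp.h>
#include <stdio.h>

#define CACHE_LINE 64

static double shared_sums[2];                /* adjacent: same cache line, false sharing */

typedef struct {
    double value;
    char pad[CACHE_LINE - sizeof(double)];   /* push the next element onto another line */
} padded_sum_t;

static padded_sum_t padded_sums[2];

int main(void)
{
    #pragma omp parallel num_threads(2)
    {
        int t = omp_get_thread_num();
        for (long i = 0; i < 50000000L; i++) {
            padded_sums[t].value += 1.0;     /* fast: each thread owns its line */
            /* shared_sums[t] += 1.0; */     /* slow: the line ping-pongs between cores */
        }
    }
    printf("%.0f %.0f %.0f\n", padded_sums[0].value, padded_sums[1].value, shared_sums[0]);
    return 0;
}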
Jim Dempsey
Pursuant to Jim's hint: if these loops are in a parallel region (as they must be to get satisfactory performance), you need a bunch of local definitions like
float tempx=0, tempy=0, dx, dy, .....;
unless you have taken the more difficult route of listing them in a private clause.
Having all threads work on shared copies of these variables will produce wrong results and erratic performance, unless you're lucky enough that it crashes.
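A toy, compilable sketch of that point (not the LJ loop): temporaries declared inside the parallel loop body are automatically private to each thread, while variables declared before the region are shared unless listed in a private clause.
#include <omp.h>
#include <stdio.h>

int main(void)
{
    double total = 0.0;                 /* shared, combined safely via reduction */
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < 1000; i++) {
        double tempx = 0.5 * i;         /* private: one copy per thread */
        double tempy = 0.25 * i;
        total += tempx + tempy;
    }
    printf("total = %f\n", total);      /* 0.75 * 999*1000/2 = 374625 */
    return 0;
}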
First of all, thank you for the answers.
I should say that I didn't parallelize the code. I know that parallelization would give better performance, but the aim of my study was to analyze the vectorization of the code shown above.
Given this, can false sharing occur when only one core is used?
Thank you again.
On this forum, search for posts from McCalpin; he gave a good description of the conflicts that can arise when using (multiples of) 4 KB intervals between threads.
A guy walks into the doctor's office and complains "it hurts when I do this". The doctor replies "don't do that".
We hear about tuning to find the "sweet spot".
We seldom hear about avoiding the "sour spot".
You found two "sour spots". It is best to avoid those.
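One way to dodge such a spot (a sketch based on my assumption about your layout, not your allocation code): if rx1, ry1, fx1, fy1 each hold exactly N doubles and N is a power of two, corresponding elements of the different arrays can sit a multiple of 4 KB apart and compete for the same L1 set. Carving the arrays out of one block with a stride that is not a 4 KB multiple staggers them across sets.
#include <stdlib.h>

static double *rx1, *ry1, *fx1, *fy1;

static int alloc_staggered(size_t n)
{
    const size_t line   = 64 / sizeof(double);   /* 8 doubles per cache line   */
    const size_t stride = n + line;              /* pad each array by one line */
    double *block = malloc(4 * stride * sizeof(double));
    if (block == NULL) return -1;
    rx1 = block;
    ry1 = block + 1 * stride;                    /* offsets differ by stride,  */
    fx1 = block + 2 * stride;                    /* i.e. a 4 KB multiple plus  */
    fy1 = block + 3 * stride;                    /* 64 bytes: different sets   */
    return 0;
}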
Jim Dempsey
Here are a couple of additional suggestions that might help:
I assume this is inner-loop vectorization (over j). I think you will want to keep the i atom's fx, fy in registers and update them only once outside the loop (using pragma simd reduction(+:)).
It will also help to make sure that the starting index of the position/force arrays for each cell is aligned for vectorization (see the allocation sketch below).
It might also help to use only aligned accesses for the j loop in the self cell (by zeroing out the contributions for j < i+1), depending on the number of atoms per cell.
It could be that, if you are doing short runs with positions initially on a lattice, a changing cell size and an initially constant number of atoms per cell, then 8192 and 16384 end up as special cases with regard to alignment, trip count, etc.
I assume that the number of cells in x and y is the same. If not, note that the three cells "below" (and the periodic boundary cell) will be increasingly far apart in memory as the number of cells in the fastest dimension grows; check whether there are any special cases there causing an issue.
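A sketch of what I mean by aligned cell starts (assuming the Intel compiler on the coprocessor and double precision; the helper names are mine, not from your code):
#include <stddef.h>
#include <xmmintrin.h>                    /* _mm_malloc / _mm_free */

#define VEC_DOUBLES 8                     /* 512-bit register = 8 doubles */

/* Round each cell's particle count up to a multiple of 8 so every cell's
   first particle starts on a 64-byte boundary; the padding slots never
   hold real particles. */
static size_t padded_count(size_t npart_in_cell)
{
    return (npart_in_cell + VEC_DOUBLES - 1) & ~(size_t)(VEC_DOUBLES - 1);
}

static double *alloc_coord(size_t total_padded_particles)
{
    return (double *)_mm_malloc(total_padded_particles * sizeof(double), 64);
}
With that, previouspartc and previouspartc1 are always multiples of 8, so the compiler can be told the j-loop accesses are aligned (for example with #pragma vector aligned).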
- Mike
Why not also move the +/- of the ftemp's inside the conditional (and remove the zeroing of the temps)?
This way you will not be adding/subtracting 0 to the force accumulators.
For that matter, you could move the [...i] accumulation out one loop level by using a temporary. This would reduce the number of read-modify-writes by about half.
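Roughly like this for the neighbour-cell loop (a sketch only, reusing your variable names; not compiled):
for(i=0; i<npart[c]; i++){
  double fxi = 0., fyi = 0.;               /* i-atom force kept in registers  */
  double xi  = rx1[previouspartc+i];
  double yi  = ry1[previouspartc+i];
#pragma simd reduction(+:fxi,fyi,enk)
  for(j=0; j<npart[c1]; j++){
    double ftx = 0., fty = 0.;
    double dx = xi-(rx1[previouspartc1+j]+shiftx);
    double dy = yi-(ry1[previouspartc1+j]+shifty);
    dx = (dx + 0.5*dy)*L;
    dy = dy*halfsq3*L;
    double r2 = dx*dx + dy*dy;
    if(r2<r2cut) {
      double rr2 = 1./r2;
      double rr6 = rr2*rr2*rr2;
      enk += (c12*rr6 - c6)*rr6 - ecut;
      double vir = (cf12*rr6 - cf6)*rr6*rr2;
      ftx = vir*dx;
      fty = vir*dy;
    }
    fxi += ftx;                            /* register accumulation, no RMW   */
    fyi += fty;
    fx1[previouspartc1+j] -= ftx;          /* only the j side touches memory  */
    fy1[previouspartc1+j] -= fty;
  }
  fx1[previouspartc+i] += fxi;             /* one read-modify-write per i atom */
  fy1[previouspartc+i] += fyi;
}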
Jim Dempsey
I will make an assumption that, for each particle, the number of other particles within r2cut varies greatly. If this is the case, then the amount of work also varies greatly. If you apply standard OpenMP loop partitioning, each thread gets the same number of cells, and this may not be an optimal load balance. You might want to experiment with some heuristics, like saving the number of pairs inside r2cut per cell on each pass of the process loop. Then, on the next pass, divide the work up into chunks of size (sum of the inside-r2cut counts + nThreads-1) / nThreads. Each thread would scan-sum the cell tallies until it reaches its starting cell, then iterate the process loop until the scan-sum of the tallies exceeds its work-size chunk. This requires two sets of within-r2cut tallies (prior and next). A sketch of the partitioning step is below.
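Something along these lines for the partitioning (a sketch with made-up names, assuming you keep a per-cell tally from the previous pass):
#include <stddef.h>

/* Split ncells cells among nthreads threads so each gets roughly equal
   tallied work.  first_cell must have room for nthreads+1 entries; thread t
   then processes cells first_cell[t] .. first_cell[t+1]-1 on the next pass. */
void partition_cells(const long *tally_prev, int ncells, int nthreads,
                     int *first_cell)
{
    long total = 0;
    for (int c = 0; c < ncells; c++) total += tally_prev[c];

    long chunk = (total + nthreads - 1) / nthreads;   /* target work per thread */
    long running = 0;
    int t = 0;
    first_cell[0] = 0;
    for (int c = 0; c < ncells && t < nthreads - 1; c++) {
        running += tally_prev[c];
        if (running >= (long)(t + 1) * chunk)
            first_cell[++t] = c + 1;                  /* next thread starts here */
    }
    while (t < nthreads) first_cell[++t] = ncells;    /* close remaining ranges  */
}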
Jim Dempsey
William B.: I focused on the vectorization of the inner loop, and all the arrays are correctly aligned.
The number of cells is the same in all directions.
I understand that those two values create some kind of particular problem; I'm just trying to understand exactly what it can be, because I really do not know. I also know that running a program on the MIC using only one core is not very common.
Jim: normally the force update would be inside the branch, but then the compiler reports vector dependencies, so I had to use this form.
Riccardo,
Have you experimented with removing the r2cut test?
i.e., if any of the lanes of the vector enter the section, all lanes will have the same latency through it. The test-and-branch to skip the code may be counter-productive, considering the wide vectors on the MIC.
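For instance, the body of the j loop after computing r2 could become something like this (a sketch using your names plus a new temporary "inside"; it assumes r2 is never exactly zero):
rr2 = 1./r2;
rr6 = rr2*rr2*rr2;
inside = (r2 < r2cut) ? 1. : 0.;           /* becomes a vector mask, no branch */
enk   += inside*((c12*rr6 - c6)*rr6 - ecut);
vir    = inside*(cf12*rr6 - cf6)*rr6*rr2;
ftempx = vir*dx;
ftempy = vir*dy;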
Jim Dempsey
Unfortunately I can't use the machine any more; it would be interesting to try.
Riccardo Gnudi.
