Hello, for my thesis I ran a simple code that studies a Lennard-Jones system on a Xeon Phi coprocessor; I tried to vectorize it and to study the changes in execution time.
The machine I used has 61 cores, each with 32 kB of L1 cache and 512 kB of L2 cache, and 512-bit vector registers.
I implemented the code with and without the cell-list method and used different numbers of particles, from 512 up to 16384, doubling each time.
Positions and forces are stored in three separate arrays each (rx, ry, rz and fx, fy, fz).
I get good results in the case without the cell list, but with the cell list I see some strange results.
With the cell-list method the execution time should scale linearly with the number of particles, and indeed plotting time against the number of particles I get a straight line, except that for N = 8192 and N = 16384 the execution time is much higher.
I also ran the code with values of N near those two, and the scaling is correct for every other number; only those two show the problem.
To make it clear, here are some values:
N Time
512 6.14995
1024 11.1381
2048 23.1964
4096 51.9393
6144 78.1251
8192 389.724
10240 144.173
12288 167.772
14336 209.669
16384 822.131
I think it is a technical problem, but I really don't know exactly why it happens.
I also observed a rather small gain from vectorization: without the cell list the speedup is roughly 4x, but with the cell list it is only around 1.5x.
Questions:
Does anybody have an idea of what the problem could be? Why are those particular values anomalous, and why is the vectorization gain so low?
My professor told me that it can happen that some values show strange behaviour at run time; has anybody observed something like this?
Thank you very much.
Below is the main loop in which the forces are evaluated (in short, the main part of the execution), implemented with the cell list.
for(vcy=0; vcy<ncell; vcy++){
  for(vcx=0; vcx<ncell; vcx++){
    previouspartc=0;
    // Central cell index
    c=vcx*ncell+vcy;
    // Offset of the central cell's first particle
    for(p=1; p<=c; p++) previouspartc=previouspartc+npart[p-1];
    // Loop over pairs inside the central cell
    for(i=0; i<npart[c]-1; i++){
      for(j=i+1; j<npart[c]; j++){
        ftempx=0.; ftempy=0.;
        dx = rx1[previouspartc+i]-rx1[previouspartc+j];
        dy = ry1[previouspartc+i]-ry1[previouspartc+j];
        dx = (dx + 0.5*dy)*L;
        dy = dy*halfsq3*L;
        r2 = dx*dx + dy*dy;
        if(r2<r2cut) {
          rr2 = 1./r2;
          rr6 = rr2*rr2*rr2;
          enk += (c12*rr6 - c6)*rr6 - ecut;
          vir = (cf12*rr6 - cf6)*rr6*rr2;
          ftempx = vir*dx;
          ftempy = vir*dy;
        }
        fx1[previouspartc+i] += ftempx;
        fy1[previouspartc+i] += ftempy;
        fx1[previouspartc+j] -= ftempx;
        fy1[previouspartc+j] -= ftempy;
      }
    }
    // Indices vcx1, vcy1 of the neighbour cells (the one on the right and the three below)
    vcx1[0]=vcx+1; vcy1[0]=vcy;
    for(k=1; k<4; k++){
      vcx1[k]=vcx-1+(k-1);
      vcy1[k]=vcy-1;
    }
    // Loop over neighbour cells
    for(k=0; k<4; k++){
      previouspartc1=0;
      // Periodic boundary conditions
      shiftx=0.; shifty=0.;
      if(vcx1[k] < 0){ shiftx = -1; vcx1[k]=ncell-1;}
      else if(vcx1[k] >= ncell){ shiftx = 1; vcx1[k]=0;}
      if(vcy1[k] < 0){ shifty = -1; vcy1[k]=ncell-1;}
      else if(vcy1[k] >= ncell){ shifty = 1; vcy1[k]=0;}
      // Scalar index of the neighbour cell
      c1=vcx1[k]*ncell+vcy1[k];
      // Offset of the neighbour cell's first particle
      for(p=1; p<=c1; p++) previouspartc1=previouspartc1+npart[p-1];
      for(i=0; i<npart[c]; i++){
        for(j=0; j<npart[c1]; j++){
          ftempx=0.; ftempy=0.;
          dx = rx1[previouspartc+i]-(rx1[previouspartc1+j]+shiftx);
          dy = ry1[previouspartc+i]-(ry1[previouspartc1+j]+shifty);
          dx = (dx + 0.5*dy)*L;
          dy = dy*halfsq3*L;
          r2 = dx*dx + dy*dy;
          if(r2<r2cut) {
            rr2 = 1./r2;
            rr6 = rr2*rr2*rr2;
            enk += (c12*rr6 - c6)*rr6 - ecut;
            vir = (cf12*rr6 - cf6)*rr6*rr2;
            ftempx = vir*dx;
            ftempy = vir*dy;
          }
          fx1[previouspartc+i] += ftempx;
          fy1[previouspartc+i] += ftempy;
          fx1[previouspartc1+j] -= ftempx;
          fy1[previouspartc1+j] -= ftempy;
        }
      }
    }
  }
}
You are likely seeing the effects of cache line evictions due to false sharing. Try Googling:
cache "false sharing" MIC site:intel.com
Type the above line as-is into the Google search.
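To illustrate what false sharing looks like (a toy sketch, not your code; it assumes 64-byte cache lines and OpenMP, compiled with -qopenmp):
/* Toy illustration of false sharing. Two threads update adjacent doubles
   that live in the same 64-byte cache line, so every store invalidates the
   other core's copy of that line. Padding pushes each accumulator onto its
   own line. */
#include <omp.h>
#include <stdio.h>

#define CACHE_LINE 64

static double shared_sums[2];                /* adjacent: same cache line, false sharing */

typedef struct {
    double value;
    char pad[CACHE_LINE - sizeof(double)];   /* push the next element onto another line */
} padded_sum_t;

static padded_sum_t padded_sums[2];

int main(void)
{
    #pragma omp parallel num_threads(2)
    {
        int t = omp_get_thread_num();
        for (long i = 0; i < 50000000L; i++) {
            padded_sums[t].value += 1.0;     /* fast: each thread owns its line */
            /* shared_sums[t] += 1.0; */     /* slow: the line ping-pongs between cores */
        }
    }
    printf("%.0f %.0f %.0f\n", padded_sums[0].value, padded_sums[1].value, shared_sums[0]);
    return 0;
}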
Jim Dempsey
Pursuant to Jim's hint: if these loops are in a parallel region (as they must be to get satisfactory performance), you need a bunch of local definitions like
float tempx=0, tempy=0, dx, dy, .....;
unless you have taken the more difficult route of listing them in a private clause.
Having all threads work on shared copies of these variables will produce wrong results and erratic performance, unless you're lucky enough that it crashes.
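A toy, compilable sketch of that point (not the LJ loop): temporaries declared inside the parallel loop body are automatically private to each thread, while variables declared before the region are shared unless listed in a private clause.
#include <omp.h>
#include <stdio.h>

int main(void)
{
    double total = 0.0;                 /* shared, combined safely via reduction */
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < 1000; i++) {
        double tempx = 0.5 * i;         /* private: one copy per thread */
        double tempy = 0.25 * i;
        total += tempx + tempy;
    }
    printf("total = %f\n", total);      /* 0.75 * 999*1000/2 = 374625 */
    return 0;
}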
First of all, thank you for the answers.
I should say that I didn't parallelize the code. I know that parallelization would give better performance, but the aim of my study was to analyze the vectorization of the code shown above.
Given this, can false sharing occur when only one core is used?
Thank you again.
On this forum, search for posts from McCalpin; he gave a good description of the conflicts that can arise when using (multiples of) 4 KB intervals between threads.
A guy walks into the doctor's office and complains "it hurts when I do this". The doctor replies "don't do that".
We hear about tuning to find the "sweet spot".
We seldom hear about avoiding the "sour spot".
You found two "sour spots". It is best to avoid those.
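One way to dodge such a spot (a sketch based on my assumption about your layout, not your allocation code): if rx1, ry1, fx1, fy1 each hold exactly N doubles and N is a power of two, corresponding elements of the different arrays can sit a multiple of 4 KB apart and compete for the same L1 set. Carving the arrays out of one block with a stride that is not a 4 KB multiple staggers them across sets.
#include <stdlib.h>

static double *rx1, *ry1, *fx1, *fy1;

static int alloc_staggered(size_t n)
{
    const size_t line   = 64 / sizeof(double);   /* 8 doubles per cache line   */
    const size_t stride = n + line;              /* pad each array by one line */
    double *block = malloc(4 * stride * sizeof(double));
    if (block == NULL) return -1;
    rx1 = block;
    ry1 = block + 1 * stride;                    /* offsets differ by stride,  */
    fx1 = block + 2 * stride;                    /* i.e. a 4 KB multiple plus  */
    fy1 = block + 3 * stride;                    /* 64 bytes: different sets   */
    return 0;
}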
Jim Dempsey
Here are a couple of additional suggestions that might help:
I assume this is inner-loop vectorization (over j). I think you will want to keep the i atom's fx, fy in registers and update them only once outside the loop (using pragma simd reduction(+:)).
It will also help to make sure that the starting index of the position/force arrays for each cell is aligned for vectorization (see the allocation sketch below).
It might also help to use only aligned accesses for the j loop in the self cell (by zeroing out the contributions for j < i+1), depending on the number of atoms per cell.
It could be that, if you are doing short runs with positions initially on a lattice, a changing cell size and an initially constant number of atoms per cell, then 8192 and 16384 end up as special cases with regard to alignment, trip count, etc.
I assume that the number of cells in x and y is the same. If not, note that the three cells "below" (and the periodic boundary cell) will be increasingly far apart in memory as the number of cells in the fastest dimension grows; check whether there are any special cases there causing an issue.
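A sketch of what I mean by aligned cell starts (assuming the Intel compiler on the coprocessor and double precision; the helper names are mine, not from your code):
#include <stddef.h>
#include <xmmintrin.h>                    /* _mm_malloc / _mm_free */

#define VEC_DOUBLES 8                     /* 512-bit register = 8 doubles */

/* Round each cell's particle count up to a multiple of 8 so every cell's
   first particle starts on a 64-byte boundary; the padding slots never
   hold real particles. */
static size_t padded_count(size_t npart_in_cell)
{
    return (npart_in_cell + VEC_DOUBLES - 1) & ~(size_t)(VEC_DOUBLES - 1);
}

static double *alloc_coord(size_t total_padded_particles)
{
    return (double *)_mm_malloc(total_padded_particles * sizeof(double), 64);
}
With that, previouspartc and previouspartc1 are always multiples of 8, so the compiler can be told the j-loop accesses are aligned (for example with #pragma vector aligned).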
- Mike
Why not also move the +/- of the ftemp's inside the conditional (and remove the zeroing of the temps)?
This way you will not be adding/subtracting 0 to the force accumulators.
For that matter, you could move the [...i] accumulation out one loop level by using a temporary. This would reduce the number of read-modify-writes by about half.
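Roughly like this for the neighbour-cell loop (a sketch only, reusing your variable names; not compiled):
for(i=0; i<npart[c]; i++){
  double fxi = 0., fyi = 0.;               /* i-atom force kept in registers  */
  double xi  = rx1[previouspartc+i];
  double yi  = ry1[previouspartc+i];
#pragma simd reduction(+:fxi,fyi,enk)
  for(j=0; j<npart[c1]; j++){
    double ftx = 0., fty = 0.;
    double dx = xi-(rx1[previouspartc1+j]+shiftx);
    double dy = yi-(ry1[previouspartc1+j]+shifty);
    dx = (dx + 0.5*dy)*L;
    dy = dy*halfsq3*L;
    double r2 = dx*dx + dy*dy;
    if(r2<r2cut) {
      double rr2 = 1./r2;
      double rr6 = rr2*rr2*rr2;
      enk += (c12*rr6 - c6)*rr6 - ecut;
      double vir = (cf12*rr6 - cf6)*rr6*rr2;
      ftx = vir*dx;
      fty = vir*dy;
    }
    fxi += ftx;                            /* register accumulation, no RMW   */
    fyi += fty;
    fx1[previouspartc1+j] -= ftx;          /* only the j side touches memory  */
    fy1[previouspartc1+j] -= fty;
  }
  fx1[previouspartc+i] += fxi;             /* one read-modify-write per i atom */
  fy1[previouspartc+i] += fyi;
}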
Jim Dempsey
I will make an assumption that, for each particle, the number of other particles within r2cut varies greatly. If this is the case, then the amount of work also varies greatly. If you apply standard OpenMP loop partitioning, each thread gets the same number of cells, and this may not be an optimal load balance. You might want to experiment with some heuristics, like saving the number of pairs inside r2cut per cell on each pass of the process loop. Then, on the next pass, divide the work up into chunks of size (sum of the inside-r2cut counts + nThreads-1) / nThreads. Each thread would scan-sum the cell tallies until it reaches its starting cell, then iterate the process loop until the scan-sum of the tallies exceeds its work-size chunk. This requires two sets of within-r2cut tallies (prior and next). A sketch of the partitioning step is below.
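Something along these lines for the partitioning (a sketch with made-up names, assuming you keep a per-cell tally from the previous pass):
#include <stddef.h>

/* Split ncells cells among nthreads threads so each gets roughly equal
   tallied work.  first_cell must have room for nthreads+1 entries; thread t
   then processes cells first_cell[t] .. first_cell[t+1]-1 on the next pass. */
void partition_cells(const long *tally_prev, int ncells, int nthreads,
                     int *first_cell)
{
    long total = 0;
    for (int c = 0; c < ncells; c++) total += tally_prev[c];

    long chunk = (total + nthreads - 1) / nthreads;   /* target work per thread */
    long running = 0;
    int t = 0;
    first_cell[0] = 0;
    for (int c = 0; c < ncells && t < nthreads - 1; c++) {
        running += tally_prev[c];
        if (running >= (long)(t + 1) * chunk)
            first_cell[++t] = c + 1;                  /* next thread starts here */
    }
    while (t < nthreads) first_cell[++t] = ncells;    /* close remaining ranges  */
}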
Jim Dempsey
William B.: I focused on the vectorization of the inner loop, and all the arrays are correctly aligned.
The number of cells is the same in all directions.
I understand that those two values create some kind of particular problem; I'm just trying to understand exactly what it can be, because I really do not know. I also know that running a program on the MIC using only one core is not very common.
Jim: normally the force update would be inside the branch, but then the compiler reports vector dependencies, so I had to use this form.
Riccardo,
Have you experimented with removing the r2cut test?
i.e., if any of the lanes of the vector enter the section, all lanes will have the same latency through it. The test-and-branch to skip the code may be counter-productive, considering the wide vectors on the MIC.
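For instance, the body of the j loop after computing r2 could become something like this (a sketch using your names plus a new temporary "inside"; it assumes r2 is never exactly zero):
rr2 = 1./r2;
rr6 = rr2*rr2*rr2;
inside = (r2 < r2cut) ? 1. : 0.;           /* becomes a vector mask, no branch */
enk   += inside*((c12*rr6 - c6)*rr6 - ecut);
vir    = inside*(cf12*rr6 - cf6)*rr6*rr2;
ftempx = vir*dx;
ftempy = vir*dy;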
Jim Dempsey
Unfortunately I can't use the machine any more; it would be interesting to try.
Riccardo Gnudi.
