Hi,
My OpenMP code doesn't explicitly call any "_mm256_xxxx" intrinsics or use any "__m256" data types.
I find that when I run with 2 or more threads and add "-xCore-AVX2" to the compile flags, I get wrong results.
My platforms are a Haswell Xeon E5-2695 v3 with intel/19.5 on PSC Bridges
and a Skylake Xeon 8160 with intel/18.0.2 on TACC Stampede2.
Both produce the wrong results.
This is my code repo on GitHub.
Uncomment `#define SAVE` in src/lb.h so that the program generates output.
Also, change the Intel library path in `bridges.seq.Makefile` and `bridges.omp.Makefile` (lines 42 and 44 in both files).
I have tried removing the -O3 flag and compiling with only -qopenmp and -xCore-AVX2, but the results are still wrong. The wrong results only appear when I use 2 or more threads.
With 1 OpenMP thread, or with the single-core code, compiling with the AVX flag gives the correct result.
With 2 or more threads, I get correct results without the AVX flag and wrong results as soon as the AVX flag is added.
When using 2 threads on a 24x24 grid,
thread 0 computes iX = 1~12 and
thread 1 computes iX = 13~24.
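To illustrate the decomposition, here is a minimal sketch (not from the repo) that prints which thread owns which iX row under the same schedule clause, assuming my_domain_H works out to 12 for this 24x24 case:

#include <stdio.h>
#include <omp.h>

int main(void) {
    const int lx = 24, my_domain_H = 12;   /* assumed chunk size for 2 threads */
    #pragma omp parallel num_threads(2) default(shared)
    {
        #pragma omp for schedule(static, my_domain_H)
        for (int iX = 1; iX <= lx; ++iX)
            printf("thread %d handles iX=%d\n", omp_get_thread_num(), iX);
    }
    return 0;
}

With schedule(static, 12) and 24 iterations, thread 0 gets iX = 1~12 and thread 1 gets iX = 13~24, matching the split above.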
https://github.com/qoofyk/2d-memory-aware-lbm/blob/master/src/lb.c
The original (sequential / 1-core) version first calls collide(), then propagate():
void collide(Simulation* sim) {
    for (int iX = 1; iX <= sim->lx; ++iX) {
        for (int iY = 1; iY <= sim->ly; ++iY) {
            collideNode(&(sim->lattice[iX][iY]));
        }
    }
}
// apply propagation step with help of temporary memory
void propagate(Simulation* sim) {
    int lx = sim->lx;
    int ly = sim->ly;
    for (int iX = 1; iX <= lx; ++iX) {
        for (int iY = 1; iY <= ly; ++iY) {
            for (int iPop = 0; iPop < 9; ++iPop) {
                int nextX = iX + c[iPop][0];
                int nextY = iY + c[iPop][1];
                sim->tmpLattice[nextX][nextY].fPop[iPop] =
                    sim->lattice[iX][iY].fPop[iPop];
            }
        }
    }
    // exchange lattice and tmpLattice
    Node** swapLattice = sim->lattice;
    sim->lattice = sim->tmpLattice;
    sim->tmpLattice = swapLattice;
}
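For context, c[iPop] is presumably the standard D2Q9 discrete-velocity table, so nextX and nextY differ from iX and iY by at most one cell. The exact ordering in lb.c may differ, but a typical layout is:

/* assumed D2Q9 velocity set; each population iPop streams to the
 * neighbour at (iX + c[iPop][0], iY + c[iPop][1]) */
static const int c[9][2] = {
    { 0, 0},                             /* rest population         */
    { 1, 0}, { 0, 1}, {-1, 0}, { 0,-1},  /* axis-aligned neighbours */
    { 1, 1}, {-1, 1}, {-1,-1}, { 1,-1}   /* diagonal neighbours     */
};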
The origin_omp version first calls collideOMP(), then propagateOMP():
void collideOMP(Simulation* sim) {
#ifdef _OPENMP
    #pragma omp parallel default(shared)
    {
        #pragma omp for schedule(static, my_domain_H)
        for (int iX = 1; iX <= sim->lx; ++iX)
            for (int iY = 1; iY <= sim->ly; ++iY) {
                collideNode(&(sim->lattice[iX][iY]));
            }
    }
#else
    printf("No OPENMP used");
#endif
}
void propagateOMP(Simulation* sim) {
    int lx = sim->lx;
    int ly = sim->ly;
#ifdef _OPENMP
    #pragma omp parallel default(shared)
    {
        #pragma omp for schedule(static, my_domain_H)
        for (int iX = 1; iX <= lx; ++iX)
            for (int iY = 1; iY <= ly; ++iY)
                for (int iPop = 0; iPop < 9; ++iPop) {
                    int nextX = iX + c[iPop][0];
                    int nextY = iY + c[iPop][1];
                    sim->tmpLattice[nextX][nextY].fPop[iPop] =
                        sim->lattice[iX][iY].fPop[iPop];
                }
    }
#else
    printf("No OPENMP used");
#endif
    // exchange lattice and tmpLattice
    Node** swapLattice = sim->lattice;
    sim->lattice = sim->tmpLattice;
    sim->tmpLattice = swapLattice;
}
I simply added #pragma omp parallel around the collide and propagate loops.
I guess that after compiling with AVX2 there may be some race condition in my code, but I don't know where it is.
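In case it helps to localize the problem, a hypothetical helper like the one below (not in the repo; it assumes the Node layout from lb.h, i.e. 9 double populations per node) could compare the lattice of a 1-thread reference run against the multi-threaded run after the same number of steps and report where they diverge, e.g. whether the mismatches cluster around iX = 12/13, the tmpLattice rows that both threads store into (each to different fPop entries):

#include <math.h>
#include <stdio.h>

typedef struct { double fPop[9]; } Node;   /* assumed layout; include lb.h instead in practice */

/* Compare a reference lattice against the OMP-run lattice and report the
 * first few mismatching entries. */
static int compareLattices(Node** ref, Node** run, int lx, int ly, double tol) {
    int nDiff = 0;
    for (int iX = 1; iX <= lx; ++iX)
        for (int iY = 1; iY <= ly; ++iY)
            for (int iPop = 0; iPop < 9; ++iPop) {
                double a = ref[iX][iY].fPop[iPop];
                double b = run[iX][iY].fPop[iPop];
                if (fabs(a - b) > tol) {
                    if (nDiff < 10)
                        printf("mismatch at iX=%d iY=%d iPop=%d: %g vs %g\n",
                               iX, iY, iPop, a, b);
                    ++nDiff;
                }
            }
    return nDiff;
}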