Incorrect Results using OpenMP when compile with AVX flags

Frank_Fu · ‎08-28-2020

Hi,

My OMP code doesn't explicitly write any "_mm256_xxxx" functions or use "_mm256" data type.

I find that in the case running 2 more threads when compile flags add "-xCore-AVX2", I will get wrong results.

My platforms are Haswell Xeon E5-2695v3 with intel/19.5 on PSC bridges

and Skylake Xeon 8160 with intel/18.0.2 on TACC stampede2.

They both produce the wrong results

This is my code repo on GitHub.

Uncomment the `#define SAVE` in src/lb.h, as it will generate output.

Also, change the intel library path in `bridges.seq.Makefile` and `bridges.omp.Makefile`, both in line 42 and 44.

# Preparation
git clone https://github.com/qoofyk/2d-memory-aware-lbm
cd scons
tar xf scons-3.1.1.tar.gz
cd scons-3.1.1
mkdir build
python3 setup.py install --prefix=./build/

# Run the following command to build and verify
cd 2d-memory-aware-lbm/examples/unsteady
sh modify_bridges_Makefile.sh
sh compile.sh
sh verify.sh

I have tried to remove the -O3 flag, so I only use -qopenmp and -xCore-AVX2, but still generate wrong results. I find that the wrong results happened when using 2 or more threads.
Using 1 OMP thread or single-core code, and compile with the "AVX" flag, I get the correct result.

However, when I run with 2 or more threads,
Without the "AVX" flag, I got the correct results.
Once adding AVX, I got the wrong results.

When using 2 threads, given a 24x24 grid.

thread 0 will compute iX=1~12

thread 1 will compute iX=13~24

https://github.com/qoofyk/2d-memory-aware-lbm/blob/master/src/lb.c

Origin (sequential or 1 core) will first call collide(), then call propagate();

void collide(Simulation* sim) {
  for (int iX=1; iX<=sim->lx; ++iX) {
    for (int iY=1; iY<=sim->ly; ++iY) {
      collideNode(&(sim->lattice[iX][iY]));
    }
  }
}

  // apply propagation step with help of temporary memory
void propagate(Simulation* sim) {
    int lx = sim->lx;
    int ly = sim->ly;
    for (int iX=1; iX<=lx; ++iX) {
        for (int iY=1; iY<=ly; ++iY) {
            for (int iPop=0; iPop<9; ++iPop) {
                int nextX = iX + c[iPop][0];
                int nextY = iY + c[iPop][1];
                sim->tmpLattice[nextX][nextY].fPop[iPop] =
                    sim->lattice[iX][iY].fPop[iPop];
            }
        }
    }
    // exchange lattice and tmplattice
    Node** swapLattice = sim->lattice;
    sim->lattice = sim->tmpLattice;
    sim->tmpLattice = swapLattice;
}

Then this is origin_omp will first call collideOMP(), then call propagateOMP()

void collideOMP(Simulation* sim) {
#ifdef _OPENMP
#pragma omp parallel default(shared)
{
  #pragma omp for schedule(static, my_domain_H)
  for (int iX = 1; iX <= sim->lx; ++iX)
    for (int iY = 1; iY <= sim->ly; ++iY) {
      collideNode(&(sim->lattice[iX][iY]));
    }
}
#else
  printf("No OPENMP used");
#endif
}

void propagateOMP(Simulation* sim) {
  int lx = sim->lx;
  int ly = sim->ly;

#ifdef _OPENMP
#pragma omp parallel default(shared)
{
  #pragma omp for schedule(static, my_domain_H)
  for (int iX = 1; iX <= lx; ++iX)
    for (int iY = 1; iY <= ly; ++iY)
      for (int iPop = 0; iPop < 9; ++iPop) {
        int nextX = iX + c[iPop][0];
        int nextY = iY + c[iPop][1];
        sim->tmpLattice[nextX][nextY].fPop[iPop] =
          sim->lattice[iX][iY].fPop[iPop];
      }
}
#else
    printf("No OPENMP used");
#endif
  // exchange lattice and tmplattice
  Node** swapLattice = sim->lattice;
  sim->lattice = sim->tmpLattice;
  sim->tmpLattice = swapLattice;
}

I simply add #prama parallel at the collide and propagate for loop,

I guess my code after compiling with AVX2, there may be some race contention but I don't where is it.

Frank_Fu · ‎08-28-2020

Update:
I comment out the propagate function, now the code will only execute the collide function.

(We can do so in unsteady/origin and unsteady/origin_omp, line 122, comment out the stream_func(&sim);
But still gives me wrong results.

I think the reason stays in the collide function when compiling with AVX running 2 threads or more.

Frank_Fu · ‎08-28-2020

I think the problem lies in this collide function.
Most fluid grid point will call the following "bgk" function.
I currently comment out the pragma clause.
Use AVX, will generate the MOVEB, FMA instructions.
I guess somehow there is a problem here. Maybe FMA instructions in the for-loop of bgk when computing double-precision among several threads cause the floating-point related-error?


void computeMacros(double* f, double* rho, double* ux, double* uy) {
    double upperLine  = f[2] + f[5] + f[6];
    double mediumLine = f[0] + f[1] + f[3];
    double lowerLine  = f[4] + f[7] + f[8];
    *rho = upperLine + mediumLine + lowerLine;
    *ux  = (f[1] + f[5] + f[8] - (f[3] + f[6] + f[7]))/(*rho);
    *uy  = (upperLine - lowerLine)/(*rho);
}

  // compute local equilibrium from rho and u
double computeEquilibrium(int iPop, double rho,
                          double ux, double uy, double uSqr)
{
    double c_u = c[iPop][0]*ux + c[iPop][1]*uy;
    return rho * t[iPop] * (
               1. + 3.*c_u + 4.5*c_u*c_u - 1.5*uSqr
           );
}

  // bgk collision term
void bgk(double* fPop, void* selfData) {
    double omega = *((double*)selfData);
    double rho, ux, uy;
    computeMacros(fPop, &rho, &ux, &uy);
    double uSqr = ux*ux+uy*uy;

    // #pragma vector always
    // #pragma ivdep

    // #pragma omp simd
    for(int iPop = 0; iPop < 9; ++iPop) {
        fPop[iPop] *= (1. - omega);
        fPop[iPop] += omega * computeEquilibrium (
                                  iPop, rho, ux, uy, uSqr );
    }
}

Frank_Fu · ‎08-28-2020

Well, I fixed the bug by setting the initialization obst_r = 1;

Also every for loop use local variables.

AbhishekD_Intel · ‎08-31-2020

Hi,

Thank you for the post and glad to know that you came to know the cause of your issue. Also if you want more details on OpenMP and Vectorization please go through the below link.

https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/optimization-and-programming-guide/openmp-support.html

Please update us if you have any other problem related to this thread.

Warm Regards,

Abhishek

AbhishekD_Intel · ‎09-10-2020

Hi,

Please update us if you have any other problem related to this thread.

Warm Regards,

Abhishek

AbhishekD_Intel · ‎09-21-2020

Hi,

I have not heard back from you, we won't be monitoring this thread. If you need further assistance, please post a new thread.

Warm Regards,

Abhishek