The compiler flag -march=native destroys the program behaviour.

Galyuzov__Andrey · 10-01-2018

Hello. I have a code of about 2000 lines, which I can't post anywhere because it is the result of my current work. I run the code on an Intel Xeon Phi Knights Landing (KNL) 7210 (64 cores) processor and use the Intel C++ compiler (icpc) version 17.0.4. I also run the code on an Intel Core i7 processor, where the version of icpc is 17.0.1. The loops are parallelized and vectorized using OpenMP. For this purpose I use the Intel compiler flags

-DCMAKE_CXX_COMPILER="-march=native -mtune=native -ipo16 -fp-model fast=2 -O3 -qopt-report=5 -mcmodel=large"

(-qopt-report is for the optimization report). If this is not the full set of flags for maximal performance, please correct me and tell me the right flags. I used these flags earlier with simpler programs, and everything worked well on KNL and i7. Now I have a more complicated code, in which big functions are called from OpenMP parallel regions. The problem is that the binary compiled with the set of flags above works well on i7 but gives a Floating Point Exception immediately on KNL. According to GDB, the error is in the line pp+=alpha/rd; of this piece of code:

...

the code above is run in 1 thread

double K1=0.0, P=0.0;

#pragma omp parallel for reduction(+:P_x,P_y,P_z, K1,P)

for(int i=0; i<N; ++i)
{
    P_x+=p[i].vx*p[i].m;
    P_y+=p[i].vy*p[i].m;
    P_z+=p[i].vz*p[i].m;
    K1+=p[i].vx*p[i].vx+p[i].vy*p[i].vy+p[i].vz*p[i].vz;
    float pp=0.0;
#pragma simd reduction(+:pp)
    for(int j=0; j<N; ++j) if(i!=j)
    {
      float rd=sqrt((p[i].x-p[j].x)*(p[i].x-p[j].x)+(p[i].y-p[j].y)*(p[i].y-p[j].y)+(p[i].z-p[j].z)*(p[i].z-p[j].z));
      pp+=alpha/rd;
    }
    P+=pp;
}

...

This piece is at the beginning of the program. alpha is a float constant, p is an array of particles (Particle is a structure of floats), and N is the maximal number of particles.

If I remove the flag "-march=native", or replace it with "-march=knl" or "-march=core-avx2", everything works OK. This flag is doing something bad to the program, but I don't know what.

I found on the Internet (https://software.intel.com/en-us/articles/porting-applications-from-knights-corner-to-knights-landing, https://math-linux.com/linux/tip-of-the-day/article/intel-compilation-for-mic-architecture-knl-knights-landing and https://opus.nci.org.au/display/Help/Intel+Knights+Landing+Compute+Nodes) that one should use the flag "-xMIC-AVX512". I also tried "-axMIC-AVX512", but both give the same error.

So my question is:

1) Why do "-march=native" and "-xMIC-AVX512" not work while "-march=knl" works?

2) May I replace the flag "-march=native" with "-march=knl" when I run the code on KNL (on i7 everything works)? Are they equivalent?

3) Is the set of flags written above optimal for the best performance, and are "-march=native" and "-march=knl" equivalent when the code is run on KNL?

Please, help.

Galyuzov__Andrey · 10-01-2018

The KNL we use is a PC. So, the code is compiled and run on a KNL PC without offload.

TimP · 10-03-2018

If the same issue is reproduced with a current compiler (2018 with updates, or 2019), it looks worth submitting a problem report at the Online Service Center. It may be interesting to see what differences, if any, -qopt-report=4 shows among the options you are trying. I suspect that -march=knl may be ignored, and that only -march=native or -xMIC-AVX512 perform the aggressive optimizations you are requesting. Depending on the length of your inner loop, forcing AVX512 vectorization may not give a performance benefit over 128- or 256-bit wide vectorization. omp simd doesn't guarantee vectorization if, for example, the compiler doesn't find a suitable instruction set choice for the i7. If you have time to experiment, you might try -xMIC-AVX512 with less aggressive optimization, such as -fp-model fast=1 or -fp-model source, to see whether you can find a satisfactory balance between accuracy and performance.

Another possible experiment would be to split the inner loop into (int j=0;j<i;++j) and (int j=i+1;j<N;++j) to see if that improves performance of the i7 code or corrects the failure on AVX512.
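The split can be sketched roughly as follows. This is only an illustration of the idea, not the original code: the Particle layout, the helper inv_distance_term and the function pair_sum_split are made-up names, and a std::vector stands in for the particle array. With two loops the condition i!=j disappears, so no iteration ever sees a zero distance (as long as no two particles coincide).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical minimal particle layout; the real structure has more fields.
struct Particle { float x, y, z; };

// alpha / |a - b| for two distinct particles.
inline float inv_distance_term(const Particle& a, const Particle& b, float alpha) {
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return alpha / std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Sum of alpha / |p[i] - p[j]| over all j != i, with the inner loop
// split at j == i so that no if(i != j) test is needed in the body.
float pair_sum_split(const std::vector<Particle>& p, std::size_t i, float alpha) {
    float pp = 0.0f;
    for (std::size_t j = 0; j < i; ++j)             // particles before i
        pp += inv_distance_term(p[i], p[j], alpha);
    for (std::size_t j = i + 1; j < p.size(); ++j)  // particles after i
        pp += inv_distance_term(p[i], p[j], alpha);
    return pp;
}
```

The simd reduction pragma from the original loop could be placed on each of the two loops; whether that helps performance would have to be checked in the optimization report.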

Loc_N_Intel · 10-05-2018

Hi Andrey,

I wrote a similar program and compiled it on a KNL machine using both Intel Compiler 17.0.2 and 18.0.0. For each compiler version, I compiled the program with the flags -xMIC-AVX512, -march=knl, and -march=native respectively. Here are my observations:

A. Better performance with compiler version 18.

B. Both -xMIC-AVX512 and -march=native give the same performance when compiling on KNL.

C. -march=knl is not valid for compiler version 17.0.2. Thus, this flag is ignored by compiler 17.0.2.

In your case, did you compile your program on Intel core i7 processor or KNL? Can you compile it on a KNL machine and see if it works?

Galyuzov__Andrey · 10-08-2018

I got the answer. In the code I used the feenableexcept() function to detect floating point exceptions:

...

void handler(int sig)
{
    printf("Floating Point Exception\n");
    exit(0);
}

...

int main(int argc, char **argv)
{
    // trap on invalid operation (NaN), division by zero, overflow and underflow:
    feenableexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW | FE_UNDERFLOW);
    signal(SIGFPE, handler);
    ...
}

I was told that if I use feenableexcept(), I shouldn't use the aggressive compiler optimization options "-fp-model fast=1" or "-fp-model fast=2" and "-O2" or "-O3". I should use, for example, "-fp-model precise -fp-model except" and "-O0" or "-O1". Conversely, if aggressive compiler optimization options are used, one shouldn't use feenableexcept(). I checked: if I just comment out feenableexcept() and use the compile line

-DCMAKE_CXX_COMPILER="-march=native -mtune=native -ipo16 -fp-model fast=2 -O3 -qopt-report=5 -mcmodel=large",

everything works fine on KNL.

So the problem was caused by using "-march=native" together with feenableexcept() on KNL (although on i7 this worked). If I comment out feenableexcept(), everything works fine.

If you could explain to me in simple words why "-march=native" with feenableexcept() leads to a floating point exception, or give a link on the Internet where I can find an answer, I would be grateful.
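One plausible mechanism, which nobody in this thread has confirmed, is the following. With -fp-model fast the compiler may assume that floating point exceptions are masked, so when it vectorizes the inner loop it can evaluate alpha/rd in every vector lane, including the lane where i==j and rd is 0, and only afterwards throw that lane away with a mask. The result is still correct, but the division by zero really happens, and feenableexcept(FE_DIVBYZERO) turns it into SIGFPE. The sketch below imitates this by hand in scalar code and only inspects the exception status flag with fetestexcept() instead of trapping. The Particle layout and all function names are made up for illustration, and volatile is used only to keep the optimizer from removing the discarded division.

```cpp
#include <cfenv>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical minimal particle layout; the real structure has more fields.
struct Particle { float x, y, z; };

static float separation(const Particle& a, const Particle& b) {
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Source-level form from the post: the division is guarded by i != j.
float pair_sum_guarded(const std::vector<Particle>& p, std::size_t i, float alpha) {
    float pp = 0.0f;
    for (std::size_t j = 0; j < p.size(); ++j)
        if (i != j) pp += alpha / separation(p[i], p[j]);
    return pp;
}

// Hand-written imitation of a masked vector loop: divide for every j,
// then discard the j == i result.
float pair_sum_speculative(const std::vector<Particle>& p, std::size_t i, float alpha) {
    float pp = 0.0f;
    for (std::size_t j = 0; j < p.size(); ++j) {
        volatile float rd = separation(p[i], p[j]);  // 0 when j == i
        volatile float term = alpha / rd;            // alpha / 0 sets FE_DIVBYZERO
        pp += (i != j) ? term : 0.0f;                // the "mask" drops the bad lane
    }
    return pp;
}

using PairSum = float (*)(const std::vector<Particle>&, std::size_t, float);

// Returns true if calling fn left the divide-by-zero status flag set.
bool raises_divbyzero(PairSum fn, const std::vector<Particle>& p,
                      std::size_t i, float alpha) {
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile float sink = fn(p, i, alpha);
    (void)sink;
    return std::fetestexcept(FE_DIVBYZERO) != 0;
}
```

Both functions return the same sum, but the second one always leaves FE_DIVBYZERO set. An optimizer that ignores floating point exceptions is allowed to turn the first form into something like the second, and whether it does can depend on the target instruction set, which might be why the i7 build behaved differently from the KNL build.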