-no-vec is a pretty big

Vladimir_Dergachev · ‎05-23-2013

I have a very baffling segfault in automatically vectorized code; perhaps someone has seen it before and can offer some suggestions.

First of all, this is plain C++ code, compiled with icpc using the following switches:

-fopenmp -O3 -mmic -openmp -fma -inline-debug-info -fp-model fast -DUSE_SSE=0 -DUSE_AVX=0 -DUSE_MIC=0 -DUSE_PFLOAT=1 -DUSE_RFLOAT=1 -DDEBUG=0 -mcmodel=medium -debug parallel

If I drop -mmic the code runs fine on a Xeon E5 processor. I can also see the error with -O2 and -O. The code executes for a while, running the same instructions, before the segfault occurs.

If I run the code on Xeon Phi it crashes in similar-looking pieces of C code. The time at which it crashes varies with the number of threads and compile settings. I have seen it crash when running single-threaded (but this takes a long time).

C code:

sum = 0.0;
cnt = 0;
for(register int i = shift; i < K; i++){
    if(outlier || outlier[i-shift])
        continue;
    register double y = Yfrag[i-shift];
    register double x = Xfrag;
    register double vary = varF + var*(x+y);
    register double err = x-y;
    if(DEBUG) assert(vary > 0.0);
    sum += err*err/vary;
    cnt++;
}

I tried removing the register keyword, thinking it might confuse the compiler, but the segfault did not change.

Using gdb, the crash occurs at the assembler instruction corresponding to

register double x = Xfrag;

The assembler dump from gdb is

   0x00000000004cfa54 <+42292>: movabs $0x63f680,%r13
   0x00000000004cfa5e <+42302>: vmovapd 0x8(%r11,%r14,1),%zmm30{%k3}
   0x00000000004cfa69 <+42313>: kxnor %k0,%k0
   0x00000000004cfa6d <+42317>: vgatherdpd 0x40(%rcx,%zmm24,8),%zmm31{%k5}
   0x00000000004cfa75 <+42325>: jkzd   0x4cfa87 <Calign::IsRepeatRegion()+42343>,%k5
   0x00000000004cfa7a <+42330>: vgatherdpd 0x40(%rcx,%zmm24,8),%zmm31{%k5}
   0x00000000004cfa82 <+42338>: jknzd 0x4cfa6d <Calign::IsRepeatRegion()+42317>,%k5
   0x00000000004cfa87 <+42343>: vpxorq %zmm2,%zmm2,%zmm2
   0x00000000004cfa8d <+42349>: vaddpd %zmm30,%zmm29,%zmm11
=> 0x00000000004cfa93 <+42355>: vmovapd 0x48(%r11,%r14,1),%zmm2{%k2}
   0x00000000004cfa9e <+42366>: vsubpd %zmm29,%zmm30,%zmm25
   0x00000000004cfaa4 <+42372>: vaddpd %zmm2,%zmm31,%zmm3
   0x00000000004cfaaa <+42378>: vpermf32x4 $0x0,%zmm27,%zmm1
   0x00000000004cfab1 <+42385>: vpermf32x4 $0x0,%zmm28,%zmm12
   0x00000000004cfab8 <+42392>: vsubpd %zmm31,%zmm2,%zmm26

What particularly puzzles me is what exactly causes the segfault. The register values are

(gdb) print $r11
$1 = 0
(gdb) print $r14
$2 = 140726616854456
(gdb) print $zmm2
$3 = {v16_float = {0 <repeats 16 times>}, v8_double = {0, 0, 0, 0, 0, 0, 0, 0}, v64_int8 = {0 <repeats 64 times>}, v32_int16 = {0 <repeats 32 times>}, v16_int32 = {0 <repeats 16 times>}, v8_int64 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_uint128 = {0, 0, 0, 0}}
(gdb) print $k2
$4 = 0
(gdb) print ((double *)$r14)[0]
$5 = 5.4553999999999974
(gdb) print ((double *)$r14)[1]
$6 = 8.5977999999999781
(gdb) print ((double *)$r14)[8]
$7 = 0
(gdb) print ((double *)$r14)[9]
$8 = 0
(gdb) print ((double *)$r14)[10]
$9 = 0
(gdb) print ((double *)$r14)[16]
$10 = 0
(gdb) print ((double *)$r14)[17]
$11 = 0
(gdb) print ((double *)$r14)[20]
$12 = 0

These values indicate that the instruction should succeed. Is there any problem with vmovapd when the final address is aligned, but the value in the register is not? Also, what is the meaning of k2=0?

thank you very much

Vladimir Dergachev

PS: This is a Xeon Phi, stepping B0, with 8 GB RAM, passively cooled. The temperature never exceeded 70 degrees, and there is plenty of unused RAM.
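
The alignment question can be checked by hand from the gdb values. Below is a minimal C++ sketch of that arithmetic, assuming a full-width zmm access with vmovapd needs a 64-byte aligned effective address; the helper names are made up for illustration and are not part of the original code. With r11 = 0 and r14 = 140726616854456, the register alone sits 56 bytes past a 64-byte boundary, but adding the 0x8 or 0x48 displacement lands exactly on one.

```cpp
#include <cstdint>

// Assumed alignment requirement for a full 512-bit (64-byte) zmm access.
constexpr std::uint64_t kZmmAlignment = 64;

// Effective address of an AT&T-syntax operand written as disp(base,index,scale).
constexpr std::uint64_t effective_address(std::uint64_t base, std::uint64_t index,
                                          std::uint64_t scale, std::uint64_t disp) {
    return base + index * scale + disp;
}

// True when address is a multiple of alignment.
constexpr bool is_aligned(std::uint64_t address, std::uint64_t alignment) {
    return address % alignment == 0;
}

// Register values copied from the gdb session in the post.
constexpr std::uint64_t r11 = 0;
constexpr std::uint64_t r14 = 140726616854456ULL;

// The register value alone is 56 bytes past a 64-byte boundary.
static_assert(r14 % kZmmAlignment == 56, "r14 alone is not 64-byte aligned");

// Both vmovapd operands from the dump resolve to 64-byte aligned addresses.
static_assert(is_aligned(effective_address(r11, r14, 1, 0x8), kZmmAlignment),
              "0x8(%r11,%r14,1) is aligned");
static_assert(is_aligned(effective_address(r11, r14, 1, 0x48), kZmmAlignment),
              "0x48(%r11,%r14,1) is aligned");
```

Under that assumption both loads in the dump use aligned addresses, so the misaligned register value by itself does not look like the problem.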

Vladimir_Dergachev · ‎05-23-2013

Update: adding the -no-vec option produces code that works correctly on Xeon Phi.

Also, when the segfault occurs (with vectorization enabled) the Linux kernel reports it as "error 4".
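
For reference, the number the kernel prints is a bit mask. The sketch below decodes it, assuming the coprocessor's Linux kernel reports the standard x86 page-fault error code (bit 0 set = protection violation rather than page not present, bit 1 set = write, bit 2 set = user mode). The struct and function names are hypothetical.

```cpp
#include <cstdint>

// Decoded view of the low three bits of an x86 page-fault error code.
struct PageFaultInfo {
    bool protection_violation;  // bit 0: false means the page was not present
    bool write_access;          // bit 1: false means the access was a read
    bool user_mode;             // bit 2: false means the fault came from kernel mode
};

// Split the error code printed in the kernel log into its flag bits.
constexpr PageFaultInfo decode_page_fault(std::uint32_t error_code) {
    return PageFaultInfo{(error_code & 0x1u) != 0,
                         (error_code & 0x2u) != 0,
                         (error_code & 0x4u) != 0};
}
```

Under that assumption, decode_page_fault(4) describes a user-mode read of a page that is not present.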

Bernard · ‎05-23-2013

What do r11 and r14 contain when this instruction executes? 0x00000000004cfa5e <+42302>: vmovapd 0x8(%r11,%r14,1),%zmm30{%k3}

Vladimir_Dergachev · ‎05-24-2013

(gdb) print $r11
$1 = 0
(gdb) print $r14
$2 = 140726616854456

Frances_R_Intel · ‎05-24-2013

If you can make a small version of your program that demonstrates the problem, I will pass it on to the developers. I know this might not be an easy thing to do.

Vladimir_Dergachev · ‎05-24-2013

I'll keep an eye on it. For now I switched to using -no-vec and using #pragma simd in places which need optimization.

The confusing issue is what exactly causes the segfault. Given that the code runs for a while, it is likely that just isolating the code with the segfault will not help.
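
As a rough illustration of the "-no-vec plus #pragma simd" approach mentioned above, here is a minimal sketch of a hot loop marked for explicit vectorization. The function is hypothetical and not taken from the actual code base. The pragma is Intel-specific, so it is guarded to keep other compilers quiet.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical hot loop: sum of squared differences between two arrays.
double sum_squared_diff(const double* x, const double* y, std::size_t n) {
    assert(x != nullptr && y != nullptr);
    double sum = 0.0;
    // Explicitly request vectorization of this loop and declare sum as a
    // reduction variable. Only the Intel compiler sees the pragma.
#if defined(__INTEL_COMPILER)
#pragma simd reduction(+:sum)
#endif
    for (std::size_t i = 0; i < n; ++i) {
        const double err = x[i] - y[i];
        sum += err * err;
    }
    return sum;
}
```

Only loops annotated this way are asked to vectorize; everything else is left to whatever -no-vec produces.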

Frances_R_Intel · ‎05-24-2013

-no-vec is a pretty big hammer to use as a workaround. Have you tried simplifying the other options you are using on the command line? I don't know what all of your defines are for (seeing "-DUSE_SSE=0 -DUSE_AVX=0 -DUSE_MIC=0" all on the same compile line does seem strange to me), but as far as the Intel compiler options go, have you tried limiting them to just "-openmp -mmic"?

Vladimir_Dergachev · ‎05-24-2013

Frances Roth (Intel) wrote:

-no-vec is a pretty big hammer to use as a workaround. Have you tried simplifying the other options you are using on the command line? I don't know what all of your defines are for (seeing "-DUSE_SSE=0 -DUSE_AVX=0 -DUSE_MIC=0" all on the same compile line does seem strange to me), but as far as the Intel compiler options go, have you tried limiting them to just "-openmp -mmic"?

The defines are just to turn on and off sections of hand-coded intrinsics in our code.

Yes, I tried running with just -mmic -openmp, same thing.

Vladimir_Dergachev · ‎05-24-2013

Are there any more fine-grained ways to limit automatic vectorization besides -no-vec? Maybe this will shed some light on the problem.

Frances_R_Intel · ‎05-28-2013

You can disable vectorization on a per loop basis using !DIR$ NOVECTOR.

You can provide the compiler with more information about the nature of an individual loop that you want vectorized using !DIR$ VECTOR [clause[,clause[,..]]] where clause can be things like UNALIGNED to warn the compiler not to make assumptions about data alignment when vectorizing the loop. (This is one of the things that can cause memory addressing to go wrong inside a loop.)

You can use !DIR$ ATTRIBUTES VECTOR [:clause] on an individual loop where clause can be things like NOMASK. You had asked what the k2 was for - it is a mask. The section of code you copied doesn't show k2 being set, so I don't know why it is 0.

These directives have other options you can play around with but I would try the UNALIGNED and then the NOMASK first and finally the NOVECTOR if necessary.

Let me know what happens.

Vladimir_Dergachev · ‎05-28-2013

Great, thanks!

Sprinkling #pragma vector unaligned through the function where the segfault happens fixed the problem. I wonder whether this has something to do with the function being a member of a C++ class; most of our other functions are plain C.
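
For readers looking for the C/C++ spelling of the per-loop directives, here is a minimal sketch. The functions are hypothetical and only loosely modeled on the loop in the first post (in particular, indexing both arrays with i is an assumption). #pragma vector unaligned and #pragma novector are Intel compiler pragmas, so they are guarded for other compilers.

```cpp
#include <cassert>
#include <cstddef>

constexpr bool kDebug = false;  // stands in for the DEBUG define in the post

// Hypothetical chi-square style sum over two arrays of length n.
double chi_square(const double* x, const double* y, std::size_t n,
                  double varF, double var) {
    double sum = 0.0;
    // Tell the vectorizer not to assume aligned data for this loop.
#if defined(__INTEL_COMPILER)
#pragma vector unaligned
#endif
    for (std::size_t i = 0; i < n; ++i) {
        const double vary = varF + var * (x[i] + y[i]);
        const double err = x[i] - y[i];
        if (kDebug) assert(vary > 0.0);
        sum += err * err / vary;
    }
    return sum;
}

// Hypothetical loop with vectorization switched off for this loop only.
void scale_in_place(double* data, std::size_t n, double factor) {
#if defined(__INTEL_COMPILER)
#pragma novector
#endif
    for (std::size_t i = 0; i < n; ++i) {
        data[i] *= factor;
    }
}
```

Each pragma applies only to the loop that immediately follows it, which is what makes it a finer tool than -no-vec.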

Frances_R_Intel · ‎05-28-2013

My apologies for answering you in Fortran; I'm glad you were able to translate it into C/C++. Fortran is my native language and sometimes I slip up. And yes, it might have something to do with the function being part of a C++ class. But, as I said, my native language is Fortran, so I will leave it to others to expand on that subject.

Bernard · ‎05-29-2013

Sorry for the late answer. I see that the problem was resolved.

Vladimir_Dergachev · ‎05-29-2013

Fortran has its advantages ;)