I have a very baffling segfault in automatically vectorized code. Perhaps someone has seen it before and can offer some suggestions.
First of all, this is plain C++ code, compiled with icpc using the following switches:
-fopenmp -O3 -mmic -openmp -fma -inline-debug-info -fp-model fast -DUSE_SSE=0 -DUSE_AVX=0 -DUSE_MIC=0 -DUSE_PFLOAT=1 -DUSE_RFLOAT=1 -DDEBUG=0 -mcmodel=medium -debug parallel
If I drop -mmic, the code runs fine on a Xeon E5 processor. I also see the error with -O2 and -O. The code runs the same instructions for a while before the segfault occurs.
On the Xeon Phi it crashes in similar-looking pieces of C code. When it crashes varies with the number of threads and the compile settings. I have even seen it crash when running single-threaded (but that takes a long time).
C code:
sum = 0.0;
cnt = 0;
for(register int i = shift; i < K; i++){
  if(outlier[i] || outlier[i-shift])
    continue;
  register double y = Yfrag[i-shift];
  register double x = Xfrag[i];
  register double vary = varF + var*(x+y);
  register double err = x-y;
  if(DEBUG) assert(vary > 0.0);
  sum += err*err/vary;
  cnt++;
}
I tried removing the register keyword, thinking that it might confuse the compiler, but the segfault did not change.
Using gdb, I found that the crash occurs in the assembler instruction corresponding to
register double x = Xfrag[i];
The assembler dump from gdb is:
0x00000000004cfa54 <+42292>: movabs $0x63f680,%r13
0x00000000004cfa5e <+42302>: vmovapd 0x8(%r11,%r14,1),%zmm30{%k3}
0x00000000004cfa69 <+42313>: kxnor %k0,%k0
0x00000000004cfa6d <+42317>: vgatherdpd 0x40(%rcx,%zmm24,8),%zmm31{%k5}
0x00000000004cfa75 <+42325>: jkzd 0x4cfa87 <Calign::IsRepeatRegion()+42343>,%k5
0x00000000004cfa7a <+42330>: vgatherdpd 0x40(%rcx,%zmm24,8),%zmm31{%k5}
0x00000000004cfa82 <+42338>: jknzd 0x4cfa6d <Calign::IsRepeatRegion()+42317>,%k5
0x00000000004cfa87 <+42343>: vpxorq %zmm2,%zmm2,%zmm2
0x00000000004cfa8d <+42349>: vaddpd %zmm30,%zmm29,%zmm11
=> 0x00000000004cfa93 <+42355>: vmovapd 0x48(%r11,%r14,1),%zmm2{%k2}
0x00000000004cfa9e <+42366>: vsubpd %zmm29,%zmm30,%zmm25
0x00000000004cfaa4 <+42372>: vaddpd %zmm2,%zmm31,%zmm3
0x00000000004cfaaa <+42378>: vpermf32x4 $0x0,%zmm27,%zmm1
0x00000000004cfab1 <+42385>: vpermf32x4 $0x0,%zmm28,%zmm12
0x00000000004cfab8 <+42392>: vsubpd %zmm31,%zmm2,%zmm26
What puzzles me in particular is what exactly causes the segfault. The register values are:
(gdb) print $r11
$1 = 0
(gdb) print $r14
$2 = 140726616854456
(gdb) print $zmm2
$3 = {v16_float = {0 <repeats 16 times>}, v8_double = {0, 0, 0, 0, 0, 0, 0, 0}, v64_int8 = {0 <repeats 64 times>}, v32_int16 = {0 <repeats 32 times>}, v16_int32 = {0 <repeats 16 times>}, v8_int64 = {0, 0, 0,
0, 0, 0, 0, 0}, v4_uint128 = {0, 0, 0, 0}}
(gdb) print $k2
$4 = 0
(gdb) print ((double *)$r14)[0]
$5 = 5.4553999999999974
(gdb) print ((double *)$r14)[1]
$6 = 8.5977999999999781
(gdb) print ((double *)$r14)[8]
$7 = 0
(gdb) print ((double *)$r14)[9]
$8 = 0
(gdb) print ((double *)$r14)[10]
$9 = 0
(gdb) print ((double *)$r14)[16]
$10 = 0
(gdb) print ((double *)$r14)[17]
$11 = 0
(gdb) print ((double *)$r14)[20]
$12 = 0
This indicates that the load should succeed. Could vmovapd have a problem when the final address is aligned but the value in the register is not? Also, what is the meaning of k2=0?
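One way to double-check the alignment reading is to compute the effective addresses of the two vmovapd loads from the gdb register values. In AT&T syntax, disp(%base,%index,scale) addresses disp + base + index * scale. The sketch below assumes that a full 512-bit vmovapd needs a 64-byte-aligned address; the helper names are illustrative only.

```cpp
#include <cassert>
#include <cstdint>

// Register values printed by gdb at the faulting instruction.
constexpr std::uint64_t r11 = 0;
constexpr std::uint64_t r14 = 140726616854456ULL;

// True if addr is a multiple of align (align must be a power of two).
constexpr bool is_aligned(std::uint64_t addr, std::uint64_t align) {
    return (addr & (align - 1)) == 0;
}

// AT&T operand disp(%base,%index,scale) addresses disp + base + index * scale.
constexpr std::uint64_t effective_address(std::uint64_t disp, std::uint64_t base,
                                          std::uint64_t index, std::uint64_t scale) {
    return disp + base + index * scale;
}

// Assumption: a full 512-bit vmovapd requires 64-byte alignment.
constexpr std::uint64_t kZmmAlign = 64;

static_assert(r14 % kZmmAlign == 56, "r14 alone is only 8-byte aligned");
static_assert(is_aligned(effective_address(0x8, r11, r14, 1), kZmmAlign),
              "vmovapd 0x8(%r11,%r14,1) address is 64-byte aligned");
static_assert(is_aligned(effective_address(0x48, r11, r14, 1), kZmmAlign),
              "vmovapd 0x48(%r11,%r14,1) address is 64-byte aligned");
```

Both effective addresses come out as multiples of 64 even though r14 itself is 56 bytes past a 64-byte boundary, which agrees with the reading that the faulting load's address is aligned.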
Thank you very much,
Vladimir Dergachev
PS: This is a Xeon Phi, stepping B0, with 8 GB of RAM (plenty of it unused), passively cooled; the temperature never exceeded 70 degrees.
Update: adding the -no-vec option produces code that works correctly on the Xeon Phi.
Also, when the segfault occurs (with vectorization enabled), the Linux kernel reports it as "error 4".
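For readers decoding that message: in the Linux kernel's "segfault at ..." log line on x86, the "error" value is the hardware page-fault error code, a small bit field. A minimal sketch of a decoder, assuming the standard x86 bit layout applies on this coprocessor (the constant names and describe_pf_error are hypothetical, chosen for illustration):

```cpp
#include <cassert>
#include <string>

// x86 page-fault error-code bits (the value Linux prints as "error N").
constexpr unsigned kPfProt  = 1u << 0;  // 0: page not present, 1: protection violation
constexpr unsigned kPfWrite = 1u << 1;  // 0: read access, 1: write access
constexpr unsigned kPfUser  = 1u << 2;  // 0: kernel mode, 1: user mode
constexpr unsigned kPfRsvd  = 1u << 3;  // reserved bit set in a paging-structure entry
constexpr unsigned kPfInstr = 1u << 4;  // fault on an instruction fetch

// Turns an error code into a short human-readable description.
std::string describe_pf_error(unsigned code) {
    std::string s = (code & kPfUser) ? "user-mode " : "kernel-mode ";
    if (code & kPfInstr)      s += "instruction fetch";
    else if (code & kPfWrite) s += "write";
    else                      s += "read";
    s += (code & kPfProt) ? ", protection violation" : ", page not present";
    if (code & kPfRsvd) s += ", reserved bit set";
    return s;
}
```

Under that layout, describe_pf_error(4) yields "user-mode read, page not present".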
What are the contents of r11 and r14 when this instruction executes?
0x00000000004cfa5e <+42302>: vmovapd 0x8(%r11,%r14,1),%zmm30{%k3}
(gdb) print $r11
$1 = 0
(gdb) print $r14
$2 = 140726616854456
If you can make a small version of your program that demonstrates the problem, I will pass it on to the developers. I know this might not be an easy thing to do.
I'll keep an eye on it. For now I have switched to -no-vec and use #pragma simd in the places that need optimization.
What confuses me is what exactly causes the segfault. Given that the code runs for a while before crashing, isolating just the code where the segfault happens is unlikely to reproduce it.
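A sketch of the -no-vec plus #pragma simd combination applied to the loop from the first post. It assumes a hypothetical free function chi2_simd (the original code is a member of Calign), assumed parameter types, and the outlier[i] / Xfrag[i] subscripts as reconstructed. Intel's #pragma simd requests explicit vectorization of the next loop, and the reduction clause declares sum and cnt as accumulators; the continue is rewritten as an if block only as a stylistic choice.

```cpp
#include <cassert>

// Hypothetical wrapper around the loop from the first post.
// Xfrag and outlier need at least K entries; Yfrag needs at least K - shift entries.
double chi2_simd(const double* Xfrag, const double* Yfrag, const int* outlier,
                 int shift, int K, double varF, double var, int* cnt_out) {
    double sum = 0.0;
    int cnt = 0;
    // Intel C/C++ explicit-vectorization pragma; other compilers ignore it
    // (possibly with an unknown-pragma warning) and compile a scalar loop.
#pragma simd reduction(+:sum, cnt)
    for (int i = shift; i < K; i++) {
        // Only accumulate samples not flagged as outliers.
        if (!(outlier[i] || outlier[i - shift])) {
            double y = Yfrag[i - shift];
            double x = Xfrag[i];
            double vary = varF + var * (x + y);  // assumed > 0, as the DEBUG assert checks
            double err = x - y;
            sum += err * err / vary;
            cnt++;
        }
    }
    *cnt_out = cnt;
    return sum;
}
```

For example, with Xfrag = {1, 2, 3, 4}, Yfrag = {1, 1, 1, 1}, no outliers, shift = 1, K = 4, varF = 1 and var = 0, the loop sums 1 + 4 + 9 = 14 over 3 samples.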
-no-vec is a pretty big hammer to use as a workaround. Have you tried simplifying the other options you are using on the command line? I don't know what all of your defines are for (seeing " -DUSE_SSE=0 -DUSE_AVX=0 -DUSE_MIC=0" all on the same compile line does seem strange to me), but as far as the Intel compiler options go, have you tried limiting them to just "-openmp -mmic"?
In reply to Frances Roth (Intel):
The defines just turn sections of hand-coded intrinsics in our code on and off.
Yes, I tried compiling with just -mmic -openmp; same result.
Are there more fine-grained ways to limit automatic vectorization than -no-vec? Maybe that will shed some light on the problem.
You can disable vectorization on a per-loop basis using !DIR$ NOVECTOR.
You can give the compiler more information about an individual loop that you want vectorized using !DIR$ VECTOR [clause[,clause[,..]]], where a clause can be, for example, UNALIGNED, which tells the compiler not to make assumptions about data alignment when vectorizing the loop. (Wrong alignment assumptions are one of the things that can make memory addressing go wrong inside a loop.)
You can use !DIR$ ATTRIBUTES VECTOR [:clause] on an individual procedure, which makes it a SIMD-enabled (vector) function, where a clause can be, for example, NOMASK. You asked what k2 is for: it is a mask register. The section of code you copied doesn't show k2 being set, so I don't know why it is 0.
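To make the mask question concrete, here is a scalar model of what a write-masked load such as vmovapd 0x48(%r11,%r14,1),%zmm2{%k2} does with its data: lane j of the destination is loaded only if bit j of the mask is set. Under this model, with k2 = 0 no lane is written, so zmm2 keeps the zeros left by the preceding vpxorq %zmm2,%zmm2,%zmm2, consistent with the gdb output. The Zmm type and masked_load are hypothetical, and the sketch deliberately says nothing about whether the hardware still checks the address of a fully masked load.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Eight double lanes, standing in for one 512-bit zmm register.
using Zmm = std::array<double, 8>;

// Scalar model of a write-masked load: only lanes whose mask bit is set are
// overwritten from memory; the others keep their previous contents.
// For 8 doubles only the low 8 mask bits are used.
void masked_load(Zmm& dst, const double* src, std::uint16_t mask) {
    for (int j = 0; j < 8; j++)
        if (mask & (1u << j))
            dst[j] = src[j];
}
```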
These directives have other options you can play around with, but I would try UNALIGNED first, then NOMASK, and finally NOVECTOR if necessary.
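Since the code in question is C++, note that the Intel C/C++ spelling of NOVECTOR is #pragma novector, placed immediately before the loop that should stay scalar. A minimal sketch on a hypothetical helper (scale_novec is illustrative only, not from the thread); compilers other than the Intel one simply ignore the unknown pragma, possibly with a warning.

```cpp
#include <cassert>

// Multiplies n elements of a by s; the pragma asks the Intel compiler
// not to vectorize the loop that immediately follows it.
void scale_novec(double* a, int n, double s) {
#pragma novector
    for (int i = 0; i < n; i++)
        a[i] *= s;
}
```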
Let me know what happens.
Great, thanks!
Sprinkling #pragma vector unaligned in the function where the segfault happens fixed the problem. I wonder whether this has something to do with the function being a member of a C++ class; most of our other functions are plain C.
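A sketch of what that fix looks like on the loop from the first post, assuming a hypothetical free-function wrapper chi2_unaligned (the real code is a Calign member function; the parameter types and the outlier[i] / Xfrag[i] subscripts are reconstructions). #pragma vector unaligned applies to the loop that follows and tells the Intel compiler not to assume aligned data when vectorizing it.

```cpp
#include <cassert>

// Hypothetical wrapper around the loop from the first post.
// Xfrag and outlier need at least K entries; Yfrag needs at least K - shift entries.
double chi2_unaligned(const double* Xfrag, const double* Yfrag, const int* outlier,
                      int shift, int K, double varF, double var, int* cnt_out) {
    double sum = 0.0;
    int cnt = 0;
    // Intel C/C++ pragma: vectorize the next loop without assuming aligned data.
    // Other compilers ignore it (possibly with an unknown-pragma warning).
#pragma vector unaligned
    for (int i = shift; i < K; i++) {
        if (outlier[i] || outlier[i - shift])
            continue;  // skip samples flagged as outliers
        double y = Yfrag[i - shift];
        double x = Xfrag[i];
        double vary = varF + var * (x + y);  // assumed > 0, as the DEBUG assert checks
        double err = x - y;
        sum += err * err / vary;
        cnt++;
    }
    *cnt_out = cnt;
    return sum;
}
```

The pragma only changes how the loop is vectorized, not what it computes, so the result should match the same loop without it.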
My apologies for answering you in Fortran; I'm glad you were able to translate it into C/C++. Fortran is my native language and sometimes I slip up. And yes, it might have something to do with the function being part of a C++ class. But, as I said, my native language is Fortran, so I will leave it to others to expand on that subject.
Sorry for the late answer. I see that the problem has been resolved.
Fortran has its advantages ;)
