Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

icc generates code that throws unexpected floating point exception

djunglas
New Contributor I

Hello,

I have run into something that looks like icc generating invalid optimized code. Basically, I have the following code:

  feenableexcept(FE_DIVBYZERO);
  ...
   if ( chg4 ) {
      printf("Change 4\n");
      for (j = 0; j < len2; ++j) {
         if ( chg4[j] > 0 )
            maxpenalty = XMAX (maxpenalty, 1.0 / chg4[j]);
         if ( chg4[j] <= 0.0 || base->data->d2 >= 1e20 )
            continue;
         maplen++;
         totlen1++;
      }
   }

Here 'chg4' is an array of doubles whose elements are all 0, 'len2' is the length of the array, and 'XMAX' is a macro computing the maximum of its arguments.
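For completeness, a minimal self-contained reproducer along these lines looks like this (a sketch, not the attached test case; the XMAX definition and array size are filled in here for illustration, and with gcc you may also need -lm for feenableexcept):

  #define _GNU_SOURCE              /* feenableexcept() is a GNU extension */
  #include <fenv.h>
  #include <stdio.h>

  #define XMAX(a, b) ((a) > (b) ? (a) : (b))

  int main(void)
  {
     double chg4[16] = { 0.0 };       /* all elements zero, as in the report */
     int    len2 = 16, j;
     double maxpenalty = 0.0;

     feenableexcept(FE_DIVBYZERO);    /* turn divide-by-zero into a SIGFPE trap */

     for (j = 0; j < len2; ++j) {
        if ( chg4[j] > 0 )            /* guard: never divide by a non-positive value */
           maxpenalty = XMAX (maxpenalty, 1.0 / chg4[j]);
     }
     printf("maxpenalty = %g\n", maxpenalty);
     return 0;
  }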

If I compile this code with 'icc -g' and run it, everything works as expected. However, when I compile it with 'icc -O' or just 'icc', running the code raises a floating point exception (division by zero). In gdb I get this:

(gdb) run
...
Program received signal SIGFPE, Arithmetic exception.
0x0000000000400fd9 in wrapper ()
(gdb) disassemble
...
   0x0000000000400fbb <+747>:	movaps 0x214e(%rip),%xmm7        # 0x403110
   0x0000000000400fc2 <+754>:	movaps 0x2157(%rip),%xmm0        # 0x403120
   0x0000000000400fc9 <+761>:	movslq %ecx,%rdi
   0x0000000000400fcc <+764>:	movaps %xmm2,%xmm10
   0x0000000000400fd0 <+768>:	movaps %xmm5,%xmm11
   0x0000000000400fd4 <+772>:	movaps (%r8,%rdi,8),%xmm9
=> 0x0000000000400fd9 <+777>:	divpd  %xmm9,%xmm10
   0x0000000000400fde <+782>:	cmpltpd %xmm9,%xmm11
   0x0000000000400fe4 <+788>:	cmplepd %xmm5,%xmm9
...
(gdb) p $xmm9
$1 = {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {
    0 <repeats 16 times>}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 
    0, 0, 0}, v2_int64 = {0, 0}, uint128 = 0}
(gdb) p $xmm10
$2 = {v4_float = {0, 1.875, 0, 1.875}, v2_double = {1, 1}, v16_int8 = {0, 0, 
    0, 0, 0, 0, -16, 63, 0, 0, 0, 0, 0, 0, -16, 63}, v8_int16 = {0, 0, 0, 
    16368, 0, 0, 0, 16368}, v4_int32 = {0, 1072693248, 0, 1072693248}, 
  v2_int64 = {4607182418800017408, 4607182418800017408}, 
  uint128 = 0x3ff00000000000003ff0000000000000}
(gdb)

I took a quick look at the generated assembler code, and it looks like the offending divpd instruction is in a part that corresponds to an optimized, vectorized version of the loop above; the code indeed attempts to compute 1.0/chg4[j], thereby producing a division by zero. I think this is a bug, since my source code explicitly checks that we never perform a division when the denominator would be zero.

Here is information about my environment and how I build things:

djunglas@MACHINE:~/fpebug> uname -a
Linux MACHINE 3.0.80-0.7-default #1 SMP Tue Jun 25 18:32:49 UTC 2013 (25740f8) x86_64 x86_64 x86_64 GNU/Linux
djunglas@MACHINE:~/fpebug> $ICCPATH/12.1/composer_xe_2011_sp1.11.339/bin/intel64/icc --version
icc (ICC) 12.1.5 20120612
Copyright (C) 1985-2012 Intel Corporation.  All rights reserved.

djunglas@MACHINE:~/fpebug> $ICCPATH/12.1/composer_xe_2011_sp1.11.339/bin/intel64/icc -O -c -o main.o main.c
djunglas@MACHINE:~/fpebug> $ICCPATH/12.1/composer_xe_2011_sp1.11.339/bin/intel64/icc -O -c -o function.o function.c
djunglas@MACHINE:~/fpebug> objdump -D -r function.o > function.txt
djunglas@MACHINE:~/fpebug> $ICCPATH/12.1/composer_xe_2011_sp1.11.339/bin/intel64/icc -o fpebug main.o function.o
djunglas@MACHINE:~/fpebug> objdump -D -r fpebug > fpebug.txt
djunglas@MACHINE:~/fpebug> ./fpebug
Change 4
Floating point exception

I have attached the source code as well as object dumps of the object file and the binary. I would be very happy if someone could tell me whether this is expected behavior or indeed a bug in icc, and I would also be happy to learn a way to work around the problem. Disabling FE_DIVBYZERO is not an option right now, and I would also like to keep -O.

Thanks a lot,

Daniel

TimP
Honored Contributor III
If you wish to avoid optimization by inversion, you must set an option which implies -prec-div, such as -fp-model with a setting other than fast. If you are used to gcc, bear in mind that many aggressive optimizations which gcc reserves for -ffast-math come in with icc at -O, unless you set -fp-model source.
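For example, reusing the build line from the post above:

  icc -O -fp-model source -c -o function.o function.c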
djunglas
New Contributor I

Thank you for your help, Tim.

Indeed, compiling with '-fp-model' set to strict, precise or source seems to fix the problem. However, compiling with just '-prec-div' or '-O -prec-div' does not; I still get those floating point exceptions. Shouldn't '-prec-div' avoid that if inversion is the issue here? Also, I am not clear why inversion would be at play here. Isn't the real problem that the division code is executed at all?

Another related question: is there a way to figure out which options are implied by -O or -O2? The man page is somewhat vague on that topic; from it I would not have inferred that -O implies -no-prec-div.

Thanks,

Daniel

TimP
Honored Contributor III
The gradual underflow setting may be implicated: -no-ftz is included in those -fp-model settings. -fp-model fast is implied by -O2 and includes -no-prec-div, -no-prec-sqrt, -ftz and other aggressive settings. -no-ftz should have no performance impact on CPUs with AVX support, but the option's default isn't keyed to the architecture setting. Note that -ftz/-no-ftz takes effect only when compiling main().
djunglas
New Contributor I

Not sure if I understood you correctly, but in any case compiling with -no-ftz as the only extra compiler flag does not fix the problem.

I took a closer look at the assembler code that icc generates for the offending loop:

      for (j = 0; j < len2; ++j) {
         if ( chg4[j] > 0 )
            maxpenalty = XMAX (maxpenalty, 1.0 / chg4[j]);
         if ( chg4[j] <= 0.0 || base->data->d2 >= 1e20 )
            continue;
         maplen++;
         totlen1++;
      }

If optimization is allowed, icc splits this up into three parts (sketched in C after the list):

  1. Process elements from chg4[] one at a time until &chg4[j] is properly aligned.
  2. Now process elements in chg4[] two at a time.
  3. Process any remaining elements in chg4[], again one at a time.
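
In C terms the structure is roughly the following (a sketch of the strategy, not the actual generated code; uintptr_t is from <stdint.h>):

   /* 1: scalar peel loop until &chg4[j] is 16-byte aligned */
   for (j = 0; j < len2 && ((uintptr_t)&chg4[j] & 15) != 0; ++j)
      /* original scalar body */;
   /* 2: vectorized main loop, two doubles per iteration (SSE2) */
   for (; j + 1 < len2; j += 2)
      /* vector body, processes chg4[j] and chg4[j+1] */;
   /* 3: scalar remainder loop */
   for (; j < len2; ++j)
      /* original scalar body */;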

The assembler code for parts 1 and 3 is pretty straightforward, basically a 1:1 translation of the source code. In particular, the generated code explicitly tests chg4[j] > 0 before forming the quotient 1.0/chg4[j]. The interesting part is the generated code for part 2 (my assembler is a bit rusty, so I hope I got things right). The loop body as emitted by icc essentially is:

# xmm1 stores maxpenalty here
# xmm5 is { 0.0, 0.0 } here

 2f9:	48 63 f9             	movsxd rdi,ecx                   # ecx is the loop variable, i.e. j
 2fc:	44 0f 28 d2          	movaps xmm10,xmm2                # load xmm10 with { 1.0, 1.0 }
 300:	44 0f 28 dd          	movaps xmm11,xmm5                # load xmm11 with { 0.0, 0.0 }
 304:	45 0f 28 0c f8       	movaps xmm9,XMMWORD PTR [r8+rdi*8] # load xmm9 with chg4[j] and chg4[j+1]
 309:	66 45 0f 5e d1       	divpd  xmm10,xmm9                # compute 1.0/chg4[j] and 1.0/chg4[j+1], store result in xmm10
 30e:	66 45 0f c2 d9 01    	cmpltpd xmm11,xmm9               # compare chg4[j],chg4[j+1] against 0
 314:	66 44 0f c2 cd 02    	cmplepd xmm9,xmm5                # compare chg4[j],chg4[j+1] against 0 (opposite direction of previous line)
 31a:	66 44 0f 5f d1       	maxpd  xmm10,xmm1                # xmm10 = max(quotients, maxpenalty)
 31f:	66 44 0f ef cf       	pxor   xmm9,xmm7                 # xor with the constant mask in xmm7
 324:	45 0f 54 d3          	andps  xmm10,xmm11               # keep the new maximum where chg4[j],chg4[j+1] > 0
 328:	66 41 0f 50 f1       	movmskpd esi,xmm9
 32d:	44 0f 55 d9          	andnps xmm11,xmm1                # keep the old maxpenalty where chg4[j],chg4[j+1] <= 0
 331:	41 0f 28 ca          	movaps xmm1,xmm10
 335:	41 0f 56 cb          	orps   xmm1,xmm11                # merge: xmm1 = updated maxpenalty
 ...
 368:	83 c1 02             	add    ecx,0x2                   # advance j
 ...
 370:	3b ca                	cmp    ecx,edx                   # compare j against the trip count
 372:	66 45 0f c2 e2 02    	cmplepd xmm12,xmm10
 378:	66 45 0f 70 e9 08    	pshufd xmm13,xmm9,0x8
 37e:	66 45 0f 70 cc 08    	pshufd xmm9,xmm12,0x8
 384:	66 44 0f db eb       	pand   xmm13,xmm3
 389:	66 44 0f db cb       	pand   xmm9,xmm3
 38e:	66 44 0f ef cc       	pxor   xmm9,xmm4
 393:	66 45 0f db e9       	pand   xmm13,xmm9
 398:	66 45 0f fa c5       	psubd  xmm8,xmm13                # update the vectorized maplen/totlen1
 39d:	66 41 0f fa f5       	psubd  xmm6,xmm13                # counters using the condition mask
 3a2:	0f 82 51 ff ff ff    	jb     2f9 <wrapper+0x2f9>       # continue loop

As you can see, this code always and unconditionally computes 1.0/chg4[j] and 1.0/chg4[j+1] (see the divpd at 0x309). Only after computing the quotients does it compare chg4[j] and chg4[j+1] against 0. Instructions 31a-335 then make sure that the maximum of maxpenalty in xmm1 is correct even if chg4[j] or chg4[j+1] is non-positive. So it seems pretty obvious to me that this code will raise a division-by-zero exception as soon as any element of chg4[] is zero. And I don't see how underflows or inversions would be involved here.
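
In C terms, the vectorized loop body computes something like this (a hypothetical scalar rendering of the vector code, for illustration only):

   for (j = 0; j < len2; j += 2) {   /* two elements per iteration */
      double q0 = 1.0 / chg4[j];     /* divpd: unconditional, traps if an   */
      double q1 = 1.0 / chg4[j+1];   /* element is 0 and FE_DIVBYZERO is on */
      /* maxpd computes the candidate maximum; cmpltpd/cmplepd produce the
         masks; andps/andnps/orps implement a branchless select:            */
      maxpenalty = (chg4[j]   > 0.0) ? XMAX (maxpenalty, q0) : maxpenalty;
      maxpenalty = (chg4[j+1] > 0.0) ? XMAX (maxpenalty, q1) : maxpenalty;
   }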

The optimized code emitted by icc produces the correct results but has a very different side effect than the non-optimized code: it raises a floating point exception. I am worried about this because we have a very large code base and may have many situations like the above, in which we explicitly check for != 0 in the source code precisely to avoid division-by-zero exceptions. Since our product is a library, we have no control over whether FE_DIVBYZERO is enabled, so we must avoid raising this exception.

What is the strategy of icc's optimizer in situations like the above? Assuming the original code is guaranteed never to divide by 0, does the optimizer make any effort to maintain that invariant?

Tom_T_
Novice

Try the -fp_speculation=safe option

 

djunglas
New Contributor I

Hello Tim and Tom,

thank you for your help, and sorry for the late reply. I had to run some extensive performance benchmarks to see how the various compiler options you suggested affect the performance of the optimized code in our library. Here are the results:

  • Using any of '-fp-model (precise|strict|source)' or '-fp_speculation=safe' fixes the problem: no more FE_DIVBYZERO exceptions.
  • The performance impact of the -fp-model options is quite bad: the code runs up to 4% slower when compiled with any of them.
  • Option -fp_speculation=safe does not seem to have any performance impact on our code (see the compile line below).
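
Concretely, the change amounts to one extra flag on the build lines shown earlier:

  $ICCPATH/12.1/composer_xe_2011_sp1.11.339/bin/intel64/icc -O -fp_speculation=safe -c -o function.o function.c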

We will run a few more tests, but if nothing else comes up we will go with -fp_speculation=safe. Thank you again for helping me get rid of this issue.

Daniel

jimdempseyatthecove
Honored Contributor III

Daniel

You can also insert "#pragma novector" in front of the offending loop, rather than applying safe speculation to the entire program (compilation unit).
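
For example, applied to the loop from the original post (a sketch):

   #pragma novector   /* suppress vectorization for the loop that follows */
   for (j = 0; j < len2; ++j) {
      if ( chg4[j] > 0 )
         maxpenalty = XMAX (maxpenalty, 1.0 / chg4[j]);
      if ( chg4[j] <= 0.0 || base->data->d2 >= 1e20 )
         continue;
      maplen++;
      totlen1++;
   }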

Jim Dempsey

djunglas
New Contributor I

Thanks for the hint, Jim.

The problem is that our code is several hundred thousand lines long, and we may have more places where this problem can occur. Going through all the code to find potentially problematic loops is not an option right now, so using safe speculation everywhere seems the safest bet at the moment.

jimdempseyatthecove
Honored Contributor III

Then you may be throwing out vectorization altogether. Earlier you expressed concern about a 4% loss in performance.

Are you aware that you can set the project (default) optimization properties one way, then set the file optimization differently on a file-by-file basis? IOW, set the project to safe speculation, run VTune or another profiler, locate the hot spots, then tweak up the optimizations case by case.

At issue here, as you have pointed out, is that the compiler optimization performed the division unconditionally and then used the compare and masked move (in this case via and/xor/...) to merge the results. Your loop expressly had a test to avoid dividing by zero. IMO, the policy of performing this optimization in a loop containing an if followed by a divide is incorrect. In this situation the programmer should be required to insert a #pragma to activate the optimization (e.g. when the test is not there as divide-by-zero protection and the loop is known not to produce a DIV/0).

IOW, in loops containing an "if" followed by a divide where a vectorization opportunity is discovered, the default policy should be not to vectorize and instead optionally issue an informational report of the missed vectorization opportunity. Note, this applies only to loops containing a divide.

As you have noted, in an application with hundreds of thousands of lines it becomes problematic to find errors introduced by optimization (especially when the source code is defensive against the error).

Jim Dempsey

Tom_T_
Novice

Happily, the -fp_speculation=safe option does not preclude vectorization; indeed, its purpose is to allow the usual aggressive optimizations except for those which (in effect) move a potentially trapping operation out from under a condition.

Unhappily, the option is poorly documented and, it would appear, little used, probably because code that enables floating point exception trapping is relatively rare.

But it works great.

djunglas
New Contributor I

jimdempseyatthecove wrote:

Then you may be throwing out vectorization altogether. Earlier you expressed concern about a 4% loss in performance.

We only observed this performance hit when changing the floating point model. With safe speculation we did not observe any performance hit.

jimdempseyatthecove wrote:

Are you aware that you can set the project (default) optimization properties one way, then set the file optimization differently on a file-by-file basis? IOW, set the project to safe speculation, run VTune or another profiler, locate the hot spots, then tweak up the optimizations case by case.

Yes, I'm aware of that. We compile each file separately on the command line, so we can easily change options on a per-file basis.

jimdempseyatthecove wrote:

At issue here, as you have pointed out, is that the compiler optimization performed the division unconditionally and then used the compare and masked move (in this case via and/xor/...) to merge the results. Your loop expressly had a test to avoid dividing by zero. IMO, the policy of performing this optimization in a loop containing an if followed by a divide is incorrect. In this situation the programmer should be required to insert a #pragma to activate the optimization (e.g. when the test is not there as divide-by-zero protection and the loop is known not to produce a DIV/0).

IOW, in loops containing an "if" followed by a divide where a vectorization opportunity is discovered, the default policy should be not to vectorize and instead optionally issue an informational report of the missed vectorization opportunity. Note, this applies only to loops containing a divide.

I don't know what the strategy is in general, but in our special case I think what you say is at least debatable: here the numerator is always 1.0, so division by 0 is well-defined in IEEE 754 (it yields +Inf and sets a status flag). In this special case it therefore seems sort of reasonable to assume that performing the division anyway will not cause trouble, and enabling FE_DIVBYZERO trapping looks sort of non-standard to me. I would be happy if Intel changed the compiler in the way you described, but I am not sure that the current behavior is really a bug.
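
A tiny program demonstrating that point (standard C behavior, with trapping left at its default):

   #include <stdio.h>

   int main(void)
   {
      double zero = 0.0;            /* in a variable so the compiler cannot fold it */
      /* With trapping disabled (the default), IEEE 754 defines 1.0/0.0 as +Inf
         and merely sets the divide-by-zero status flag. */
      printf("%g\n", 1.0 / zero);   /* prints "inf" */
      return 0;
   }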

djunglas
New Contributor I

Tom T. wrote:

Happily, the -fp_speculation=safe option does not preclude vectorization; indeed, its purpose is to allow the usual aggressive optimizations except for those which (in effect) move a potentially trapping operation out from under a condition.

That is very good to know. Thank you for clarifying that.
