djunglas · ‎01-29-2016

Hello,

I have this piece of code

asm("nop"); asm("nop"); asm("nop"); asm("nop");
double const c = a - b;
double const d = -c;
double const f = d * e;
data->field += f;
asm("nop"); asm("nop"); asm("nop"); asm("nop");
#if !defined(BAD)
printf ("#### new field at %d: %f/%f/%llx\n", __LINE__,
        data->field, data->field, *(unsigned long long *)&data->field);
printf ("#### %f/%g/%llx %f/%g/%llx %f/%g/%llx %f/%g/%llx\n",
       data->field, data->field, *(unsigned long long *)&data->field,
       a, a, *(unsigned long long *)&a, b, b, *(unsigned long long *)&b,
       e, e, *(unsigned long long *)&e);
#endif

So d is basically b-a. However, if b-a is not representable, my application requires d to be an upper bound on the exact value. Therefore the whole code is executed with floating-point rounding mode FE_DOWNWARD: c=a-b is rounded down, so d=-c is an upper bound on b-a. It is not OK to compute d=b-a directly (which would give a lower bound on the exact value).

If BAD is not defined then I get this assembly code:

  610141:       90                      nop
  610142:       90                      nop
  610143:       90                      nop
  610144:       90                      nop
  610145:       f2 0f 10 44 24 40       movsd  0x40(%rsp),%xmm0
  61014b:       f2 0f 5c 44 24 48       subsd  0x48(%rsp),%xmm0
  610151:       0f 57 05 08 63 97 00    xorps  0x976308(%rip),%xmm0        # f86460 <.L_2il0floatpacket.43+0x80>
  610158:       f2 0f 59 44 24 10       mulsd  0x10(%rsp),%xmm0
  61015e:       f2 41 0f 58 44 24 20    addsd  0x20(%r12),%xmm0
  610165:       f2 41 0f 11 44 24 20    movsd  %xmm0,0x20(%r12)
  61016c:       90                      nop
  61016d:       90                      nop
  61016e:       90                      nop
  61016f:       90                      nop

This is more or less a literal translation of the C code, and my application works correctly in this case. If I define BAD then I get this assembly code instead:

  610086:       90                      nop
  610087:       90                      nop
  610088:       90                      nop
  610089:       90                      nop
  61008a:       f2 0f 10 44 24 38       movsd  0x38(%rsp),%xmm0
  610090:       f2 0f 5c 44 24 40       subsd  0x40(%rsp),%xmm0
  610096:       0f 57 05 b3 61 97 00    xorps  0x9761b3(%rip),%xmm0        # f86250 <.L_2il0floatpacket.43+0x80>
  61009d:       0f 57 05 ac 61 97 00    xorps  0x9761ac(%rip),%xmm0        # f86250 <.L_2il0floatpacket.43+0x80>
  6100a4:       f2 0f 59 04 24          mulsd  (%rsp),%xmm0
  6100a9:       f2 41 0f 58 44 24 20    addsd  0x20(%r12),%xmm0
  6100b0:       f2 41 0f 11 44 24 20    movsd  %xmm0,0x20(%r12)
  6100b7:       90                      nop
  6100b8:       90                      nop
  6100b9:       90                      nop
  6100ba:       90                      nop

and my application does not behave as expected (it computes wrong results). The two xorps instructions already look suspicious to me. As far as I understand, they XOR xmm0 twice with the same value, so together they are essentially a no-op with respect to the result in xmm0. Single-stepping through the code in gdb, I can see the following in the case with BAD not defined:

(gdb) info registers rsp 
rsp            0x7fffffffa010	0x7fffffffa010
(gdb) print *(double *)(0x7fffffffa010 + 0x40)
$1 = 5000000
(gdb) print *(double *)(0x7fffffffa010 + 0x48)
$2 = 0

while with BAD defined I get

(gdb) info registers rsp
rsp            0x7fffffffa020	0x7fffffffa020
(gdb) print *(double *)(0x7fffffffa020 + 0x38)
$2 = 0
(gdb) print *(double *)(0x7fffffffa020 + 0x40)
$3 = 5000000

Thus, without BAD the code computes d as stated in the C code, while with BAD the code computes d directly as d=b-a. The latter will round inexact values in the wrong direction, which will in turn produce incorrect results in my application.

I am using

icc (ICC) 12.1.5 20120612
Copyright (C) 1985-2012 Intel Corporation. All rights reserved.

I compile with

-O -fno-builtin-strlen -fno-builtin-strcat -fno-builtin-strcmp -fno-builtin-strcpy -fno-builtin-strncat -fno-builtin-strncmp -fno-builtin-strrchr -m64 -fPIC -fno-strict-aliasing -diag-disable 1419 -w1 -Wcheck -Wall -Wmissing-declarations -Wmissing-prototypes -Wshadow -vec-report0 -fp-model strict

I have

#pragma fenv_access(on)

at the top-level of my source code.

Am I missing anything here or is this indeed an invalid optimization?

Thanks,

Daniel

Gabriele_J_Intel · ‎01-31-2016

Hi Daniel,

You are correct about the observed behavior. It is doing the optimization you complain about, directly computing "d = b - a".

I cannot judge at the moment if there is really an illegal optimization, particularly since you are not using parens.

Could you try using parens? This should prevent the compiler from reassociation, with the flags that you are using.

An interesting article on this subject is here:

https://software.intel.com/en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler/

If this does not help, I can dig into this further. A test case that reproduces your issue would help.

Regards,

Gabriele

djunglas · ‎01-31-2016

Hi Gabriele,

thank you for looking into this. I am not clear what you mean by "using parens"? Where should I put parens? All my expressions have at most two operands so I am unclear where to put them.

After reading the article you pointed to I am pretty sure that the compiler is not allowed to perform this optimization. I am compiling with '-fp-model strict' which (according to the documentation) implies '-fp-model precise'. For this setting the article explicitly says:

In SAFE (precise) mode, the compiler may not make any transformations that could affect the result.

Obviously the result is altered if the rounding mode is FE_DOWNWARD, so the optimization is illegal with my flags/pragmas.

I already tried to create a small test case but so far I failed. I will try again but I am not optimistic. So it would be nice if you could look into this even without a small test case.

Thanks,

Daniel

djunglas · ‎02-01-2016

Hi Gabriele,

I sent you a private message with a small test case that reproduces the problem.

Daniel

TimP · ‎02-01-2016

Did you check whether -fp-model strict is having the desired effect? With so many options in the command string, some not spelled in accordance with current documentation, I would have some concerns.

djunglas · ‎02-01-2016

At least "-fp-model strict" is processed by the compiler. If I drop that from the command line I get this error for '#pragma fenv_access(on)':

error: fenv_access cannot be enabled except in precise, source, double, and extended modes
#pragma fenv_access(on)

It clearly does not have the desired effect since according to the documentation it should imply '-fp-model precise' which should in turn avoid exactly the optimization that causes my headaches here.

I have stripped the compiler options to

-O -m64 -fPIC -fno-strict-aliasing -fp-model strict

but the behavior is still the same. I have also added '#pragma float_control(precise,on)' and '#pragma fp_contract(off)' (added them one at a time and simultaneously). That did not change anything either.

TimP · ‎02-01-2016

In icc 16.x there is also -Qprotect-parens which seems to be possibly more strict about associativity than /fp:strict.

Gabriele_J_Intel · ‎02-12-2016

Hi Daniel,

I have reproduced the problem with a small test case and have escalated it to engineering. It is under investigation and I will keep you posted on the progress.

Regards,

Gabriele

djunglas · ‎02-18-2016

With a lot of help from Gabriele we have established the following:

- This is indeed a bug in the version of icc we use.
- The invalid optimization does not happen when compiling with -O0. For our application this is however not an option since it degrades performance significantly -- even if we refactor our code so that -O0 can be applied only to the functions that actually change the floating point rounding mode.
- The invalid optimization does not happen when compiling with -mieee-fp. Like -O0 this has a negative impact on performance but this impact is a lot smaller. If nothing else helps we may be willing to accept this small degradation.

djunglas · ‎06-14-2016

Sorry to come back to this. I am now facing the same problem on Windows. But on Windows option -mieee-fp does not exist.

With -fp:precise or -fp:strict the problem persists (even if I put '#pragma fenv_access(on)' into the code).

Again, the problem goes away if I compile with -Od (equivalent to -O0 on Windows) but the performance degradation of this is too big. So what is the equivalent of -mieee-fp on Windows or what is the flag to use to avoid this bad optimization on Windows? I am still on version 'icc (ICC) 12.1.5 20120612'.

Thanks a lot!

Daniel

Anoop_M_Intel · ‎06-14-2016

Hi Daniel

Please try the following option on Windows: https://software.intel.com/en-us/node/525037.

Thanks and Regards
Anoop

Anoop_M_Intel · ‎06-14-2016

I just did a sanity check for the /fltconsistency option on Windows, but it looks like this option is supported only for Fortran and not for C++. So please disregard my previous comment.

Thanks and Regards
Anoop

jimdempseyatthecove · ‎06-15-2016

The good news is Intel has a reproducer, the bad news is djunglas does not know if this same bug appears elsewhere in his(her) code.

If the bug is localized, this may optimize and work satisfactorily:

#if 0
double const c = a - b;
double const d = -c;
#else
double const d = b - a;
#endif
double const f = d * e;
data->field += f;
// **** if c required later
double const c = -d;

Jim Dempsey

Invalid floating point optimization?