I have this piece of code
asm("nop"); asm("nop"); asm("nop"); asm("nop");
double const c = a - b;
double const d = -c;
double const f = d * e;
data->field += f;
asm("nop"); asm("nop"); asm("nop"); asm("nop");
#if !defined(BAD)
printf ("#### new field at %d: %f/%f/%llx\n", __LINE__,
        data->field, data->field, *(unsigned long long *)&data->field);
printf ("#### %f/%g/%llx %f/%g/%llx %f/%g/%llx %f/%g/%llx\n",
        data->field, data->field, *(unsigned long long *)&data->field,
        a, a, *(unsigned long long *)&a,
        b, b, *(unsigned long long *)&b,
        e, e, *(unsigned long long *)&e);
#endif
So d is basically b-a. However, if b-a is not representable, my application requires d to be an upper bound on the exact value. Therefore the whole code is executed with floating-point rounding mode FE_DOWNWARD, and it is not OK to compute d=b-a directly (which would give a lower bound on the exact value).
If BAD is not defined then I get this assembly code:
610141: 90                      nop
610142: 90                      nop
610143: 90                      nop
610144: 90                      nop
610145: f2 0f 10 44 24 40       movsd  0x40(%rsp),%xmm0
61014b: f2 0f 5c 44 24 48       subsd  0x48(%rsp),%xmm0
610151: 0f 57 05 08 63 97 00    xorps  0x976308(%rip),%xmm0   # f86460 <.L_2il0floatpacket.43+0x80>
610158: f2 0f 59 44 24 10       mulsd  0x10(%rsp),%xmm0
61015e: f2 41 0f 58 44 24 20    addsd  0x20(%r12),%xmm0
610165: f2 41 0f 11 44 24 20    movsd  %xmm0,0x20(%r12)
61016c: 90                      nop
61016d: 90                      nop
61016e: 90                      nop
61016f: 90                      nop
This is more or less a literal translation of the C code, and my application works correctly in this case. If I define BAD then I get this assembly code instead:
610086: 90                      nop
610087: 90                      nop
610088: 90                      nop
610089: 90                      nop
61008a: f2 0f 10 44 24 38       movsd  0x38(%rsp),%xmm0
610090: f2 0f 5c 44 24 40       subsd  0x40(%rsp),%xmm0
610096: 0f 57 05 b3 61 97 00    xorps  0x9761b3(%rip),%xmm0   # f86250 <.L_2il0floatpacket.43+0x80>
61009d: 0f 57 05 ac 61 97 00    xorps  0x9761ac(%rip),%xmm0   # f86250 <.L_2il0floatpacket.43+0x80>
6100a4: f2 0f 59 04 24          mulsd  (%rsp),%xmm0
6100a9: f2 41 0f 58 44 24 20    addsd  0x20(%r12),%xmm0
6100b0: f2 41 0f 11 44 24 20    movsd  %xmm0,0x20(%r12)
6100b7: 90                      nop
6100b8: 90                      nop
6100b9: 90                      nop
6100ba: 90                      nop
and my application does not behave as expected (it computes wrong results). The two xorps statements already look suspicious to me. As far as I understand they perform two xor with the same value, hence are essentially a nop with respect to the result in xmm0. Single stepping through the code in gdb I can see that in the case with BAD not defined:
(gdb) info registers rsp
rsp            0x7fffffffa010   0x7fffffffa010
(gdb) print *(double *)(0x7fffffffa010 + 0x40)
$1 = 5000000
(gdb) print *(double *)(0x7fffffffa010 + 0x48)
$2 = 0
while with BAD defined I get
(gdb) info registers rsp
rsp            0x7fffffffa020   0x7fffffffa020
(gdb) print *(double *)(0x7fffffffa020 + 0x38)
$2 = 0
(gdb) print *(double *)(0x7fffffffa020 + 0x40)
$3 = 5000000
Thus, without BAD the code computes d as stated in the C code, while with BAD the operands of the subtraction are swapped and the code computes d directly as d=b-a. The latter rounds inexact values in the wrong direction, which in turn produces incorrect results in my application.
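A side note on the doubled xorps: assuming the constant at f86250 is the usual sign-bit mask (its contents are not shown in the listing), each xorps negates the low double, so two in a row leave xmm0 unchanged. Here is a minimal C sketch of that bit-level behaviour. The helper names toggle_sign and toggle_sign_twice are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(double) == sizeof(uint64_t), "expects a 64-bit double");

/* Flip the IEEE 754 sign bit of x. This mirrors a single xorps with a
   mask that has bit 63 set (an assumption about the compiler's constant). */
double toggle_sign(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);          /* well-defined type punning */
    bits ^= UINT64_C(0x8000000000000000);    /* toggle only the sign bit */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Two toggles in a row, as in the BAD listing: the value is unchanged. */
double toggle_sign_twice(double x)
{
    return toggle_sign(toggle_sign(x));
}
```

So the two xorps instructions cancel out, and the only effective change in the BAD variant is the swapped operand order of subsd.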
I am using
icc (ICC) 12.1.5 20120612
Copyright (C) 1985-2012 Intel Corporation. All rights reserved.
I compile with
-O -fno-builtin-strlen -fno-builtin-strcat -fno-builtin-strcmp -fno-builtin-strcpy -fno-builtin-strncat -fno-builtin-strncmp -fno-builtin-strrchr -m64 -fPIC -fno-strict-aliasing -diag-disable 1419 -w1 -Wcheck -Wall -Wmissing-declarations -Wmissing-prototypes -Wshadow -vec-report0 -fp-model strict
at the top-level of my source code.
Am I missing anything here or is this indeed an invalid optimization?
You are correct about the observed behavior. It is doing the optimization you complain about, directly computing "d = b - a".
I cannot judge at the moment if there is really an illegal optimization, particularly since you are not using parens.
Could you try using parens? This should prevent the compiler from reassociation, with the flags that you are using.
An interesting article on this subject is here:
If this does not help, I can dig into this further. A test case that reproduces your issue would help.
Thank you for looking into this. I am not sure what you mean by "using parens". Where should I put them? All my expressions have at most two operands, so I do not see where parentheses would go.
After reading the article you pointed to I am pretty sure that the compiler is not allowed to perform this optimization. I am compiling with '-fp-model strict' which (according to the documentation) implies '-fp-model precise'. For this setting the article explicitly says:
In SAFE (precise) mode, the compiler may not make any transformations that could affect the result.
Obviously the result is altered if the rounding mode is FE_DOWNWARD, so the optimization is illegal with my flags/pragmas.
I already tried to create a small test case but so far I failed. I will try again but I am not optimistic. So it would be nice if you could look into this even without a small test case.
Did you check whether -fp-model strict is having the desired effect? With so many options in the command string, some not spelled in accordance with current documentation, I would have some concerns.
At least "-fp-model strict" is processed by the compiler. If I drop that from the command line I get this error for '#pragma fenv_access(on)':
error: fenv_access cannot be enabled except in precise, source, double, and extended modes
It clearly does not have the desired effect since according to the documentation it should imply '-fp-model precise' which should in turn avoid exactly the optimization that causes my headaches here.
I have stripped the compiler options to
-O -m64 -fPIC -fno-strict-aliasing -fp-model strict
but the behavior is still the same. I have also added '#pragma float_control(precise,on)' and '#pragma fp_contract(off)' (added them one at a time and simultaneously). That did not change anything either.
I have reproduced the problem with a small test case and have escalated it to engineering. It is under investigation and I will keep you posted on the progress.
With a lot of help from Gabriele we have established the following:
- This is indeed a bug in the version of icc we use.
- The invalid optimization does not happen when compiling with -O0. For our application this is however not an option since it degrades performance significantly -- even if we refactor our code so that -O0 can be applied only to the functions that actually change the floating point rounding mode.
- The invalid optimization does not happen when compiling with -mieee-fp. Like -O0 this has a negative impact on performance but this impact is a lot smaller. If nothing else helps we may be willing to accept this small degradation.
Sorry to come back to this. I am now facing the same problem on Windows. But on Windows option -mieee-fp does not exist.
With -fp:precise or -fp:strict the problem persists (even if I put '#pragma fenv_access(on)' into the code).
Again, the problem goes away if I compile with -Od (equivalent to -O0 on Windows) but the performance degradation of this is too big. So what is the equivalent of -mieee-fp on Windows or what is the flag to use to avoid this bad optimization on Windows? I am still on version 'icc (ICC) 12.1.5 20120612'.
Thanks a lot!
I just did a sanity check of the /fltconsistency option on Windows, but it looks like it is supported only for Fortran and not for C++. So please disregard my previous comment.
Thanks and Regards
The good news is Intel has a reproducer, the bad news is djunglas does not know if this same bug appears elsewhere in his(her) code.
If the bug is localized, this may optimize and work satisfactorily:
#if 0
double const c = a - b;
double const d = -c;
#else
double const d = b - a;
#endif
double const f = d * e;
data->field += f;
// **** if c required later
double const c = -d;