I'm using ICC version 2016.1.150. I ran into a problem when compiling for processors that support AVX2. When compiling with -O0 and -xcore-avx2, the code runs without problems. However, when compiling with -O1 and -xcore-avx2, the code runs but produces a wrong result.
Furthermore, when compiling with -O1 (or -O2 or -O3) and -xavx, the code produces the correct result. In other words, I think ICC is producing optimized assembly code that changes the program's execution behavior and, consequently, the final result.
Finally, I tested with GCC 4.8.2 and found no problem when compiling with -mavx2 and -O3.
Also, if it involves floating-point consistency issues, you should try the "-fimf-arch-consistency=true" option. This option will give the same results on all processors of the same architecture.
Thanks Kittur and Tim for replying to my question.
A few minutes ago I tried compiling with "-fimf-arch-consistency=true", but unfortunately the results are still the same. It is worth mentioning that -O1 does not enable auto-vectorization, so I cannot understand why the program's results change when using -xavx or -xcore-avx2, which are instruction-set options aimed at vectorization.
More specifically, my application performs several floating-point calculations. I noticed that some calculations result in not-a-number (-nan) when compiling with -xcore-avx2, while the results are correct when compiling with -xavx, or with -xcore-avx2 and -O0.
Another question: which optimization flags does -O1 enable? I read the ICC manual and did not find this information. If I knew, I could disable them one by one to find out which optimization is causing the problem.
Targeting a newer instruction set than your CPU supports is always a problem, even without vectorization. For example, AVX2 optimization with FMA applies to scalar operations as well, and would raise an illegal-instruction fault on anything up through Ivy Bridge.
I try never to set SSE4.2 when SSE4.1 runs faster on Westmere, or AVX-I, which has no advantage other than faulting when run on Sandy Bridge.
Without knowing what the test case looks like, it's hard to see what's going on. If FMAs are involved, a fused multiply-add may give a different result from separate multiply and add operations. At -O1 and above, with AVX2 enabled, the compiler will try to replace a multiply and an add with a single FMA, for example, leading to different results at different optimization levels. The only solution for such a scenario is to disable FMA explicitly.
You can try the following options and give it a shot:
1) -fp-model precise -no-fma -fimf-arch-consistency=true
2) -fp-model strict (also disables FMA, but has a wider impact)
If the above still doesn't resolve the issue, please attach a small reproducer for us to try out so we can file an issue accordingly - appreciate much.
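For concreteness, the two suggestions above would look something like this on the command line (the file names are placeholders):

```shell
# Option 1: keep AVX2 codegen, but value-safe FP semantics and no FMA contraction
icc -O2 -xcore-avx2 -fp-model precise -no-fma -fimf-arch-consistency=true app.c -o app

# Option 2: strict FP model (also disables FMA, with a wider performance impact)
icc -O2 -xcore-avx2 -fp-model strict app.c -o app
```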
Hi Kittur and Tim!
It works when compiling with -xcore-avx2 -no-fma -prec-div -prec-sqrt.
As you said, I think it is a problem related to FMA and floating-point precision.
I will test other options (-fp-model) to see whether performance improves.
Thank you very much!
Great, glad to know it resolved your issue. FYI, in the next 17.0 compiler release there'll be a single switch combining those options that you can use to achieve the same effect of disabling FMA.
It seems in my testing of 17.0 that -prec-div has come on unexpectedly. I couldn't find documentation of any change there. It's typical that beta-test features don't get documented in time.
Hi Tim, can you file a separate issue (e.g., in IPS) against the beta product for the documentation change you're referring to, so it's triaged, filed with the developers, and recorded as part of the beta - appreciate much.