Intel 11.1 results differ from Intel 10.1

scamicha · ‎06-21-2010

Hello,

I've recently moved a code from Intel 10.1.025 to Intel 11.1.038 and I'm getting different results. I'm assuming that this must be due to different default optimizations. The compiler flags I'm using are

-openmp -O3 -pad -align -fast -shared-intel -r8 -xHost -mcmodel=medium -ipo -convert little_endian

Do any of these do something different in 11.1 vs. 10.1? Also how can I show the OpenMP messages when I compile with 11.1? Thanks!

Ron_Green · ‎06-21-2010

What is included in -fast varies by version. Get rid of it; you already have all the other options you need.

With 11.1, you need -openmp-report1 to get the messages listing the regions that were successfully parallelized. In 11.1 the default is -openmp-report0 (no information), whereas in 10.1 the default was level 1.

If your goal is to get the same numbers on 10.1 as 11.1, try -fp-model precise on both 10.1 and 11.1. Internally, there are a lot of differences in the optimizer between 10.1 and 11.1.

ron

TimP · ‎06-21-2010

No, those options should have the same meaning, except that now you would set
-openmp-report1 -vec-report1
to get the same reports which were default in 10.1. Or, you could turn on -opt-report.
There's a slight possibility, since you choose aggressive optimizations, that something comes out different. I normally use -assume protect_parens -prec-div -prec-sqrt to avoid excessively aggressive optimization. Those are included in -fp-model precise, so basically I am saying the same as Ron.
As Ron said, it's probably advisable to avoid redundant options by removing -fast.

scamicha · ‎06-21-2010

I had not been compiling the code with -fp-model precise under 10.1, so I guess it was using the default of fast=1? Using -fp-model precise with 11.1, the code doesn't blow up, but the results still differ by ~5% per timestep. Since typical simulations are 1M timesteps, this is too large. I'm going to re-run the 10.1 version with -fp-model precise to see if I get the same ~5% difference. Another issue could be that, since I'm running on Westmere, 11.1 should be using SSE4.2 with -xHost, correct? Under 10.1 this isn't available. I'm unclear as to the effect this could have, though...

TimP · ‎06-21-2010

That's an interesting point: what happens if you set -xHost on a processor that is not known to that version of the compiler? If you were running the ia32 compiler and defaulted to no SSE code (-mia32 for 11.1), that could explain differences.

Izaak_Beekman · ‎06-21-2010

Keep in mind that if the code is parallel, data reduction steps can often cause differences from run to run, although 5% seems too high for that. [E.g., floating-point addition is not associative: a + (b + c) /= (a + b) + c. Nodes or processing elements finish at different times from run to run depending on system load etc., so the reduction operations occur in different orders and the answers differ.]
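To make the non-associativity point concrete, here is a minimal sketch in Python (chosen only for brevity; the same IEEE double-precision behaviour applies to Fortran REAL*8). The helper name naive_sum and the sample values are illustrative assumptions, not taken from the code under discussion.

```python
def naive_sum(values):
    """Add values strictly left to right, rounding after every addition."""
    total = 0.0
    for v in values:
        total += v
    return total

# Same three terms, grouped two ways.
a, b, c = 0.1, 0.2, 0.3
grouped_left = (a + b) + c
grouped_right = a + (b + c)

# Same three terms, two arrival orders, as if threads finished
# their partial results at different times.
order_1 = naive_sum([1e16, -1e16, 1.0])  # large terms cancel first, then 1.0 is added
order_2 = naive_sum([1e16, 1.0, -1e16])  # 1.0 is absorbed by 1e16 before the cancellation

print(grouped_left == grouped_right)
print(order_1, order_2)
```

The second pair is deliberately extreme. In a well-conditioned sum the effect of reordering is usually confined to the last few digits.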

scamicha · ‎06-21-2010

I've run this code on several different platforms (IA64, EM64T) and several different fabrics and have not seen variations of this magnitude. It seems directly linked to whatever the -openmp switch is doing. I tried compiling in serial without the -openmp switch under 11.1 and get the same results as 10.1 with -openmp. When using -openmp and -assume protect_parens -prec-div -prec-sqrt, I get the ~5%-10% difference. So what is different between the way the 10.1 compiler parallelizes OpenMP loops and the way 11.1 does?

TimP · ‎06-21-2010

If you do have sum reduction, -fp-model source and the like would stop the compiler from making vector reductions per thread before combining sums under OpenMP reduction. I would think the vector sum reduction would normally improve accuracy. If you have threaded sum reduction (not local to each thread) without declaring it correctly under OpenMP, there's no guarantee of correct results with either compiler.
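As a rough illustration of per-thread partial sums being combined at the end, the sketch below mimics a sum reduction in plain Python. It is not how the Intel OpenMP runtime is implemented. The function names, chunking scheme, and test data are all assumptions made for the example.

```python
import math

def naive_sum(values):
    """Add values strictly left to right, rounding after every addition."""
    total = 0.0
    for v in values:
        total += v
    return total

def chunked_sum(values, n_chunks):
    """Mimic a sum reduction: each 'thread' sums one contiguous chunk,
    then the partial sums are combined."""
    chunk = math.ceil(len(values) / n_chunks)
    partials = [naive_sum(values[i:i + chunk])
                for i in range(0, len(values), chunk)]
    return naive_sum(partials)

values = [1.0 / (i + 1) for i in range(100_000)]  # hypothetical test data
reference = math.fsum(values)  # accurately rounded sum, used as the yardstick

for n in (1, 4, 12):
    result = chunked_sum(values, n)
    rel_err = abs(result - reference) / reference
    print(n, result, rel_err)
```

For well-behaved data like this, changing the number of chunks moves the result only at round-off level, which is why a difference of several percent points to something other than the summation order of a correctly declared reduction.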

scamicha · ‎06-21-2010

The loops which cause the dramatic error are in fact reduction loops, and the bad variables are the reduction variables. However, adding -fp-model precise or some combination of -prec-div -prec-sqrt and -assume protect_parens fixes this and leaves the ~5%-10% error. At this point I'm not so much concerned with whether I've declared the reduction clauses correctly, but rather that the compiler does something different from one version to another. The fact that the 11.1 serial code has the same output (to the 13th decimal place) as the threaded 10.1 code, coupled with the fact that extensive test cases with analytical results have been performed with Intel 10.1, leads me to believe that the physically accurate numbers are the 10.1 numbers. It could be that these ~5%-10% differences don't make much of a physical difference overall, but that would require a 3-month test run to confirm, and I'd rather avoid that.
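For anyone wanting to quantify this kind of agreement between two builds, a comparison along the following lines could be used. It is only a sketch: the function names are made up for illustration, and the sample numbers are invented, not output from the code discussed here.

```python
import math

def max_relative_difference(run_a, run_b):
    """Largest element-wise relative difference between two result lists."""
    worst = 0.0
    for x, y in zip(run_a, run_b):
        scale = max(abs(x), abs(y))
        if scale > 0.0:
            worst = max(worst, abs(x - y) / scale)
    return worst

def matching_digits(rel_diff):
    """Approximate number of agreeing significant decimal digits."""
    return math.inf if rel_diff == 0.0 else -math.log10(rel_diff)

# Invented sample values standing in for one timestep of output.
baseline = [1.0, 2.5, -3.75]
roundoff_level = [1.0, 2.5 * (1.0 + 1e-13), -3.75]  # differs in about the 13th digit
percent_level = [1.05, 2.5, -3.75]                  # differs by roughly 5%

for name, run in (("round-off", roundoff_level), ("percent", percent_level)):
    diff = max_relative_difference(baseline, run)
    print(name, diff, matching_digits(diff))
```

Comparing per-timestep dumps from the two compilers this way would show at which step the results first diverge beyond round-off.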

scamicha · ‎06-21-2010

I tried compiling with -openmp-link static but receive relocation errors. I see that the default is dynamic in 11.1, but I can't find the option or its default in 10.1. Does anyone happen to know what it is?

TimP · ‎06-21-2010

10.1 included the choice of dynamic or static OpenMP library under -static-intel.