Intel® Fortran Compiler

Intel 15 vs. 16 and imf: Consistency between processors

Matt_Thompson
Novice

O Fortran Gurus,

A model I work on (GEOS-5) has a standing policy that we should get zero-diff results between processor types (say, Haswell and Sandy Bridge). From roughly Intel 11 through Intel 15.0.2, this was accomplished with (approximately):

-O3 -qopt-report0 -ftz -align all -fno-alias -traceback -fpe0 -fp-model source

Obviously not all of those flags matter here; I just copied and pasted our full set. Probably -fp-model source is the magic sauce.

But with Intel 16.0.0, on some regression tests I run, SLURM put a job on our Sandy Bridge nodes instead of Haswell and, hello, non-zero-diff! So I started adding/changing flags: -fp-model strict, -assume protect_parens, setenv MKL_CBWR (a long shot, since the model uses no BLAS or LAPACK, but...), etc. Finally, I found that:

-fimf-arch-consistency=true

added to our FOPT flags got it working and zero-diff again.
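For anyone who wants to test this sort of thing themselves, a minimal bit-level check looks something like the sketch below (the file and program names are just placeholders; decimal printing can hide last-bit differences, so print the raw bits):

! repro_check.f90 -- print a math-library result bit-for-bit so runs on
! different node types (e.g., Haswell vs. Sandy Bridge) can be diffed exactly.
program repro_check
  use iso_fortran_env, only: real64, int64
  implicit none
  real(real64) :: x, y
  x = 0.7391_real64
  y = exp(x)                ! libimf result under test
  ! Z16.16 prints the full 64-bit pattern as hex
  print '(a, f22.17, a, z16.16)', 'exp(x) = ', y, '  bits = ', transfer(y, 0_int64)
end program repro_check

Build it with the same flags on both node types and diff the output; if -fimf-arch-consistency=true is doing its job, the bits should match.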

My question now is for my own benefit: does anyone know where good release notes or a changelog might explain why the IMF behavior changed between Intel 15 and 16? I looked at the Release Notes for Intel 16 but don't see anything about "math" or "imf" in them. (A simple nm on libimf.a definitely shows a difference!)

Again, now that I have a solution, the practical part of me is done with this; it's the inquisitive part that is triggered now.

Steven_L_Intel1
Employee

Please read the attached presentation (Improving Numerical Reproducibility in C++ and Fortran.pdf). As you discovered, changing the processor type will cause the math library to take different code paths. Also, version-to-version improvements in accuracy or performance can cause small differences.

Matt_Thompson
Novice

Steve Lionel (Intel) wrote:

Please read the attached presentation. As you discovered, changing the processor type will cause the math library to take different code paths. Also, version-to-version improvements in accuracy or performance can cause small differences. Download Improving Numerical Reproducibility in C++ and Fortran.pdf

Steve,

Indeed, that presentation is where I found out about the flag! I was just wondering whether Intel documents what changes between versions of IMF. As I said, Intel 15 is zero-diff between Haswell and Sandy Bridge, so I'm guessing that for Intel 16 someone went in and added some CORE-AVX2 optimizations to... exp or something?

Steven_L_Intel1
Employee

Possibly. I don't have visibility into the changes made in libimf, but I do know that they are continually looking at optimizations for newer processors, new algorithms, and improved accuracy. You should expect changes like that with any release, or even an update. Changes in optimization across updates and versions can also cause small differences.

Matt_Thompson
Novice

I hate to revive a zombie thread, but after upgrading to Intel Fortran 16.0.2, I am now seeing differences between Sandy Bridge and Haswell/Broadwell again, even with

-fimf-arch-consistency=true

Does anyone know what might have changed? I'm going to try various flags again to see if I can pick it up.

TimP
Honored Contributor III

Just in case, I'll point out that
-fimf-arch-consistency=true

works by compiling each source file to call a special set of internal entry points in the math libraries. You should be able to see this by running nm on the compiled objects. Any compilation unit that didn't use that option will call the dispatching math functions, which switch at run time according to the detected hardware platform. I don't know whether all objects need to be rebuilt when changing compiler versions.
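A quick way to see this for yourself (a rough sketch; the exact symbol names are libimf internals and vary by version, so just compare the two symbol lists):

! expcheck.f90 -- a minimal unit whose object file can be inspected with nm
! to see which math-library entry points the compiler emits:
!   ifort -c expcheck.f90 -o plain.o
!   ifort -c -fimf-arch-consistency=true expcheck.f90 -o consistent.o
!   nm plain.o; nm consistent.o    (compare the undefined "U" symbols)
function expcheck(x) result(y)
  use iso_fortran_env, only: real64
  implicit none
  real(real64), intent(in) :: x
  real(real64) :: y
  y = exp(x)
end function expcheck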

Steven_L_Intel1
Employee

Go back and reread the presentation I posted in reply #2. It's pointless to ask "what changed".

Matt_Thompson
Novice

Steve Lionel (Intel) wrote:

Go back and reread the presentation I posted in reply #2. It's pointless to ask "what changed".

Steve,

I asked because if I compile with Intel Fortran 16.0.0.109 and run on Haswell processors, and I then change *only* the compiler to Intel 16.0.2.181, I get *different* results. On Sandy Bridge the results are exactly the same between the two compilers, but on Haswell (and, I suspect, Broadwell), Intel Fortran 16.0.2.181 gives me different results.

I tried using options:

-fimf-arch-consistency=true -no-fma -fp-model precise -fp-model source

but I still see a difference. That is why I was wondering whether Intel Fortran 16.0.2.181 handles any of the consistency options differently than Intel Fortran 16.0.0.109 did.

My apologies for asking about this. I'm sure it is my fault and I will go through your presentation and the "Consistency of Floating-Point Results Using the Intel® Compiler" document as well. 

Again, apologies for taking your time,

Matt

 

Steven_L_Intel1
Employee

No apologies necessary. Note that -fimf-arch-consistency relates to math library functions only. What other options are you using? 

How "different" are the results?

Matt_Thompson
Novice

Steve,

In my latest test I used:

-g -O0 -ftz -align all -fno-alias -traceback -debug 
-nolib-inline -fno-inline-functions -assume protect_parens,minus0 
-prec-div -prec-sqrt -check bounds -check uninit -fp-stack-check 
-ftrapuv -warn unused -traceback     -convert big_endian  -fPIC 
-fpe0 -fp-model strict -fp-model source -no-fma -fimf-arch-consistency=true
 -fimf-precision=high -heap-arrays 32  -align dcommons

As you can see, this is sort of a mélange of all the debugging flags we run, plus every "be strict" flag I could think of setting. I was hoping to stop the compiler from doing anything clever, if possible.

The change seems to exhibit itself in our REAL64 dynamics code. For example, on a Sandy Bridge, I see an area diagnostic print:

  Global Area=   510064471909928.

while on Haswell/Broadwell:

  Global Area=   510064471909927.

The "ends-in-8" value of Sandy Bridge is what Haswell sees using 16.0.0.109. Obviously, it isn't a big difference that is happening, but after one time step, we essentially see a slightly different atmosphere and, boom, climate is different. This is then reflected in the checkpoint file for dynamics:

462578c462578
<     20813.7862929133, 20816.5744918611, 20817.6823877241, 20819.9651360954, 
---
>     20813.7862929133, 20816.5744918611, 20817.6823877242, 20819.9651360954, 

Again, it's the last place of a value (in this case, pressure at a level edge), and that's enough to trigger a difference (this is one of ~700000 differences after one step).
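Out of curiosity, one can count how far apart those two printed areas are in units in the last place. Here's a little sketch (it uses only the printed, already-rounded decimals, so the true binary gap may well be smaller):

! ulps.f90 -- count representable doubles between the two printed areas.
! For finite doubles of the same sign, the difference of their raw bit
! patterns (viewed as int64) is the number of representable steps between them.
program ulps
  use iso_fortran_env, only: real64, int64
  implicit none
  real(real64) :: a, b
  a = 510064471909928.0_real64   ! Sandy Bridge print
  b = 510064471909927.0_real64   ! Haswell/Broadwell print
  print *, 'spacing at this magnitude:', spacing(a)
  print *, 'steps apart:', transfer(a, 0_int64) - transfer(b, 0_int64)
end program ulps

At this magnitude one ulp is 0.0625, so the printed values are 16 representable doubles apart; the unrounded results are likely much closer than that.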

I've been able to show this happens with both Intel MPI and SGI MPT, so it is not MPI-related at all (i.e., Haswell + MPT == Haswell + IMPI).

Can you think of other flags to try?

 

Steven_L_Intel1
Employee

Ah, the 1E-14 butterfly effect. Given all the switches you're using, I can't think of additional options that might help. What I usually do in such situations is dump out intermediate calculations, identify WHEN things first differ, and narrow the search until I find the computation that gave a different answer. Then I have to single-step through the instructions and see what was different.

I assume it's the same executable on both systems? Despite all the options you've thrown at it, there might be an uninitialized memory location being referenced that changes things. It's also possible that the two processors give different answers for a particular instruction, though we do VERY rigorous testing to make sure that doesn't happen. If it did, we would want to know about it (with a test case).

But you really don't know which of those two answers is correct, do you? If you have an unstable algorithm that gives very different results based on a 1LSB difference in a single computation, how can you trust the answers you're getting?

Matt_Thompson
Novice

Steve Lionel (Intel) wrote:

Ah, the 1E-14 butterfly effect. 

But you really don't know which of those two answers is correct, do you? If you have an unstable algorithm that gives very different results based on a 1LSB difference in a single computation, how can you trust the answers you're getting?

Butterfly effect, indeed. The code is a climate model, so you can change an A+B to B+A in one minor subprogram and instantly get different results. Are they vastly different? No. But thanks to chaotic behavior, they are not zero-diff. In truth, both results are probably "correct," in that each compiler is just giving us its version of the math. (In the above example, the Global Area is the surface area of the Earth; the two compilers are 1 m² apart.)
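(For anyone following along: IEEE addition is commutative, but in a longer sum, swapping two terms changes how the sum associates and hence which roundings happen. A tiny illustration:)

! reorder.f90 -- summation order changes rounding: (a+b)+c /= a+(b+c)
program reorder
  use iso_fortran_env, only: real64
  implicit none
  real(real64) :: a, b, c
  a = 1.0e16_real64
  b = -1.0e16_real64
  c = 1.0_real64
  print *, '(a+b)+c =', (a + b) + c   ! cancellation happens first: 1.0
  print *, 'a+(b+c) =', a + (b + c)   ! c is absorbed by rounding: 0.0
end program reorder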

We've striven to keep zero-diff results between chipsets. We have no expectation that different compilers, or even different MPI stacks, will be identical to each other, but we've usually been able to get zero-diff results between processor types with the same compiler/MPI-stack combination. Perhaps we just can't assume that anymore.

Still, I'm trying to see if I can isolate just the bits of code necessary to run to the point where that Global Area is calculated. If so, that could hopefully serve as a tester.

I'm also going to run a set of tests with 16.0.0, 16.0.1, and 16.0.2 to see exactly where the Intel 16 behavior started.

jimdempseyatthecove
Honored Contributor III

In a private email with Clay Breshears, I suggested that Intel could produce a useful diagnostic tool by modifying MPI a small amount, so that the programmer could launch two processes running the same program, compiled both ways or pinned to specific instruction sets and/or optimizations. Then, while the two processes run on the same or different CPUs/systems, the code could proceed to synchronization points (either explicit or automatic) at which differences could be detected.

Essentially this would "automate" the lengthy process of a programmer running side-by-side debugging sessions.

As an added feature, two (or more) instances of the same build could be launched on each CPU/system, with one instance running one gross synchronization point behind in time, permitting the programmer to unwind the application by one synchronization point and single-step from there. For situations like Matt's, the problem point could be located in only the time it takes for the program to run up to the point of divergence. The run would take longer due to the consistency checks, but it would be automated, and Matt could be doing something else.

Many times I have had to resort to side-by-side debugging and would have loved this tool.
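A rough sketch of the idea, assuming two ranks of the same executable (pinned to different node types by the scheduler) and a hypothetical advance() standing in for the real model step:

! lockstep.f90 -- each rank checksums its state at every synchronization
! point and stops at the first step where the two checksums disagree.
program lockstep
  use mpi
  use iso_fortran_env, only: real64, int64
  implicit none
  integer :: ierr, rank, step
  integer :: status(MPI_STATUS_SIZE)
  real(real64) :: state(1000)
  integer(int64) :: mine, theirs

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  state = 0.5_real64
  do step = 1, 10000
     call advance(state)            ! hypothetical model step under test
     mine = checksum(state)
     call MPI_Sendrecv(mine, 1, MPI_INTEGER8, 1 - rank, step, &
                       theirs, 1, MPI_INTEGER8, 1 - rank, step, &
                       MPI_COMM_WORLD, status, ierr)
     if (mine /= theirs) then
        if (rank == 0) print *, 'first divergence at step', step
        exit
     end if
  end do
  call MPI_Finalize(ierr)
contains
  subroutine advance(x)             ! stand-in for the real computation
    real(real64), intent(inout) :: x(:)
    x = x + 1.0e-6_real64 * exp(-x)
  end subroutine advance
  integer(int64) function checksum(x)
    real(real64), intent(in) :: x(:)
    integer :: i
    checksum = 0_int64
    do i = 1, size(x)               ! XOR of raw bit patterns, order-independent
       checksum = ieor(checksum, transfer(x(i), 0_int64))
    end do
  end function checksum
end program lockstep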

Jim Dempsey

Matt_Thompson
Novice

thematt wrote:

I'm also going to try and do an 16.0.0, 16.0.1, and 16.0.2 set of tests to see exactly where the Intel 16 behavior started.

It does seem to be a change between 16.0.0 and 16.0.1. I can run with all the same MPI stacks, etc., where the only difference is the Fortran compiler module. I'm working now on a possible reproducer. It should exercise the code that shows the difference in our full model, but without needing gigabytes of boundary conditions. :)
