Can there be a difference in parallel computation results depending on the Intel compiler version?

Jin-yongKim · 03-30-2021

Hello,

I am conducting a global circulation model (GCM) experiment related to atmospheric science.
The model I am using is AM2.0/LM2.0, developed by GFDL.

The model was set up on two servers, one with Intel compiler version 11 and the other with version 15.
However, the results of the two model experiments differ.
Usually the difference is small, but in some cases it is significant.

All other conditions are the same; only the compiler versions (and some library versions) differ.
The compiler and library versions on each server are as follows.

Intel compiler v11
openmpi 1.2.3
netcdf 3.6.3
hdf5 1.8.4

Intel compiler v15
openmpi 1.8.7
netcdf 3.6.3
hdf5 1.8.13

Can the results differ depending on the compiler version?

mecej4 · 03-30-2021

Sure. Changing the compiler, changing the OS, changing the compiler options without changing the version, or even running the same EXE at different times: all of these can cause the output results to differ. Imprecise floating-point computations, whose rounding errors depend on the order and precision of the operations, can affect the results. You can find several instances where such effects were observed; examples are available in previous forum posts.
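A minimal illustration of the effect, written in Python only because it is compact (a Python float is an IEEE 754 double on mainstream platforms): adding the same three numbers with a different grouping gives a different rounded result.

```python
# IEEE 754 double-precision addition is not associative:
# grouping the same three values differently changes the rounded result.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # 0.1 + 0.2 is rounded first
right = a + (b + c)  # 0.2 + 0.3 is rounded first

print(left, right, left == right)  # prints: 0.6000000000000001 0.6 False
```

Any change that regroups operations (compiler version, options, vector width, thread count) can produce differences of this kind.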

When several such causes exist, it is a mistake to focus on just one of them, such as the compiler version. In some instances, considerable investigative work with good tools may be necessary to uncover the causes.

It helps to keep several test cases with verified outputs, saved for comparison during later runs.
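A minimal sketch of such a comparison, assuming each run's output can be flattened into a list of floating-point values (reading the model's actual output files is not shown). The function name compare_to_reference and its default tolerance are illustrative choices, not an existing tool. A tolerance is used instead of exact equality because bit-for-bit agreement across builds is not guaranteed.

```python
import math


def compare_to_reference(result, reference, rel_tol=1e-10, abs_tol=0.0):
    """Compare one run's output values with saved, verified reference values.

    Returns a list of (index, value, reference_value) tuples for every value
    outside the tolerance; an empty list means the run matches.
    """
    if len(result) != len(reference):
        raise ValueError("result and reference have different lengths")
    mismatches = []
    for i, (value, ref) in enumerate(zip(result, reference)):
        # math.isclose passes if |value - ref| <= max(rel_tol * max(|value|, |ref|), abs_tol)
        if not math.isclose(value, ref, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((i, value, ref))
    return mismatches
```

For example, compare_to_reference([1.0, 2.0 + 1e-14, 3.1], [1.0, 2.0, 3.0]) accepts the tiny round-off difference in the second value and flags only the third, returning [(2, 3.1, 3.0)].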

Jin-yongKim · 03-31-2021

Hello mecej4,

Thank you very much for your answer.
I understood the overall content of your answer. The questions and answers in the previous post you linked were also very helpful.

First of all, I will re-check the floating-point calculations. Thank you.

jimdempseyatthecove · 03-30-2021

The default floating-point optimizations may have changed between versions 11 and 15, for instance in the trigonometric functions (sin, cos, etc.) and/or square root. Combine this with whether you compile for the x87 FPU or for the SSE/AVX/AVX-512 instruction sets. I am somewhat confident that, given the same choices with both compilers, the results will be the same (for single-threaded use).

Please note that using SSE with version 11 versus AVX/AVX2/AVX-512 with the newer compilers on newer architectures can reorder reduction sequences. In other words:

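! Scalar (source) order: one accumulator, elements added left to right
seqSum = 0.0
do i = 1, N
   seqSum = seqSum + array(i)
end do
! 4-wide vector order: one partial sum per lane, lanes combined at the end
vsSum1 = SUM(array(1:N:4))   ! elements 1, 5, 9, ...
vsSum2 = SUM(array(2:N:4))   ! elements 2, 6, 10, ...
vsSum3 = SUM(array(3:N:4))   ! elements 3, 7, 11, ...
vsSum4 = SUM(array(4:N:4))   ! elements 4, 8, 12, ...
vsSum = vsSum1 + vsSum2 + vsSum3 + vsSum4

Depending on whether, where, and when round-off errors occur, seqSum and vsSum may differ.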

While you most likely won't write code that way, a vectorized summation effectively works in that manner: the reduction runs across the width of the vector, with each lane accumulating elements one vector width apart, and the lane totals are combined at the end.
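The effect can be reproduced with a small Python sketch that simulates a 4-lane reduction (the lane logic mirrors vsSum1..vsSum4; it is a simulation, not what any particular compiler emits). The data are contrived, with one huge pair of values, so that the difference is visible; the exact sum is 3.0. With realistic data the two orders usually differ only in the last bits, and neither order is guaranteed to be the more accurate one.

```python
def sequential_sum(values):
    """Scalar order: a single accumulator, elements added left to right."""
    total = 0.0
    for v in values:
        total += v
    return total


def lane_sum(values, width=4):
    """Simulated SIMD order: lane k accumulates values[k], values[k + width], ...
    (like vsSum1..vsSum4), then the lane totals are added together."""
    lanes = [0.0] * width
    for i, v in enumerate(values):
        lanes[i % width] += v
    total = 0.0
    for lane_total in lanes:
        total += lane_total
    return total


# Contrived data: the exact sum is 3.0. Near 1e17 the spacing between
# doubles is 16, so adding 1.0 to 1e17 is lost to rounding.
data = [1e17, 1.0, 1.0, 1.0, -1e17, 0.0, 0.0, 0.0]
print(sequential_sum(data), lane_sum(data))  # prints: 0.0 3.0
```

In the sequential order each 1.0 is added to 1e17 and rounded away; in the lane order the 1e17 and -1e17 cancel inside lane 0 before the small values are combined.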

Also, a similar effect occurs with OpenMP. If you change the number of threads, the number and sizes of the partial sums change, so any round-off errors that exist may occur at different places.
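A serial Python sketch of the same idea, assuming a static-schedule-like split of the data into contiguous chunks, one per thread, with the partial sums added in thread order (real OpenMP runtimes may assign chunks and combine partial sums differently). threaded_sum is a hypothetical helper that only reproduces the order of operations; the exact sum of the contrived data is 4.0.

```python
def threaded_sum(values, nthreads):
    """Serial simulation of a static-schedule reduction: each "thread" sums
    one contiguous chunk, then the partial sums are added in thread order."""
    n = len(values)
    partials = []
    for t in range(nthreads):
        lo = t * n // nthreads
        hi = (t + 1) * n // nthreads
        partial = 0.0
        for v in values[lo:hi]:
            partial += v
        partials.append(partial)
    total = 0.0
    for p in partials:
        total += p
    return total


# Contrived data: the exact sum is 4.0.
data = [1.0, 1.0, 1.0, 1.0, 1e17, -1e17]
for nthreads in (1, 2, 3):
    print(nthreads, threaded_sum(data, nthreads))
# prints:
# 1 0.0
# 2 3.0
# 3 4.0
```

The same data give three different answers purely because the chunk boundaries move with the thread count.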

Try adding the option /fp:precise or /fp:strict (on Linux: -fp-model precise or -fp-model strict).

Other options:

/Qprec
improve floating-point precision (speed impact less than /Op)

/Qprec-sqrt[-]
determine if certain square root optimizations are enabled

/[no]fltconsistency
specify that improved floating-point consistency should be used

/Qprec-div[-]
improve precision of FP divides (some speed impact)

/Qfast-transcendentals[-]
generate a faster version of the transcendental functions

One of the first tests (experiments) to do:

Use /Qx<code> on the newer compiler, where <code> is an instruction set (SSE2, SSE3, ..., AVX, ...) that was supported by the older compiler.

Compile a debug build with OpenMP stubs (or with OpenMP disabled, or with the number of threads set to 1).

Build with both compilers so that you can reasonably ensure the same instruction sets (and the same floating-point methods) are used. Then compare the results.

Note that very old "gold" test-result data files may have been generated using the x87 FPU rather than the SIMD instruction sets. The internal precision of the FPU instruction set is higher than that of the SIMD instruction sets: 80-bit versus 32-bit or 64-bit, as the case may be.
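Python has no 80-bit floating-point type, so this sketch uses 32-bit versus 64-bit accumulation as a stand-in for the same effect: the running total is rounded to the accumulator's precision after every addition, and the lower-precision accumulator silently drops each small addend. The helpers to_float32 and accumulate are illustrative, not from any library; struct is Python's standard module.

```python
import struct


def to_float32(x):
    """Round a Python float (a 64-bit double) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(values, round_fn=lambda x: x):
    """Add values left to right, rounding each input and the running total
    with round_fn to model an accumulator of that precision."""
    total = 0.0
    for v in values:
        total = round_fn(total + round_fn(v))
    return total


values = [1.0] + [1e-8] * 1000
sum64 = accumulate(values)              # 64-bit accumulator: close to 1.00001
sum32 = accumulate(values, to_float32)  # 32-bit accumulator: stays at exactly 1.0
print(sum64, sum32)
```

A 32-bit float near 1.0 has a spacing of about 1.2e-7, so each 1e-8 addend rounds away; the 64-bit total keeps them. The same mechanism, one level up, separates 80-bit x87 results from 64-bit SIMD results.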

If you obtain the same or acceptable results, then gradually enable more of each compiler version's capabilities, noting where changes occur.
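While narrowing down where changes occur, it can help to locate the first output step at which two builds disagree. A minimal sketch, assuming each run's output has already been read into per-step records of floats (for example, a few global-mean diagnostics per time step). first_divergence is a hypothetical helper and its tolerance is an arbitrary default.

```python
import math


def first_divergence(run_a, run_b, rel_tol=1e-12):
    """Return the index of the first step at which two runs differ beyond rel_tol.

    run_a and run_b are sequences of per-step records, each record a sequence
    of floats. Returns None if every compared step agrees. zip() stops at the
    shorter run, so a run that ended early is only compared up to its last step.
    """
    for step, (rec_a, rec_b) in enumerate(zip(run_a, run_b)):
        for x, y in zip(rec_a, rec_b):
            if not math.isclose(x, y, rel_tol=rel_tol):
                return step
    return None


build_a = [[1.0, 2.0], [1.5, 2.5], [2.0, 3.0]]
build_b = [[1.0, 2.0], [1.5, 2.5 + 1e-9], [2.0, 3.1]]
print(first_divergence(build_a, build_b))  # prints: 1
```

Repeating the comparison after each change of compiler option shows which option first moves the divergence point.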

Note that you can use different optimization settings for different source files. Often only one or two source files turn out to be problematic.

Jim Dempsey

gib · 03-30-2021

Wasn't chaos theory kicked off by Edward Lorenz, when he found his climate model results were exquisitely sensitive to initial conditions? https://en.wikipedia.org/wiki/Chaos_theory

It comes as no surprise that your model could be very sensitive to small variations in the results of some computations.
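The sensitivity described above can be reproduced with the Lorenz (1963) equations, a three-variable toy system rather than a GCM, using the classic parameters sigma = 10, rho = 28, beta = 8/3. This sketch integrates two runs with fourth-order Runge-Kutta whose initial x values differ by 1e-10, a stand-in for a tiny difference between two builds, and records how far apart they drift. The step size and run length are arbitrary choices.

```python
def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz (1963) equations."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def rk4_step(state, dt):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    def add(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))

    k1 = lorenz(state)
    k2 = lorenz(add(state, k1, dt / 2))
    k3 = lorenz(add(state, k2, dt / 2))
    k4 = lorenz(add(state, k3, dt))
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))


def separation_history(perturbation=1e-10, dt=0.01, steps=5000):
    """Integrate two runs whose initial x differs by `perturbation` and
    return the Euclidean distance between them after each step."""
    a = (1.0, 1.0, 1.0)
    b = (1.0 + perturbation, 1.0, 1.0)
    history = []
    for _ in range(steps):
        a, b = rk4_step(a, dt), rk4_step(b, dt)
        history.append(sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5)
    return history


seps = separation_history()
print(f"after 1 step: {seps[0]:.2e}, largest: {max(seps):.2f}")
```

The separation starts near 1e-10 and eventually grows to the same order as the state values themselves, so the two runs end up completely decorrelated even though the equations and numerics are identical.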

jimdempseyatthecove · 03-31-2021

>>Wasn't chaos theory kicked off by Edward Lorenz, when he found his climate model results were exquisitely sensitive to initial conditions?

aka The Butterfly Effect

A second cause of this behavior is a poorly written convergence routine, where any small difference in round-off error significantly changes the number of iterations the routine takes. In some cases this shows up as a failure to converge.
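One common way to make an iterative routine less fragile: an absolute tolerance smaller than the spacing of floating-point numbers near the solution may never be met, so the iteration count then depends on round-off. Scaling the stopping test to the size of the solution avoids that, and raising an error makes a failure to converge visible. A minimal Python sketch using Newton's method; the function name, tolerance, and iteration limit are illustrative assumptions, and this does not remove genuine sensitivity in the model itself.

```python
import math


def newton(f, dfdx, x0, rel_tol=1e-12, max_iter=50):
    """Newton iteration with a scaled stopping test and an iteration cap.

    Returns (root, iterations). Raises RuntimeError instead of silently
    returning the last iterate if it does not converge within max_iter.
    """
    x = x0
    for i in range(1, max_iter + 1):
        step = f(x) / dfdx(x)
        x -= step
        # Stopping test scaled by |x| (floored at 1.0 so it stays meaningful
        # near zero), so it remains reachable despite round-off.
        if abs(step) <= rel_tol * max(abs(x), 1.0):
            return x, i
    raise RuntimeError(f"no convergence after {max_iter} iterations (last x = {x})")


root, iters = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
print(root, iters, abs(root - math.sqrt(2.0)))
```

Logging the returned iteration count in both builds is a cheap way to spot a routine whose iteration count changes between compilers.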

Also, be mindful that if the (large?) discrepancy is due to round-off errors (as opposed to precision differences that can be corrected with compiler options), then this should cast doubt on the accuracy of your assumed-correct results files.

Jim Dempsey

Jin-yongKim · 03-31-2021

Hello, Gib.

Thank you for your kind reply.

As you said, if the initial conditions change, the results of the experiment can change. I'll check the floating-point calculations first.

Thank you again for the helpful reminder.