I am running a simulation using a Fortran codebase which is claimed to be entirely Fortran-standard compliant and fully deterministic (it does use random numbers, but the seed is initialized to a fixed value for reproducibility). When the code is compiled in debug mode (no optimization) and in release mode (with almost all of the optimization flags on), tiny differences occasionally appear in the simulation results, about once every few hundred steps. These differences are mostly in the last printed digits of the output real numbers (all real numbers in the simulation are declared as real64). Whether the output is written with 4 or 8 significant digits seems to be irrelevant, so it seems like the issue is related to the rounding of real numbers at the time of I/O.
My broader question: in general, is it reasonable to expect exactly the same output from a deterministic simulation written in standard-compliant Fortran and compiled with different ifort optimization levels switched on or off? Or does any difference between release and debug modes indicate some non-compliance with the Fortran standard or some hidden bug in the code?
These are the release mode ifort flags used:
/fast /O3 /Qip /Qipo /Qunroll /Qunroll-aggressive
and these are the debug mode ifort flags used:
/debug:full /Zi /CB /Od /Qinit:snan,arrays /warn:all /gen-interfaces /traceback /check:all /check:bounds /fpe-all:0 /Qdiag-error-limit:10 /Qtrapuv
It is not reasonable to expect exactly reproducible results.
The nature of floating point operations means that the order of execution matters, and the order of execution is something that may be changed by optimization.
From a standard perspective, within a statement a processor can replace an expression by something that is mathematically equivalent (subject to honoring parentheses) - but evaluating a mathematically equivalent expression with floating point arithmetic may not give the same result.
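To illustrate the point about mathematically equivalent expressions, here is a minimal sketch (the program name and the values are just made up for illustration). It evaluates the same three-term sum with two different groupings. With IEEE double precision arithmetic the two results typically differ in the last digit, although an aggressive floating point model may change what you actually see.

```fortran
program assoc_demo
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  real(real64) :: a, b, c, left, right

  ! Illustrative values only; none of them is exactly representable in binary.
  a = 0.1_real64
  b = 0.2_real64
  c = 0.3_real64

  ! Mathematically identical, but each grouping rounds at a different point.
  left  = (a + b) + c
  right = a + (b + c)

  print '(a, es24.17)', '(a + b) + c = ', left
  print '(a, es24.17)', 'a + (b + c) = ', right
  print '(a, l1)',      'identical   : ', left == right
end program assoc_demo
```

Printing with 17 significant digits is deliberate: with fewer digits the two values can look identical on output even when they differ internally.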
Specifically with your compile options, /fast enables floating point calculations that trade off precision for increased speed.
You could have a hidden bug in your code too!
In addition to IanH's comments, consider that in a Debug build (/Od) the generated code will use scalar operations on sections of code that could be vectorized, whereas in a Release build (/O3) those sections of code will be vectorized. Consider the case of a CPU with AVX2 and real64 performing a sum of an array. In scalar mode the operation is:
sum = A(1) + A(2) + ... + A(N) ! left to right, one after the other
In vector mode:
sum_lane_0 = A(1) + A(5) + ... + A(N-3) ! this line and the following 3 lines occur in the same instruction(s)
sum_lane_1 = A(2) + A(6) + ... + A(N-2)
sum_lane_2 = A(3) + A(7) + ... + A(N-1)
sum_lane_3 = A(4) + A(8) + ... + A(N)
sum = (sum_lane_0 + sum_lane_1) + (sum_lane_2 + sum_lane_3) ! horizontal sum of lanes
Each of the lanes will (may) experience round off errors at different points of the array than those produced in the scalar summation of the same array.
IOW while the summations are reproducible with multiple runs within the same build, the results may not necessarily be the same between builds.
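The two summation orders can be emulated by hand. The following is only a sketch under some assumptions: the array size and contents are hypothetical, N is chosen as a multiple of 4, and the "lanes" are an ordinary 4-element array rather than real AVX2 registers. It is meant to show the ordering effect, not what the compiler literally generates.

```fortran
program lane_sum_demo
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer, parameter :: n = 1000000      ! hypothetical size, a multiple of 4
  real(real64), allocatable :: a(:)
  real(real64) :: s_scalar, s_lanes, lane(4)
  integer :: i

  ! Hypothetical data with widely varying magnitudes.
  allocate(a(n))
  do i = 1, n
     a(i) = 1.0_real64 / real(i, real64)
  end do

  ! Scalar order: left to right, one element after the other.
  s_scalar = 0.0_real64
  do i = 1, n
     s_scalar = s_scalar + a(i)
  end do

  ! Emulated vector order: four independent partial sums, one per lane.
  lane = 0.0_real64
  do i = 1, n, 4
     lane = lane + a(i:i+3)
  end do

  ! Horizontal sum of the lanes.
  s_lanes = (lane(1) + lane(2)) + (lane(3) + lane(4))

  print '(a, es24.17)', 'scalar order : ', s_scalar
  print '(a, es24.17)', 'lane order   : ', s_lanes
  print '(a, es10.3)',  'difference   : ', s_scalar - s_lanes
end program lane_sum_demo
```

Note that with optimization enabled the compiler may vectorize the "scalar" loop as well, so to see the two orders side by side it is probably best to build this with /Od.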
Additionally, the precision of intrinsic functions such as sqrt, sin, cos varies depending on the optimization options selected.
See Dr Fortran's paper Improving Numerical Reproducibility in C/C++/Fortran
Its message - "The Three Objectives • Accuracy • Reproducibility • Performance Pick two" - is generally useful to keep in mind.
Thank you all for your great responses, and for the slides by FortranFan, which were very helpful. For future reference, I also include a response by Tim, who, for some reason, could not post on the new forum and had to respond via direct message:
The forum is not permitting me to log in normally. Even if reproducibility is not one of your goals, /fast is a dangerous option that I would never use. For reproducibility, you would turn off the optimizations that are known to cause variation in results by setting /fp:source (precise is the same for ifort). The one part of /fast which you would want for performance is /QxHost. That said, your results are so close that any optimization could make the difference.
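Putting Tim's suggestions together, a release build line might look something like the following. This is only a sketch: sim.f90 is a placeholder file name, and /O3 is carried over from the original release flags rather than something Tim recommended.

```
ifort /O3 /QxHost /fp:source sim.f90
```

This drops /fast but keeps the host-specific instruction set it would have enabled. Whether it is enough to make the debug and release outputs match exactly would have to be checked on the actual code.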