/O1 /O3 Optimisations and OpenMP

NSwindale · ‎08-11-2021

I have been using OpenMP to multi-thread an application, taking care to make routines threadsafe by making local variables automatic and allocating any larger arrays to be used by the subroutines. I have a flag which can turn off the multi-threaded routines for testing. The non-threaded option gives identical results each time it is run from the same initial conditions, as it should, but the multi-threaded routines give slightly different results each time they are run. Turning the compiler optimisations off (/Od) makes this problem go away (and slows down the code substantially) - the threaded version then gives identical results each time and the same as the non-threaded version.

Is there some incompatibility between /O1 and higher options and OpenMP or is there something else I should be doing to make my routines thread-safe?

I have been using Intel Fortran Parallel Studio XE 2019 and Visual Studio 2017

Thanks for any advice,

Nick S.

Steve_Lionel · ‎08-11-2021

Incompatibility? No. If your code computes things in parallel and then combines them (such as sum), the order in which the operations are performed can change results slightly (or majorly, if you get cancellation errors.) The behavior you note is fairly typical.

NSwindale · ‎08-11-2021

Thanks. It is good to know that the optimisations should not be incompatible with Open MP. I have no idea what they do, so can't judge myself. I had not thought about the point that the order of operation with floating point numbers can make a difference. However, I am fairly sure my threads are not combining to calculate floating point values. Also, turning off the optimisations does make the results repeatable so I am still puzzled.

I cannot rule out that I do not have some thread-unsafe code somewhere or that I am doing something that the optimisations are assuming will not happen.

jimdempseyatthecove · ‎08-12-2021

>> I am fairly sure my threads are not combining to calculate floating point values.

This is seldom the case. IOW the preponderance of code contain reductions (sums, averages, variances, etc...) and the order in which the operations occur can vary the results. Floating point has a finite number of bits for precision. When numbers of different magnitudes (different exponents) are added, one of the (binary) values is shifted to align the exponents prior to addition. This causes a (potential) loss in bits (precision). In cases of non-integral values and/or fractions that are inexact and/or shift positions shift out non-zero significant bits, you encounter a loss in precision in the result.

You must consider results as an approximation. And you can estimate the error if you wish to go to the trouble of doing so.

In addition to order of reduction changing the resultant approximation you also may have sections of your code (that you parallelized) that have order dependencies. Example:

A(I) = (A(I-1)+A(I)) / 2.0

When, say, two threads split the iteration space, the starting point of the second partition will likely use the pre-updated value of its first A(I-1).

I would suggest that you examine the loops you parallelized to see if at start or end of loop that they depend on the results of the preceding or following partition. If so, then you either have to forgo parallization or add cleanup code to propagate carries or borrows across partitions.

Jim Dempsey

NSwindale · ‎08-12-2021

Thanks - I am pretty sure that in this case the threads are working independently - if they were writing to common addresses I would have to know about it (presumably?) but I don't think that is happening. Also, at least in the past, the threaded and unthreaded versions have given identical results each time - based on tests that should be sensitive to small differences. I do take the point about order and floating point results. The fact that your calculations are reproducible each time does not mean that they are perfectly accurate. Computations no longer being determinate seems a bit of an issue but a separate topic.

My main goal is to rule out things, like the optimisation level, that might conceivably have an impact - this seems unlikely based on Steve Lionel's comment but maybe they could have an effect if something else was wrong. It could still be the case that something is making the threads not work independently due to careless programming on my part, and indeed the problem does seem localisable to one of the threaded routines (the tests suggested by Steve Lionel). If that routine is set to be unthreaded the problem goes away. I'm still trying to figure out what might be wrong with the routine in question. It is certainly helpful to get this feedback.

Steve_Lionel · ‎08-11-2021

Always possible. I don't know your code so all I can suggest is to dump intermediate results and see where they start to diverge. Unfortunately, adding such code is likely to change or hide the issue. You can also selectively turn optimization on and off for a specific source file (assuming multiple files) and see what changes.

jimdempseyatthecove · ‎08-12-2021

>>if they were writing to common addresses I would have to know about it (presumably?)

The example of of loop order variant code was presented above where each thread writes to different slices of the output array...

*** however *** the time state of an input variable is out of phase.

Assume thread one iterates 1:100 and thread two iterates 101:200

Thread two can read A(100), an output of thread 1 ... PRIOR to thread one updating A(100).

A similar situation could occur at any boundary of slices when the slice update is dependent on adjacent slice values in a different time state than that as would be expected in the serial implementation.

Unoptimized code is generally performed using scalar instruction, Whereas optimized code may be vectorized. As an example, a summation of an array in scalar, indexed 1:N would sum in index order of 1, 2, 3,...,N

Whereas on a system with 4-wide SIMD vectors it is performed in two steps.

First as vector wide summation:

lane 0 having sum of A(I) at indexes 1, 5, 9, ...

Lane 1 having sum of A(I) at indexes 2,6,10,...

Lane 2 having sum of A(I) at indexes 3,7,11...

Lane 3 having sum of A(I) at indexes 4,8,12,...

Then as step two the lanes will be summed together.

Should the summation of the array contain round off errors, then the order in which the summation is made can exhibit different approximations of the true summation.

>>My main goal is to rule out things, like the optimisation level, that might conceivably have an impact

If (when) the results are not within a margin of error, .AND. when the /O0 and/or serial code produces what you deem correct, then a good practice is to incrementally introduce optimizations (then parallelization) on a source file by source file basis. This will generally isolate it discrepancy to one or a few source files. You can then examine the code further as well as add instrumentation to determine if the error is:

a) benign, b) programming error, c) compiler code generation error

Jim Dempsey

jimdempseyatthecove · ‎08-12-2021

One more area where little differences can cause large differences is in poorly written convergence code where a small difference in value being converged to, results in a different (and some times much larger or smaller) number of iterations.

Jim Dempsey

NSwindale · ‎08-18-2021

Thanks. There are indeed converging loops in the section of code that seems to be causing problems, however the different threads are working independently, as elsewhere. They don't ever write to shared addresses, or are not meant to, or read from addresses that have been written to by other threads. Think of doing image analysis on several different images in parallel. I have just verified again that /O1 and /O3 a) give slightly different results, also b) that each option gives reproducible (consistent) results on its own if just one subroutine (out of several) is de-parallelized. If this subroutine is parallelised then /O3 gives different results each time and /O1 gives consistent results.

The fact that /O1 gives consistent results suggests I do not have the kinds of issues you were referring to - and I can't see that I do.

However perhaps there is still something I am doing wrong that only shows up if /O3 is used. I have been making sure that all local variables and arrays are declared as automatic and have been taking care to declare variables as private or shared within the parallelised loops as seems appropriate.

I appreciate there is only so much advice one can ask for.

Thanks again for the feedback,

Nick

NSwindale · ‎08-18-2021

And I have also checked that there are not shared variables within parallelized loops where the order of operations could affect the outcome.

Ron_Green · ‎08-18-2021

You can read the attached from slide 3. And refer to the articles at the end for reference.

Also, this deck is old. There is a env var that may help
export KMP_DETERMINISTIC_REDUCTION=TRUE

https://software.intel.com/content/www/us/en/develop/documentation/fortran-compiler-oneapi-dev-guide-and-reference/top/compilation/supported-environment-variables.html

used in conjunction with compiler option -fp-model precise.

But read through this and research.

jimdempseyatthecove · ‎08-19-2021

Nick, read the description for KMP_DETERMINISTIC_REDUCTION=TRUE. This will not (necessarily) make the reductions the same between serial and parallel, as well as parallel with different numbers of threads, as well as non-static scheduled parallel loops.

>>b) that each option gives reproducible (consistent) results on its own if just one subroutine (out of several) is de-parallelized.

By this do you mean one specific subroutine .OR. do you mean when each (one at a time) subroutine is de-parallelized?

Hopefully it is the specific subroutine as this isolates the problem (behavior).

First step is to make a run using /O3 but with OpenMP disabled.

a) If this produces consistent results with /O1 serial then the issue is with the OpenMP parallel region(s).

b) If this produces inconsistent results with /O1 serial .AND. these results are repeatable then the issue is likely due to SIMD reductions. Note, these results are (likely) as good as the /O1 results (differ only in round off accumulation differences).

c) OpenMP + /O3 + KMP_DETERMINISTIC_REDUCTION=TRUE if this produces repeatable results but differ from /O1 serial then this is likely due to either/both SIMD reductions or reductions of partial results from threads at end of parallel loop.

If the problem is with a single subroutine, can you post it here for us to review?

Jim Dempsey

jimdempseyatthecove · ‎08-19-2021

example of reduction difference depending on order of reduction (the following compiled /Od)

!  ReductionDifferences.f90 
program ReductionDifferences
    implicit none
    integer, parameter :: N=2000
    real(8) :: array(N)
    integer :: I
    real(8) :: sum
    do I=1,N
        array(I) =  1.1_8**I
    end do
    sum = 0.0_8
    do I=1,N
        sum = sum + array(I)
    end do
    print *, "Low papa high-um ", sum
    sum = 0.0_8
    do I=N,1,-1
        sum = sum + array(I)
    end do
    print *, "High papa low-hum", sum
    
end program ReductionDifferences

Results:

 Low papa high-um   6.710625481394731E+083
 High papa low-hum  6.710625481394721E+083

While this difference is small, you can see where an ill written convergence routine could take a different number of iterations or fail to converge (or take an exit from the convergence loop).

Jim Dempsey