Run-to-Run Reproducibility in OpenMP

JO__Masatoshi · 10-31-2018

Hi,

I want to know how to attain run-to-run reproducibility in my console OpenMP code. This post is a continuation from the thread "debugger can't see the threadprivate variables".

https://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/783162

My program MyApp works in a parallel region as follows.

...
!$OMP PARALLEL SHARED(many variables)
!$OMP DO schedule(dynamic)
do index =1, NumJobs
  call Prg_core(index, & many arguments)
end do
!$OMP END DO NOWAIT
!$OMP END PARALLEL
...

In this example, the number of cores, the number of threads and NumJobs are 16, 13, and 16, respectively.

Every Prg_core reads the same data set and index, and outputs a result depending on the particular index. Prg_core works in its own thread exclusively and there is no data sharing between the threads, except the read-only shared variables. All threadprivate variables are initialized every time at the beginning of Prg_core. Under these circumstances, I think "schedule(dynamic)" would not matter.

When I repeat the whole program several times on the same initial data set, some results for a given index remain the same but others do not, although the indexes that give the same results are almost always the same ones, e.g., Job #1, Job #3, ... Since Prg_core is a non-linear optimization program that follows a long chain of search-and-update steps, thread-specific differences in round-off error may have accumulated to produce such considerably different final output.
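To make the round-off concern concrete, here is a minimal sketch (not taken from MyApp; the program name and the three values are chosen only for illustration) showing that floating-point addition is not associative. Under IEEE-conforming double-precision evaluation the two groupings round differently, so anything that changes the order of accumulation can change the low-order bits that a long optimization chain then amplifies.

```fortran
program sum_order
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  real(real64) :: a, b, c, t, left_first, right_first

  ! Illustrative values only: c is below the spacing of doubles near 1e16.
  a = 1.0e16_real64
  b = -1.0e16_real64
  c = 1.0_real64

  ! Grouping 1: (a + b) + c
  t = a + b
  left_first = t + c

  ! Grouping 2: a + (b + c)
  t = b + c
  right_first = a + t

  print *, '(a + b) + c =', left_first
  print *, 'a + (b + c) =', right_first
end program sum_order
```

Whether a particular build preserves these groupings depends on the floating-point options in effect, which is why the compiler settings matter as well as the threading.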

This tendency is not remedied by setting the environment variable KMP_DETERMINISTIC_REDUCTION, as described in the manual, before launching the app in the command prompt, as follows.

*********************************************************************

set KMP_DETERMINISTIC_REDUCTION=TRUE

MyApp

set KMP_DETERMINISTIC_REDUCTION=

exit

********************************************************************

Here is the compiler command line.

/nologo /MP /O3 /QaxAVX /QxAVX /Qparallel /arch:AVX /Qopenmp /fpscomp:ioformat /fpscomp:general /warn:all /Qopt-report:5 /Qopt-report-phase:vec /Qinit:zero /Qinit:arrays /fpe:1 /fp:strict /module:"x64\Advisor_Rel_181026/" /object:"x64\Advisor_Rel_181026/" /Qvec /Qsimd //Fd"x64\Advisor_Rel_181026\vc120.pdb" /traceback /check:all /libs:static /threads /c

Any suggestion is welcome.

thanks in advance,

Masatoshi

TimP · 10-31-2018

Besides requiring /fp:strict, you could expect reproducibility only with a uniform number of threads. When we've wanted to do this, we've used conditional compilation to bypass all omp parallel reduction (rather than relying on something like KMP_DETERMINISTIC), so it would not work with /Qparallel.

Although it shouldn't make a difference to reproducibility, we always set affinity, e.g. OMP_PLACES=cores.

Likewise, it might not affect reproducibility, but it's confusing when you have so many options over-riding each other in selection of instruction set. It seems simpler to set minimal options rather than checking to see whether /fp:strict is effective (including no use of simd math functions), when combined with those others. /fp:source ought to be as effective as /fp:strict as far as reproducibility is concerned, in case you have a performance question.

jimdempseyatthecove · 10-31-2018

!$OMP PARALLEL SHARED(many variables)
!$OMP DO schedule(dynamic)
do index =1, NumJobs
  call Prg_core(index, & many arguments)
end do
!$OMP END DO NOWAIT
!$OMP END PARALLEL
...

>>...Every Prg_core reads the same data set and index

Don't you mean for each index?

>>and outputs a result depending on the particular index.

Be aware that if the output is written at the end of Prg_core, there are two potential issues to consider: a) the sequence will not necessarily be in increasing index order, and b) if the output is not protected by an !$OMP CRITICAL region, then the output from multiple threads could be interleaved.

>>Prg_core works in its own thread exclusively

Do you mean to say, Prg_core (and anything it calls) does not contain parallel regions?

If so, then /Qparallel might violate that assumption.

Jim Dempsey

gib · 10-31-2018

Often when I'm debugging my OpenMP programs I have to set number-of-cores = 1.

JO__Masatoshi · 11-01-2018

Tim,

Thank you for your comments. I have some questions.

Can you tell me more about how you bypass omp parallel reduction without using KMP_DETERMINISTIC?

What do you mean by "a uniform number of threads"?

To tell the truth, I don't fully understand the difference between /fp:strict and /fp:source, because the descriptions look almost the same to me. Can you recommend some appropriate articles other than the manual?

JO__Masatoshi · 11-01-2018

Jim,

Sorry for confusing notation.

Let me show the program's structure again:
**************************************
program main

! declarations
! arguments input for Prg_core

!$OMP PARALLEL SHARED(many variables)
!$OMP DO schedule(dynamic) ! the only, outermost parallel loop
do JobID =1, NumJobs
call Prg_core(JobID, some constants, number of completed jobs, file names)
end do
!$OMP END DO NOWAIT
!$OMP END PARALLEL

deallocate arrays
stop
end
*****************************************************

This was converted from a stand-alone program that sequentially repeats [input-compute-output] cycles within the loop, according to the given job schedule. In the present version, the job-wise cycle is encapsulated in Prg_core. My purpose is to make it [the number of cores] times faster by calling independent and equivalent instances of Prg_core simultaneously. Each Prg_core loads the JobID-th part of the initial data from a common file like "data.in", and writes the corresponding result to its own file like "JobID.tmp", in parallel.

The next step of my analysis is to examine all the tmp files to determine which one is nearest to the "solution", i.e., which JobID was the best condition. Obviously, numerical fluctuation induced by parallelism must NOT appear in such a comparison, in order to keep the procedure consistent and meaningful.

Inside Prg_core and deeper, no other OMP parallel statements are used. The shared variables are read-only and are copied to their threadprivate counterparts before use. There is no code that exchanges variables between different threads. The I/O of variables is done in Prg_core inside an !$OMP CRITICAL section.
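The per-job output pattern described above might look roughly like the following sketch. It is not the actual Prg_core: the routine name `write_result`, the critical-section name `job_io`, the record layout, and the dummy result value are all assumptions made for illustration.

```fortran
program per_job_output
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer, parameter :: num_jobs = 4   ! hypothetical job count
  integer :: job_id

  !$omp parallel do schedule(dynamic)
  do job_id = 1, num_jobs
    ! Placeholder "result": a real job would compute something here.
    call write_result(job_id, real(job_id, real64) ** 2)
  end do
  !$omp end parallel do

contains

  subroutine write_result(job_id, value)
    integer, intent(in) :: job_id
    real(real64), intent(in) :: value
    character(len=32) :: file_name
    integer :: unit_no

    ! One thread at a time builds the name, opens, writes, and closes.
    !$omp critical (job_io)
    write(file_name, '(i0, a)') job_id, '.tmp'
    open(newunit=unit_no, file=trim(file_name), status='replace', action='write')
    write(unit_no, '(i0, 1x, es24.16)') job_id, value
    close(unit_no)
    !$omp end critical (job_io)
  end subroutine write_result

end program per_job_output
```

Because each job writes to its own file, the order in which threads enter the critical section affects only timing, not file contents.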

Your last comment:
"Do you mean to say, Prg_core (and anything it calls) does not contain parallel regions? If so, then /Qparallel might violate that assumption."
is not clear to me. As shown above, an explicit !$OMP PARALLEL/END PARALLEL is placed only around the outermost loop. The other !$OMP statements are in Prg_core [critical] and in a module [threadprivate] that declares the threadprivate variables. This module is brought in through a "use" statement in the many routines that use threadprivate variables. Do you mean that I must use /Qparallel only when the source being compiled actually includes !$OMP statements? Are different options necessary for different source files?

JO__Masatoshi · 11-01-2018

gib,

Thank you for your comment.
In my opinion, this is not the phase for debugging the logic in the source. I'm afraid that identifying how and when the numerical difference began to grow during execution is an exhausting task, because of the data size and computation time involved. Moreover, as I mentioned in my related thread (see the link in the first post), my system cannot see the threadprivate variables within the VS debugger unless I embed special pointer code.

jimdempseyatthecove · 11-02-2018

The /Qparallel option (which you have shown) informs the compiler to enable auto-parallelism for loops that appear to be effective when parallelized. Ergo, there is, unbeknownst to you, a potential for sections of your code to be parallelized. IOW remove /Qparallel, as you are explicitly performing your own parallelization using the OpenMP directives.

Depending on the version of the Fortran standards used while compiling, local arrays and possibly local user defined types may default to SAVE, and in which case, they become shared variables. To correct for this, consider adding /Qauto (I think, but am not sure, that later Fortran versions default to /Qauto when /Qopenmp is used).

Jim Dempsey

TimP · 11-02-2018

As Jim hinted, /Qparallel and /Qopenmp always implied /Qauto. If a procedure is compiled without /Qauto and called within a parallel region, local arrays are default SAVEd and will produce a race condition. That is likely to cause incorrect results and poor performance, likely neither of them repeatable. The solution which Steve Lionel has recommended is to designate all procedures RECURSIVE (with no SAVE or compile-time initialization by DATA et al in possible parallel regions), which has the same effect as /Qauto, but will avoid differences between compile options.
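As a rough illustration of the point about SAVEd locals, consider the sketch below. The module and routine names are hypothetical. The first routine initializes a local array in its declaration, which gives the array the SAVE attribute under the Fortran standard, so it would be a single static copy shared by all threads. The second is declared RECURSIVE and assigns the array at run time.

```fortran
module scratch_demo
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
contains

  ! NOT thread safe: the initializer implies SAVE, so every thread
  ! calling this in a parallel region updates the same static array.
  subroutine fill_unsafe(job, total)
    integer, intent(in) :: job
    real(real64), intent(out) :: total
    real(real64) :: scratch(4) = 0.0_real64
    scratch = scratch + real(job, real64)
    total = sum(scratch)
  end subroutine fill_unsafe

  ! Safer: RECURSIVE, no initializer, values assigned at run time,
  ! so each call works on its own copy of scratch.
  recursive subroutine fill_safe(job, total)
    integer, intent(in) :: job
    real(real64), intent(out) :: total
    real(real64) :: scratch(4)
    scratch = real(job, real64)
    total = sum(scratch)
  end subroutine fill_safe

end module scratch_demo

program recursive_demo
  use scratch_demo
  implicit none
  integer :: job
  real(real64) :: totals(8)

  ! Only the safe variant is called from the parallel loop.
  !$omp parallel do schedule(dynamic)
  do job = 1, size(totals)
    call fill_safe(job, totals(job))
  end do
  !$omp end parallel do

  print '(8f6.1)', totals
end program recursive_demo
```

Calling `fill_unsafe` from the same loop instead would be a data race, and its results could differ from run to run.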

TimP · 11-02-2018

/Qparallel is meant to work along with /Qopenmp. I believe the compiler should not activate auto-parallel inside visible OpenMP parallel regions. With separate compilation, the compiler will apply auto-parallel to a procedure with no OpenMP. If that procedure is called in a parallel region, that is nested parallelism, which will be active at run time only with OMP_NESTED set (by environment variable or omp_set_nested call). Even without OMP_NESTED, it could create a situation where calling a procedure in a parallel region produces slightly different results than when called outside a parallel region and generating its own multiple threads.

The solution with conditional compilation goes something like this:

ifort /DMP /Qopenmp /fpp *.f90

....

subroutine doparallel

...

#if MP
!$omp parallel
#endif

..

#if _OPENMP
any omp function call
#endif

#if MP
!$omp do
#endif
DO i=1,n
...
END DO

.

#if MP
!$omp end parallel
#endif

In such a simple case, you can turn off OpenMP without the conditional compilation directives, simply by switching /Qopenmp to /Qauto (in case you didn't declare all procedures RECURSIVE), at the expense of getting compile warnings which may require care to distinguish from other diagnostics.

TimP · 11-02-2018

If you don't use a method such as KMP_DETERMINISTIC, results of omp for reduction will not be repeatable when the number of threads changes. For that reason, applications which I have worked on use the conditional compilation to avoid parallel reduction when reproducibility is required. They even go so far as to have separate sections of code with a run time option to select whether reproducibility is wanted at the expense of performance. Then the conditional compilation avoids more local changes in source code between the reproducibility code and the full parallel one.
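One way to avoid a threaded reduction altogether, sketched here under assumptions (the array name `partial` and the loop body are placeholders, and this is not necessarily how the applications mentioned above do it), is to let each iteration store its contribution in its own array element and then add the elements serially in index order. The accumulation order is then fixed by the source rather than by thread scheduling or thread count.

```fortran
program ordered_sum
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer, parameter :: n = 1000        ! hypothetical problem size
  real(real64) :: partial(n), total
  integer :: i

  ! Parallel part: each iteration writes only its own element,
  ! so there is no reduction clause and no shared accumulator.
  !$omp parallel do schedule(dynamic)
  do i = 1, n
    partial(i) = 1.0_real64 / real(i, real64)
  end do
  !$omp end parallel do

  ! Serial part: contributions are added in index order every time.
  total = 0.0_real64
  do i = 1, n
    total = total + partial(i)
  end do

  print *, 'total =', total
end program ordered_sum
```

The cost is the extra array and a serial pass, which is usually small when the per-iteration work dominates.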

ifort doesn't distinguish /fp:source from /fp:precise as ICL does. Either of these options will turn off simd reduction operations, calls to the svml library, and, presumably, set /Qprec-div /Qprec-sqrt. These changes generally have opposite effects on accuracy. simd reduction usually improves accuracy by an unpredictable amount, varying according to which instruction set you choose (/arch:AVX etc.) and with the data set. The svml (vectorized) math libraries produce less accurate results, varying with CPU architecture, for some math functions, particularly exp() and ** with large magnitude arguments. You have separate options to specify svml or not without allowing /fp: to make the choice.

You also have /Qftz- to select IEEE standard gradual underflow regardless of /fp: setting. This is probably advisable for any AVX capable CPU, as Intel made a strong effort to eliminate the associated performance effect for addition/subtraction (as that is more important than the performance of multiplication with underflow). This option actually takes effect only for compilation of the main program, as it is done in the CPU initialization. Presumably you could, as an alternative, always compile the main program with /fp:strict, as you don't do much in the main which affects performance.

The differences between /fp:precise and /fp:strict shouldn't affect reproducibility, if you don't change between these options. You would use /fp:strict if you want reproducible results between runs with and without IEEE exceptions.

These differences among /fp: options don't affect run to run reproducibility with a given data set, even if number of cores changes. The /Qx options do make results unreproducible among different CPU architectures (possibly even without /fp:precise and /Qfma-). This is one of the reasons for limiting the number of /arch: options to those actually needed to realize the performance potential of the range of CPUs you support and avoiding /Qx (and possibly setting /Qimf-arch-consistency /Qprec-div /Qprec-sqrt).

Of the recent Intel architectures, only the MIC suffers much in performance from /Qprec-div /Qprec-sqrt, and that doesn't concern you now that the idea of supporting MIC is vanishing. So those options are normally used when reproducibility or accuracy is a consideration.

gib · 11-02-2018

JO,

Tracking down OpenMP issues can be a nightmare, especially if (like me) your understanding is limited. If you are using parallelisation in multiple places you might consider turning it off selectively to try to locate the code section where differences develop. I do understand the difficulty of debugging when problems emerge only after long execution time - been there, done that.

Best of luck

Gib

TimP · 11-02-2018

You may have seen this by now:

Beginning with ifort 17, the option /fp:consistent is available. This is the same as the combination /fp:precise /Qfma- /Qimf-arch-consistency:true (I have trouble myself remembering to append the true, without which that option is ignored).

For example, these options should make AVX2 results identical numerically to AVX.

Martyn_C_Intel · 11-06-2018

Hi JO,

If your app contains any OpenMP (i.e., threaded) reductions, it will not be reproducible with dynamic scheduling, even if the number of threads does not change and you specify KMP_DETERMINISTIC_REDUCTION=true. Your description implies that this is not the case. Nevertheless, it would be worth a test replacing dynamic by static.

Uninitialized variables can sometimes cause variations in results. I can’t remember whether /Qinit works for variables that appear in a PRIVATE clause, but I would not count on it.
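A small sketch of the PRIVATE pitfall, with hypothetical names: a variable listed in a PRIVATE clause starts out undefined in each thread, whatever value it had before the region, so it has to be assigned inside the region (or listed as FIRSTPRIVATE if the incoming value is wanted).

```fortran
program private_init
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer, parameter :: n = 8
  real(real64) :: acc, results(n)
  integer :: i, k

  acc = 0.0_real64   ! this value is NOT carried into the PRIVATE copies

  !$omp parallel do private(acc, k) schedule(static)
  do i = 1, n
    acc = 0.0_real64            ! explicit initialization of the private copy
    do k = 1, i
      acc = acc + real(k, real64)
    end do
    results(i) = acc
  end do
  !$omp end parallel do

  print '(8f7.1)', results
end program private_init
```

Removing the assignment inside the loop would leave `acc` undefined at first use, and the results could then vary between runs.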

Race conditions can cause variations in results. As others have already pointed out, Prg_core and any functions that it calls need to be thread safe, so compiled with /Qopenmp or /Qauto or /recursive. They should have no static (SAVEd) variables that are written to. Initializing variables in a DATA statement makes them static.

I would also omit the NOWAIT clause while you are debugging. I have seen examples where that introduced unexpected race conditions.
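For testing, the loop from the original post could be changed along these lines. This is only a sketch: `prg_core_stub` and its placeholder computation stand in for the real Prg_core, and the `results` array is an assumption added so the program has something to print.

```fortran
program static_debug
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer, parameter :: num_jobs = 16
  real(real64) :: results(num_jobs)
  integer :: job_id

  !$omp parallel shared(results)
  !$omp do schedule(static)      ! static instead of dynamic, for testing
  do job_id = 1, num_jobs
    call prg_core_stub(job_id, results(job_id))
  end do
  !$omp end do                   ! NOWAIT omitted: the implicit barrier stays
  !$omp end parallel

  print '(i3, 1x, es24.16)', (job_id, results(job_id), job_id = 1, num_jobs)

contains

  ! Hypothetical stand-in for Prg_core: depends only on its job index.
  subroutine prg_core_stub(job_id, res)
    integer, intent(in) :: job_id
    real(real64), intent(out) :: res
    res = sqrt(real(job_id, real64))
  end subroutine prg_core_stub

end program static_debug
```

With static scheduling and a fixed thread count, each job index is assigned to the same thread on every run, which makes any thread-dependent behavior easier to isolate.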

Your command line shows some misunderstandings and contains several options that contradict each other, again as others have pointed out.

I would remove /QaxAVX and /arch:AVX (since you are using /QxAVX) and /Qparallel (since you are using /Qopenmp). Also omit /Qsimd and /Qvec, since vectorization is enabled by default at /O2 and above.

Except when debugging, I would also remove /check:all. I believe the latter enforces /Od, so overrides all other optimization switches except /Qopenmp.

/fp:strict is OK but a somewhat heavy hammer, /fp:precise is usually sufficient unless you are modifying the default floating-point environment. If your goal is only to get reproducible results from one run to the next for the same executable and data, /Qopt-dynamic-align- might be sufficient.

If there are no coding errors and no threaded reductions, these options should be sufficient to obtain reproducible results. Since you were effectively compiling at /Od, I think the most likely source of the variations you see is the threading, either a race condition or a threaded reduction (which is a special sort of race condition).

You can read more about floating-point reproducibility in my article attached at

https://software.intel.com/en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler/

There’s also a shorter, older article that specifically addresses run-to-run reproducibility at

https://software.intel.com/en-us/articles/run-to-run-reproducibility-of-floating-point-calculations-for-applications-on-intel-xeon

There’s an article about threading Fortran applications using the Intel compiler at

https://software.intel.com/en-us/articles/threading-fortran-applications-for-parallel-performance-on-multi-core-systems/

Martyn_C_Intel · 11-06-2018

I should add that, if static scheduling does not help, you might consider using Intel Inspector to look for possible thread safety issues (race conditions). Inspector could probably also help spot uninitialized variables.

jimdempseyatthecove · 11-07-2018

Try

!$omp do ordered schedule(dynamic, ChunkSizeYouDetermineWhatIsBest)
do index=1,...
call Prg_core(...
end do

subroutine Prg_core(...
...
!$omp ordered
! your reduction here
!$omp end ordered
end subroutine Prg_core

The chunk size should be a size that does not vary with varying numbers of threads. The ordered clause ensures that chunks are started in order; the ordered region ensures that the ordered section is processed in chunk order.

Jim Dempsey

jimdempseyatthecove · 11-07-2018

Note 1, the chunk subsection, for the same input data, should be reproducible as long as the chunk size does not change.

Note 2, as long as the chunk subsection contributions (Note 1) to the reduced value are accumulated in the same order, then reproducibility will be maintained.

Note 3, varying the chunk size can vary the end result of the reduction. Varying the thread count (using same chunksize) will not.
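A self-contained version of the ordered pattern might look like the sketch below. The routine name `job`, the chunk size of 1, and the placeholder contribution are assumptions for illustration. The expensive part of each job runs in parallel, and only the accumulation into the shared total is serialized in iteration order.

```fortran
program ordered_demo
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer, parameter :: n = 16
  integer :: i
  real(real64) :: total

  total = 0.0_real64

  !$omp parallel do ordered schedule(dynamic, 1) shared(total)
  do i = 1, n
    call job(i, total)
  end do
  !$omp end parallel do

  print *, 'total =', total

contains

  subroutine job(i, total)
    integer, intent(in) :: i
    real(real64), intent(inout) :: total
    real(real64) :: contribution

    ! Parallel part: placeholder for the real per-job computation.
    contribution = sqrt(real(i, real64))

    ! Serialized part: executed in loop-iteration order, so the
    ! additions happen in the same sequence regardless of thread count.
    !$omp ordered
    total = total + contribution
    !$omp end ordered
  end subroutine job

end program ordered_demo
```

If the ordered section is large relative to the parallel part, threads spend most of their time waiting for their turn, so this suits cases where the accumulation is cheap.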

Jim Dempsey