Hello
I am wondering how/whether the number of threads affects the numerical results computed by MKL.
Section 8.1 of the MKL (version 10.1) user's guide states that
"With a given Intel MKL version, the outputs will be bit-for-bit identical provided all the following conditions are met:
the outputs are obtained on the same platform;
the inputs are bit-for-bit identical;
the input arrays are aligned identically at 16-byte boundaries."
Does this really mean that I can link the threaded MKL libraries, vary the value of OMP_NUM_THREADS, and still expect bit-for-bit identical results, given that the above conditions are met?
Thanks for your comments!
13 Replies
Hello,
Yet another condition should be met: MKL must be run in sequential mode. Generally, the precise result of a threaded algorithm may depend not only on the number of threads but also on the order in which the threads are executed. For example, the OpenMP specification states: "...comparing one parallel run to another (even if the number of threads used is the same), there is no guarantee that bit-identical results will be obtained...".
Thanks
Dima
Quoting - Dmitry Baksheev (Intel)
Hello,
Yet another condition should be met: MKL must be run in sequential mode. Generally, the precise result of a threaded algorithm may depend not only on the number of threads but also on the order in which the threads are executed. For example, the OpenMP specification states: "...comparing one parallel run to another (even if the number of threads used is the same), there is no guarantee that bit-identical results will be obtained...".
Thanks
Dima
Dima, could you please provide a link or reference to the above-mentioned OpenMP specification? It makes a lot of sense, of course: e.g., if one calculates a product of many variables (in a parallel loop), the result will depend on the order of multiplication due to truncation errors.
Quoting - dbacchus
Dima, could you please provide a link or reference to the above-mentioned OpenMP specification? It makes a lot of sense, of course: e.g., if one calculates a product of many variables (in a parallel loop), the result will depend on the order of multiplication due to truncation errors.
The quotation appears in the OpenMP standard, http://www.openmp.org/mp-documents/spec30.pdf, page 98.
Thanks, tim18!
Quoting - tim18
OpenMP is expected to produce numerical variations when reduction operators are in use.
The quotation appears in OpenMP standard http://www.openmp.org/mp-documents/spec30.pdf pg 98
Certain hybrid OpenMP/MPI applications have options to bypass OpenMP reduction so as to permit changing the number of threads (but not number of MPI processes), without producing numerical variations. Typically, there is a significant performance penalty involved in avoiding OpenMP or MPI reductions.
Quoting - tim18
By the way, and off the current topic, much more severe variations are observed in certain implementations of MPI reduction operators. The implementors of Intel MPI (and apparently openmpi) have achieved satisfactory results, so it seems to be treated as a Quality of Implementation issue rather than a standards question.
Certain hybrid OpenMP/MPI applications have options to bypass OpenMP reduction so as to permit changing the number of threads (but not number of MPI processes), without producing numerical variations. Typically, there is a significant performance penalty involved in avoiding OpenMP or MPI reductions.
Quoting - yuriisig
Is it possible to give a concrete example for OpenMP?
Do you mean how the source code can offer a choice between OpenMP reduction and no reduction?
// read user option, set/reset omp_reduction_ok
// note: the if() clause applies to a parallel construct, hence "parallel for"
#pragma omp parallel for reduction(+:sumall) if(omp_reduction_ok)
...
Compiler options that prevent vectorized sum reduction might also be set when there is no control over data alignment, e.g. /fp:source for the Intel Windows compilers, or omitting -ffast-math for the GNU compilers. There is no need for alignment-dependent code on recent CPUs such as Barcelona or Core i7, but compilers tend to generate it in order to optimize for earlier CPUs.
Quoting - Dmitry Baksheev (Intel)
Hello,
Yet another condition should be met: MKL must be run in sequential mode. Generally, the precise result of a threaded algorithm may depend not only on the number of threads but also on the order in which the threads are executed. For example, the OpenMP specification states: "...comparing one parallel run to another (even if the number of threads used is the same), there is no guarantee that bit-identical results will be obtained...".
Thanks
Dima
Hi Dima
It's strange... Using MKL 10.1 on intel64 I seem to get bit-identical results for 1 to 4 threads. I have tested e.g. the dnrm2 and dgemm BLAS functions.
Regards,
Roman
Quoting - millidred
Hi Dima
It's strange... Using MKL 10.1 on intel64 I seem to get bit-identical results for 1 to 4 threads. I have tested e.g. the dnrm2 and dgemm BLAS functions.
Regards,
Roman
Hi Roman,
You've been lucky to get bit-for-bit reproducible results in your dgemm tests. Function dnrm2 is not parallelized in that version of MKL, so no surprise there. I attach a dgemm test that would fail if run long enough.
Thanks
Dima
Quoting - Dmitry Baksheev (Intel)
Hi Roman,
You've been lucky to get bit-for-bit reproducible results in your dgemm tests. Function dnrm2 is not parallelized in that version of MKL, so no surprise there. I attach a dgemm test that would fail if run long enough.
Thanks
Dima
Hi Dima
Thanks for your test code. It clearly shows that the parallel dgemm routine does not produce bit-for-bit identical results. I must have been lucky indeed in my tests.
Regards,
Roman