Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

MKL numerical stability and threading

millidred
Beginner
983 Views
Hello

I am wondering how/whether the number of threads affects the numerical results computed by MKL.

Section 8.1 of the MKL (version 10.1) user's guide states that

"With a given Intel MKL version, the outputs will be bit-for-bit identical provided all the following conditions are met:
the outputs are obtained on the same platform;
the inputs are bit-for-bit identical;
the input arrays are aligned identically at 16-byte boundaries."

Does this really mean that I can link the threaded MKL libraries, vary the value of OMP_NUM_THREADS, and still expect bit-for-bit identical results, given the above conditions are met?

Thanks for your comments!
0 Kudos
13 Replies
Dmitry_B_Intel
Employee

Hello,

Yet another condition should be met: MKL must be run in sequential mode. Generally, the precise result of a threaded algorithm may depend not only on the number of threads but also on the order in which the threads are executed. For example, the OpenMP specification states: "...comparing one parallel run to another (even if the number of threads used is the same), there is no guarantee that bit-identical results will be obtained...".

Thanks
Dima
dbacchus
Beginner

Quoting - Dmitry_B_Intel

Hello,

Yet another condition should be met: MKL must be run in sequential mode. Generally, the precise result of a threaded algorithm may depend not only on the number of threads but also on the order in which the threads are executed. For example, the OpenMP specification states: "...comparing one parallel run to another (even if the number of threads used is the same), there is no guarantee that bit-identical results will be obtained...".

Thanks
Dima

Dima, could you please provide a link or reference to the above-mentioned OpenMP specification? It makes a lot of sense, of course: e.g., if one calculates a product of many variables in a parallel loop, the result will depend on the order of multiplication due to truncation errors.
TimP
Honored Contributor III
Quoting - dbacchus

Dima, could you please provide a link or reference to the above-mentioned OpenMP specification? It makes a lot of sense, of course: e.g., if one calculates a product of many variables in a parallel loop, the result will depend on the order of multiplication due to truncation errors.
OpenMP is expected to produce numerical variations when reduction operators are in use.
The quotation appears in the OpenMP standard, http://www.openmp.org/mp-documents/spec30.pdf, p. 98.
dbacchus
Beginner

Thanks, tim18!

TimP
Honored Contributor III
Quoting - tim18
OpenMP is expected to produce numerical variations when reduction operators are in use.
The quotation appears in OpenMP standard http://www.openmp.org/mp-documents/spec30.pdf pg 98
By the way, and off the current topic, much more severe variations are observed in certain implementations of MPI reduction operators. The implementors of Intel MPI (and apparently openmpi) have achieved satisfactory results, so it seems to be treated as a Quality of Implementation issue rather than a standards question.
Certain hybrid OpenMP/MPI applications have options to bypass OpenMP reduction so as to permit changing the number of threads (but not number of MPI processes), without producing numerical variations. Typically, there is a significant performance penalty involved in avoiding OpenMP or MPI reductions.
yuriisig
Beginner
Quoting - tim18
By the way, and off the current topic, much more severe variations are observed in certain implementations of MPI reduction operators. The implementors of Intel MPI (and apparently openmpi) have achieved satisfactory results, so it seems to be treated as a Quality of Implementation issue rather than a standards question.
Certain hybrid OpenMP/MPI applications have options to bypass OpenMP reduction so as to permit changing the number of threads (but not number of MPI processes), without producing numerical variations. Typically, there is a significant performance penalty involved in avoiding OpenMP or MPI reductions.
Is it possible to give a concrete example for OpenMP?
TimP
Honored Contributor III
Quoting - yuriisig
Is it possible to give a concrete example for OpenMP?
Do you mean an example of a commercial application which offers the user a choice of alternate code paths with or without OpenMP reduction? LS-DYNA/SMP and LS-DYNA/hybrid offer such options.
yuriisig
Beginner
Quoting - tim18
LS-DYNA/SMP and LS-DYNA/hybrid offer such options.

Does something simpler than LS-DYNA exist?
TimP
Honored Contributor III
Do you mean how does the source code give a choice of OpenMP reduction or no reduction?
// read user option, set/reset omp_reduction_ok
#pragma omp parallel for reduction(+:sumall) if(omp_reduction_ok)
...

Compiler options that prevent vectorized sum reduction might also be set if there is no control over data alignment, e.g. /fp:source for Intel compilers on Windows, or omitting -ffast-math for GNU compilers. There is no need for alignment-dependent code on recent CPUs such as Barcelona, Core i7, ..., but compilers tend to generate it so as to optimize for earlier CPUs.
millidred
Beginner

Quoting - Dmitry_B_Intel

Hello,

Yet another condition should be met: MKL must be run in sequential mode. Generally, the precise result of a threaded algorithm may depend not only on the number of threads but also on the order in which the threads are executed. For example, the OpenMP specification states: "...comparing one parallel run to another (even if the number of threads used is the same), there is no guarantee that bit-identical results will be obtained...".

Thanks
Dima

Hi Dima

It's strange... Using MKL 10.1 on intel64 I seem to get bit-identical results for 1 to 4 threads. I have tested e.g. the dnrm2 and dgemm BLAS functions.

Regards,
Roman
TimP
Honored Contributor III
Quoting - millidred

It's strange... Using MKL 10.1 on intel64 I seem to get bit-identical results for 1 to 4 threads. I have tested e.g. the dnrm2 and dgemm BLAS functions.

dgemm may not require reduction operators when implemented efficiently. The reduction-like operation in dnrm2 may not involve roundoff variation with the order of operations, even if it is OpenMP-threaded. Anyway, it may be difficult to expose and test all opportunities for the variations the OpenMP standard warns about; the standard tells you only that there is no guarantee.
Dmitry_B_Intel
Employee
Quoting - millidred

Hi Dima

It's strange... Using MKL 10.1 on intel64 I seem to get bit-identical results for 1 to 4 threads. I have tested e.g. the dnrm2 and dgemm BLAS functions.

Regards,
Roman

Hi Roman,

You've been lucky to get bit-for-bit reproducible results in your dgemm tests. Function dnrm2 is not parallel in that version of MKL, so no surprise there. I attach a dgemm test that would fail if run long enough.

Thanks
Dima
millidred
Beginner

Quoting - Dmitry_B_Intel

Hi Roman,

You've been lucky to get bit-for-bit reproducible results in your dgemm tests. Function dnrm2 is not parallel in that version of MKL, so no surprise there. I attach a dgemm test that would fail if run long enough.

Thanks
Dima

Hi Dima

Thanks for your test code. It clearly shows that the parallel dgemm routine does not produce bit-for-bit identical results. I must indeed have been lucky in my tests.

Regards,
Roman
