Hello
I am wondering how/whether the number of threads affects the numerical results computed by MKL.
Section 8.1 of the MKL (version 10.1) user's guide states that
"With a given Intel MKL version, the outputs will be bit-for-bit identical provided all the following conditions are met:
the outputs are obtained on the same platform;
the inputs are bit-for-bit identical;
the input arrays are aligned identically at 16-byte boundaries."
Does this really mean that I can link the threaded MKL libraries, vary the value of OMP_NUM_THREADS, and still expect bit-for-bit identical results, given that the above conditions are met?
Thanks for your comments!
13 Replies
Hello,
Yet another condition should be met: MKL must be run in sequential mode. Generally, the precise result of a threaded algorithm may depend not only on the number of threads but also on the order in which the threads are executed. For example, the OpenMP specification states: "...comparing one parallel run to another (even if the number of threads used is the same), there is no guarantee that bit-identical results will be obtained...".
Thanks
Dima
Quoting - Dmitry Baksheev (Intel)
Hello,
Yet another condition should be met: MKL must be run in sequential mode. Generally, the precise result of a threaded algorithm may depend not only on the number of threads but also on the order in which the threads are executed. For example, the OpenMP specification states: "...comparing one parallel run to another (even if the number of threads used is the same), there is no guarantee that bit-identical results will be obtained...".
Thanks
Dima
Dima, could you please provide a link or reference to the above-mentioned OpenMP specification? It makes a lot of sense, of course: e.g., if one calculates a product of many variables (in a parallel loop), the result will depend on the order of multiplication due to truncation errors.
Quoting - dbacchus
Dima, could you please provide a link or reference to the above-mentioned OpenMP specification? It makes a lot of sense, of course: e.g., if one calculates a product of many variables (in a parallel loop), the result will depend on the order of multiplication due to truncation errors.
The quotation appears in the OpenMP standard, http://www.openmp.org/mp-documents/spec30.pdf, page 98.
Thanks, tim18!
Quoting - tim18
OpenMP is expected to produce numerical variations when reduction operators are in use.
The quotation appears in OpenMP standard http://www.openmp.org/mp-documents/spec30.pdf pg 98
Certain hybrid OpenMP/MPI applications have options to bypass OpenMP reduction so as to permit changing the number of threads (but not number of MPI processes), without producing numerical variations. Typically, there is a significant performance penalty involved in avoiding OpenMP or MPI reductions.
Quoting - tim18
By the way, and off the current topic, much more severe variations are observed in certain implementations of MPI reduction operators. The implementors of Intel MPI (and apparently openmpi) have achieved satisfactory results, so it seems to be treated as a Quality of Implementation issue rather than a standards question.
Certain hybrid OpenMP/MPI applications have options to bypass OpenMP reduction so as to permit changing the number of threads (but not number of MPI processes), without producing numerical variations. Typically, there is a significant performance penalty involved in avoiding OpenMP or MPI reductions.
Quoting - yuriisig
Is it possible to give a concrete example for OpenMP?
Do you mean how the source code can offer a choice between OpenMP reduction and no reduction?
// read user option, set/reset omp_reduction_ok
// note: the if() clause applies to a parallel construct, hence "parallel for"
#pragma omp parallel for reduction(+:sumall) if(omp_reduction_ok)
...
Compiler options that prevent vectorized sum reduction might also be set when there is no control over data alignment, e.g. /fp:source for the Intel Windows compilers, or omitting -ffast-math for the GNU compilers. There is no need for alignment-dependent code on recent CPUs such as Barcelona or Core i7, but compilers tend to generate it in order to optimize for earlier CPUs.
Quoting - Dmitry Baksheev (Intel)
Hello,
Yet another condition should be met: MKL must be run in sequential mode. Generally, the precise result of a threaded algorithm may depend not only on the number of threads but also on the order in which the threads are executed. For example, the OpenMP specification states: "...comparing one parallel run to another (even if the number of threads used is the same), there is no guarantee that bit-identical results will be obtained...".
Thanks
Dima
Hi Dima
It's strange... Using MKL 10.1 on intel64 I seem to get bit-identical results for 1 to 4 threads. I have tested e.g. the dnrm2 and dgemm BLAS functions.
Regards,
Roman
Quoting - millidred
Hi Dima
It's strange... Using MKL 10.1 on intel64 I seem to get bit-identical results for 1 to 4 threads. I have tested e.g. the dnrm2 and dgemm BLAS functions.
Regards,
Roman
Hi Roman,
You've been lucky to get bit-for-bit reproducible results in your dgemm tests. Function dnrm2 is not parallelized in that version of MKL, so no surprise there. I attach a dgemm test that would fail if run long enough.
Thanks
Dima
Quoting - Dmitry Baksheev (Intel)
Hi Roman,
You've been lucky to get bit-for-bit reproducible results in your dgemm tests. Function dnrm2 is not parallelized in that version of MKL, so no surprise there. I attach a dgemm test that would fail if run long enough.
Thanks
Dima
Hi Dima
Thanks for your test code. It clearly shows that the parallel dgemm routine does not produce bit-for-bit identical results. I must have been lucky indeed in my tests.
Regards,
Roman