MKL library produces different results depending on OpenMP thread count

jethro_tull · ‎01-03-2024

On two different platforms, I have seen the MKL library produce different results depending on OpenMP thread count. The particular routine in question is the LAPACK CGETRS() routine. I am using Intel OneAPI rev. 2021.4.0 with the accompanying MKL revision. The two platforms where I saw this problem were:
Intel E5-2699v4 Broadwell, Suse Linux
Intel Xeon E5-4627, RHEL 7.9
Specifically, I input a matrix with dimensions of a few dozen rows and columns into CGETRS() with absolutely no explicit OpenMP in my source code. Then, when I set OMP_NUM_THREADS to 1, I get a particular result. But then I run again with OMP_NUM_THREADS set to 2 (or 3 or ...) and I get another result. The results differ by a small amount (1e-6 or less), but still, shouldn't they be bit-wise equivalent? I want to emphasize that there is no OpenMP in my actual source code.
Has anyone seen this before? Thanks.

George_Silva_Intel · ‎01-04-2024

Hello @jethro_tull ,

oneMKL uses OpenMP so the numbers of threads will impact the results. This is a well know challenge in scientific computation. oneMKL covers this topic in our reference guide and provides some ways to control (links below), note that performance might be impacted in order to ensure reproducibility of results:

Reproducibility Conditions (link)

To get reproducible results from run to run, ensure that the number of threads is fixed and constant. Specifically:

If you are running your program with OpenMP* parallelization on different processors, explicitly specify the number of threads.
To ensure that your application has deterministic behavior with OpenMP* parallelization and does not adjust the number of threads dynamically at run time, set MKL_DYNAMIC and OMP_DYNAMIC to FALSE. This is especially needed if you are running your program on different systems.
If you are running your program with the Intel® Threading Building Blocks parallelization, numerical reproducibility is not guaranteed.

More information on the oneMKL Developer Reference guide

https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-c/current/conditional-numerical-reproducibility-control.html

jethro_tull · ‎01-05-2024

OK. Understood. Thanks. I'm a little surprised that cgetrs() would be a routine that would change dependent on thread count, but OK.

I have tried to puzzle through the pages on CNR mode. If I understand it right, in C I could invoke strict CNR with the call

mkl_cbwr_set(MKL_CBWR_STRICT);

Is that correct? If so, though, what I really need is something in Fortran. Is there such a call in Fortran? Thanks!

George_Silva_Intel · ‎01-05-2024

Thank you for reaching out!

Here is the Developer Reference Guide for Fortran on the CNR topic:

https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-fortran/2024-0/conditional-numerical-reproducibility-control.html

Best Regards,

__

George Silva

jethro_tull · ‎01-08-2024

George,

Thanks for your help. After looking through the site, I have tried adding the following to my code:

integer :: mkl_status, cbwr_branch
cbwr_branch = mkl_cbwr_get_auto_branch()
mkl_status = mkl_cbwr_ser(cbwr_branch)

This code seemed to run without issue but it did not result in any difference in output.

I tried adding a fourth line afterwards:

mkl_status = mkl_cbwr_set(MKL_CBWR_STRICT)

After this call, mkl_status was set to MKL_CBWR_ERR_INVALID_INPUT. And the output was still the same.

Any ideas what I did wrong? Thanks again!