Re: MKL results reproducibility

Mariuccio · ‎06-01-2021

We are seeing unexpected behavior when using the MKL library. Our application uses MKL to solve a sparse linear system with the Conjugate Gradient method (see attached file, which is basically a simple variant of one of the sample codes).

When we change the total number of MKL threads we obtain slightly different results across different systems, and on the same CPU as well. Both CPUs are Intel Xeon and both support the AVX2 instruction set (see below).

In particular, if we use just one thread the application produces identical results on both CPUs, whereas if we increase the number of threads up to 8 the results change slightly. To fix the number of threads we use the environment variable MKL_NUM_THREADS.

To force the CPUs to use the same vector instructions we set the MKL_CBWR variable to "AVX2".
We observed the same unexpected behavior when changing the number of threads on the same CPU.
The MKL library version is 2019.0.117 and the compiler is gcc 4.8.5. The binary is statically linked.

Is there any other env variable or "trick" we can use to get the same results on the two systems?

JohnNichols · ‎06-02-2021

>>we obtain slightly different results over different systems and on the same CPU as well

The question I would ask, as someone who does a lot of statistics and knows that numbers change in computers even for the same computation, is: how large is the difference and, more importantly, is it statistically significant?

It is akin to saying 0 is zero; it really is not, unless it is an int.
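One way to put a number on "how large is the difference" is to measure it both as a relative error and as a distance in units in the last place (ULPs). The following is only a sketch for finite doubles; the helper names are hypothetical and the sample values are illustrative, not taken from the application discussed in this thread.

```python
import struct

def to_ordered_int(x):
    """Map a double to an integer so that adjacent doubles map to adjacent integers."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Negative doubles have the sign bit set; mirror them below zero.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)

def ulp_distance(a, b):
    """Number of representable-double steps from a to b (finite inputs only)."""
    return abs(to_ordered_int(a) - to_ordered_int(b))

def relative_difference(a, b):
    """|a - b| scaled by the larger magnitude; 0.0 when both are zero."""
    scale = max(abs(a), abs(b))
    return 0.0 if scale == 0.0 else abs(a - b) / scale

# Illustrative values only: two results that look alike but are not bit-identical.
a = 0.1 + 0.2
b = 0.3
print(ulp_distance(a, b), relative_difference(a, b))
```

A ULP distance of 0 means the two values are bit-identical (up to the sign of zero); a distance of a few ULPs is the typical size of a pure rounding-order effect.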

JohnNichols · ‎06-02-2021

I was reminded of this yesterday in the supermarket (you know, the 1880s version of Amazon for us old-timers) when I saw a young lady wearing a T-shirt that said

f(x) = |x| get rid of the negativity. I wanted to say that sometimes a very small negative zero is actually positive statistically, but I knew she was a math major and probably believes in the sanctity of numbers.

Mariuccio · ‎06-03-2021

Dear John, thanks for your reply. You are a Valued Contributor and probably English is your first language and I greatly respect you.

However, I have a hard time understanding your humor. My post probably is not in perfect English but it should be clear enough that

we expect EXACTLY the same results (bit-to-bit identical) since:

1) The executable file is the SAME on the two systems (statically linked; ldd reports "not a dynamic executable").

2) We are using the SAME number of MKL threads on the two systems (8).

3) We set the MKL_CBWR=AVX2 env variable, and AVX2 is supported by both CPUs (both Xeon, although different models).

4) (Not mentioned in the previous post) We double-checked with valgrind that there are no uninitialized variables.

5) Using ONE thread we obtain EXACTLY the same (bit-to-bit identical) results on the two systems.

Now, since you are a Valued Contributor I, do you have an explanation for why we obtain different results (even if only by a single bit) on the two systems? Do you have a suggestion for addressing the issue, or just another joke to offer?

Thanks again for your collaboration and best regards,

Mariuccio

MRajesh_intel · ‎06-03-2021

Hi,

Can you please provide details of build steps and prerequisite files for running the code?

Regards

Rajesh.

Mariuccio · ‎06-03-2021

Hi Rajesh,

We could share with you the statically linked executable and the inputs needed by the application. The total size of the data and application is roughly 120MB. Is that OK for you?

All the best,

Mariuccio

MRajesh_intel · ‎06-03-2021

Hi,

If possible, try to share minimal reproducer code. Otherwise, you can share the application in .zip format.

Regards

Rajesh.

Mariuccio · ‎06-03-2021

Dear Rajesh,

I am sharing the application and the data it needs to run. The application has been compiled for Linux. To run it, unzip the archive, open a terminal in the directory of the unzipped files, and then:

Set the MKL and OMP thread counts to 1 and the instruction set to AVX2, then run the application:
- export MKL_NUM_THREADS=1
- export OMP_NUM_THREADS=1
- export MKL_CBWR=AVX2
- ./myapp -aff transform.mat -ref ref_image.nii -in input.nii -config config.ini -cout img1.nii -intout intensities1 -iout iout1.nii
Set the MKL and OMP thread counts to 8 and the instruction set to AVX2, then run the application:
- export MKL_NUM_THREADS=8
- export OMP_NUM_THREADS=8
- export MKL_CBWR=AVX2
- ./myapp -aff transform.mat -ref ref_image.nii -in input.nii -config config.ini -cout img8.nii -intout intensities8 -iout iout8.nii

To see the differences, compare the intensities1.txt and intensities8.txt files.
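A byte-level comparison of two output files can be sketched as below. This is a minimal illustration, not part of the shared archive: `first_difference` is a hypothetical helper, and the demo writes two small stand-in files to a temporary directory instead of reading the real intensities output.

```python
import os
import tempfile

def first_difference(path_a, path_b, chunk_size=1 << 20):
    """Return the byte offset of the first difference, or None if the files are identical."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(chunk_size)
            block_b = fb.read(chunk_size)
            if block_a != block_b:
                for i, (x, y) in enumerate(zip(block_a, block_b)):
                    if x != y:
                        return offset + i
                # One file is a prefix of the other.
                return offset + min(len(block_a), len(block_b))
            if not block_a:
                return None
            offset += len(block_a)

# Demo with stand-in files (hypothetical names and contents).
with tempfile.TemporaryDirectory() as tmp:
    run1 = os.path.join(tmp, "run1.txt")
    run8 = os.path.join(tmp, "run8.txt")
    with open(run1, "w") as f:
        f.write("0.30000000000000004\n")
    with open(run8, "w") as f:
        f.write("0.3\n")
    same = first_difference(run1, run1)
    diff = first_difference(run1, run8)
print(same, diff)
```

Pointing the same function at the two intensities files from the 1-thread and 8-thread runs reports whether they are bit-identical and, if not, where they first diverge.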

If you need more information, please let me know.

All the best,

Mariuccio

MRajesh_intel · ‎06-04-2021

Hi,

>>Is there any other env variable o "trick" we can play to get the same results on the two systems?

You may try setting MKL_DYNAMIC and OMP_DYNAMIC to FALSE and verify the results. If you need bitwise reproducible results, try strict CNR mode by setting MKL_CBWR=AVX2,STRICT. Please refer to the documentation link below for more information.

Link: https://software.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top/support-functions/conditional-numerical-reproducibility-control/mkl-cbwr-set.html

Please let us know if you face any issues.

Regards

Rajesh.

Mariuccio · ‎06-04-2021

Hi Rajesh,

thank you for your feedback and your help, which I really appreciate. I would like to ask you to share the files you obtained by running the application. Since the 8-thread and 1-thread runs gave you the same results, you can share just these:

img8.nii
intensities8.nii.gz and intensities8.txt
iout8.nii

All the best,

Mariuccio

MRajesh_intel · ‎06-06-2021

Hi,

Below are the files produced when running in strict CNR mode. Let us know if you have any issues.

Regards

Rajesh.

MRajesh_intel · ‎06-14-2021

Hi,

Can you please let us know whether your issue has been resolved?

Note that in strict CNR mode, Intel® oneAPI Math Kernel Library provides bitwise reproducible results for a limited set of functions and code branches even when the number of threads changes.

For further reference please visit the link below:

https://software.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top/support-functions/conditional-numerical-reproducibility-control/reproducibility-conditions.html
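As background on why the thread count matters at all: floating-point addition is not associative, so a reduction (such as the dot products inside a Conjugate Gradient iteration) that is split into per-thread partial sums can round differently from a single sequential loop. The sketch below is plain Python, not MKL code; the contiguous chunking is a hypothetical stand-in for how work might be divided among threads, and the input values are deliberately extreme to make the effect visible.

```python
def sequential_sum(values):
    """Add the values left to right, as a single thread would."""
    total = 0.0
    for v in values:
        total += v
    return total

def chunked_sum(values, n_chunks):
    """Sum contiguous chunks separately, then add the partial sums."""
    size = -(-len(values) // n_chunks)  # ceiling division
    partials = [sequential_sum(values[i:i + size])
                for i in range(0, len(values), size)]
    return sequential_sum(partials)

# Same numbers, different grouping.
values = [1.0, 1e17, -1e17, 1.0]
one_chunk = chunked_sum(values, 1)
two_chunks = chunked_sum(values, 2)
print(one_chunk, two_chunks)  # prints "1.0 0.0"
```

In a real solver the discrepancy is usually in the last few bits rather than this large, but an iterative method can amplify it from one iteration to the next.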

Thanks

Rajesh.

Mariuccio · ‎06-14-2021

Hi Rajesh,

unfortunately the proposed solution didn't work in our case: we still obtain different results on different machines. Here is the lscpu output for both:

First CPU:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 94
Model name: Intel(R) Xeon(R) CPU E3-1505M v5 @ 2.80GHz
Stepping: 3
CPU MHz: 1600.126
CPU max MHz: 3700.0000
CPU min MHz: 800.0000
BogoMIPS: 5616.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d

Second CPU:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 112
On-line CPU(s) list: 0-111
Thread(s) per core: 2
Core(s) per socket: 28
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6238R CPU @ 2.20GHz
Stepping: 7
CPU MHz: 1015.942
CPU max MHz: 4000.0000
CPU min MHz: 1000.0000
BogoMIPS: 4400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 39424K
NUMA node0 CPU(s): 0-27,56-83
NUMA node1 CPU(s): 28-55,84-111
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities

Regards,

Mariuccio

MRajesh_intel · ‎06-28-2021

Hi,

According to the Conditional Numerical Reproducibility (CNR) documentation for MKL, to ensure that Intel MKL calls return the same results on every Intel CPU supporting AVX2 instructions, make sure that your application uses a fixed number of threads. CNR, controlled through MKL_CBWR, provides bit-for-bit identical results when the number of OpenMP threads is the same from run to run. For this, set the following environment variables:

export MKL_NUM_THREADS=8

export OMP_NUM_THREADS=8

export MKL_DYNAMIC=FALSE

export OMP_DYNAMIC=FALSE

export MKL_ENABLE_INSTRUCTIONS=AVX2

export MKL_CBWR=AVX2
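If it is more convenient to pin these settings per run than in the shell session, the same variables can be passed to a child process. The following is only a sketch: `run_with_cnr` is a hypothetical helper, and the command launched here is a stand-in probe that echoes two of the variables, to be replaced by the real application command line.

```python
import os
import subprocess
import sys

# The same settings as the export lines above.
CNR_ENV = {
    "MKL_NUM_THREADS": "8",
    "OMP_NUM_THREADS": "8",
    "MKL_DYNAMIC": "FALSE",
    "OMP_DYNAMIC": "FALSE",
    "MKL_ENABLE_INSTRUCTIONS": "AVX2",
    "MKL_CBWR": "AVX2",
}

def run_with_cnr(command):
    """Run `command` with the CNR-related variables added to the current environment."""
    env = dict(os.environ)
    env.update(CNR_ENV)
    return subprocess.run(command, env=env, capture_output=True,
                          text=True, check=True)

# Stand-in command: a child Python process that reports what it sees.
probe = [sys.executable, "-c",
         "import os; print(os.environ['MKL_CBWR'], os.environ['MKL_NUM_THREADS'])"]
result = run_with_cnr(probe)
seen = result.stdout.strip()
print(seen)
```

Setting the variables this way guarantees they are in place before the child process loads MKL, and keeps the settings for each run recorded in one place.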

Please let us know if you want any more information.

There is a file attached to this post with the results.

Regards

Rajesh.

Mariuccio · ‎06-28-2021

Hi Rajesh,

we set the variables as you suggested, but it doesn't work. We do not need more information; thank you for trying to help us.

All the best,

Mariuccio

MRajesh_intel · ‎06-28-2021

Hi,

Since you don't need any additional assistance from Intel, we will no longer respond to this thread. Please start a new thread if you need any further information. Any further interaction in this thread will be considered community only.

Have a good day.

Regards

Rajesh

Ronny123 · ‎07-11-2021

Hi Mariuccio,

I faced the same problem a few days ago. I found the following articles helpful (see below).

They are very useful.

https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onemkl.html#gs.5ms0qx

https://www2.cisl.ucar.edu/resources/software/math-kernel-library-mkl

I hope these articles help you.