We observed the same unexpected behavior when changing the number of threads on the same CPU.
The MKL library version is 2019.0.117 and the compiler is gcc 4.8.5. The binary is statically linked.
we obtain slightly different results over different systems and on the same CPU as well
As someone who does a lot of statistics, where numbers change on computers even when they are nominally the same number, the question I would ask is: how large is the difference and, more importantly, is it statistically significant?
It is akin to saying 0 is zero; it really is not, unless it is an int.
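One way to put a number on "how large is the difference" is to count how many representable doubles lie between two results (their distance in ULPs). The following is only a minimal sketch using the Python standard library; the names `ulp_distance` and `summarize` are made up for illustration, and it assumes the outputs are IEEE 754 doubles without NaNs.

```python
import struct


def ulp_distance(a: float, b: float) -> int:
    """Count the representable doubles between a and b (0 means same value)."""
    def ordered(x: float) -> int:
        # Reinterpret the 64-bit pattern of the double as a signed integer.
        bits = struct.unpack("<q", struct.pack("<d", x))[0]
        # Map sign-magnitude patterns onto one monotonically ordered integer line.
        return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)
    return abs(ordered(a) - ordered(b))


def summarize(reference, candidate):
    """Summarize how far two equally long sequences of doubles are apart."""
    pairs = list(zip(reference, candidate))
    return {
        "mismatches": sum(1 for r, c in pairs if ulp_distance(r, c) != 0),
        "max_ulp": max(ulp_distance(r, c) for r, c in pairs),
        "max_abs_diff": max(abs(r - c) for r, c in pairs),
    }
```

A result of `mismatches == 0` corresponds to the bit-to-bit identity discussed in this thread (apart from +0.0 versus -0.0, which this sketch treats as equal); a small `max_ulp` suggests rounding-order effects rather than a logic error.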
I was reminded of this yesterday in the supermarket (you know, the 1880s version of Amazon for us old-timers), when a young lady wore a T-shirt that said "f(x) = |x| get rid of the negativity". I wanted to say that sometimes a very small negative zero is actually positive, statistically. But I knew she was a math major and probably believes in the sanctity of numbers.
Dear John, thanks for your reply. You are a Valued Contributor, English is probably your first language, and I greatly respect you.
However, I have a hard time understanding your humor. My post is probably not in perfect English, but it should be clear enough that
we expect EXACTLY the same results (bit-to-bit identical) since:
1) The executable file is the SAME on the two systems (statically linked; ldd reports "not a dynamic executable").
2) We are using the SAME number of MKL threads on the two systems (8).
3) We set the MKL_CBWR=AVX2 environment variable, and AVX2 is supported by both CPUs (both are Xeon, although different models).
4) (Not mentioned in the previous post) We double-checked with valgrind that there are no uninitialized variables.
5) Using ONE thread we obtain EXACTLY the same (bit-to-bit identical) results on the two systems.
Now, since you are a Valued Contributor I, do you have an explanation for the fact that we obtain "different" results (even if the difference were a single bit) on the two systems? Do you have a suggestion to address the issue, or do you have just another joke to suggest?
Thanks again for your collaboration and best regards,
Mariuccio
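As background for why one thread can be reproducible while eight threads are not: floating-point addition is not associative, so a reduction split into partial sums can round differently from a sequential one. The sketch below only illustrates that general mechanism; it is not a diagnosis of this application, and `chunked_sum` is a made-up helper. The loops are written out by hand because recent Python versions apply compensated summation inside the built-in `sum()`, which would hide the effect.

```python
def plain_sum(values):
    """Add values strictly left to right with ordinary double rounding."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, n_chunks):
    """Split values into contiguous chunks, sum each chunk, then add the partial sums.

    This mimics how a multi-threaded reduction may combine per-thread results.
    """
    size = -(-len(values) // n_chunks)  # ceiling division
    partials = [plain_sum(values[i:i + size]) for i in range(0, len(values), size)]
    return plain_sum(partials)


values = [1e16, 1.0, -1e16, 1.0]
one_chunk = chunked_sum(values, 1)   # ((1e16 + 1.0) - 1e16) + 1.0
two_chunks = chunked_sum(values, 2)  # (1e16 + 1.0) + (-1e16 + 1.0)
```

Here `one_chunk` is 1.0 and `two_chunks` is 0.0, although both add the same four numbers. If the way work is partitioned differs between two machines, the same kind of reordering can occur even with an identical executable and thread count.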
Hi,
Can you please provide details of build steps and prerequisite files for running the code?
Regards
Rajesh.
Hi Rajesh,
We could share with you the statically linked executable and the inputs needed by the application. The data and application together are roughly 120 MB. Is that OK for you?
All the best,
Mariuccio
Hi,
If possible, please share a minimal reproducer. Otherwise, you can share the files in .zip format.
Regards
Rajesh.
Dear Rajesh,
I am sharing the application and the data it needs to run. The application has been compiled for Linux. To run it, unzip the archive, open a terminal in the directory of the unzipped files, and then:
- set the MKL and OMP thread counts to 1 and the instruction set to AVX2, then run the application:
  - export MKL_NUM_THREADS=1
  - export OMP_NUM_THREADS=1
  - export MKL_CBWR=AVX2
  - ./myapp -aff transform.mat -ref ref_image.nii -in input.nii -config config.ini -cout img1.nii -intout intensities1 -iout iout1.nii
- set the MKL and OMP thread counts to 8 and the instruction set to AVX2, then run the application:
  - export MKL_NUM_THREADS=8
  - export OMP_NUM_THREADS=8
  - export MKL_CBWR=AVX2
  - ./myapp -aff transform.mat -ref ref_image.nii -in input.nii -config config.ini -cout img8.nii -intout intensities8 -iout iout8.nii
To see the differences, compare the intensities .txt files written by the two runs.
If you need more information, please let me know.
All the best,
Mariuccio
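To check whether two runs really are bit-to-bit identical, the output files can be compared byte for byte. The following is a minimal sketch with the Python standard library; the function names are made up for illustration, and the file names mentioned afterwards (`intensities1.txt`, `intensities8.txt`) are assumed from the `-intout` arguments above rather than confirmed.

```python
import hashlib
from itertools import zip_longest


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def first_difference(path_a, path_b):
    """Return the 1-based number of the first differing line, or None if identical."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        # zip_longest pads the shorter file with None, so a length mismatch is reported too.
        for lineno, (line_a, line_b) in enumerate(zip_longest(fa, fb), start=1):
            if line_a != line_b:
                return lineno
    return None
```

For example, `file_digest("intensities1.txt") == file_digest("intensities8.txt")` would confirm identical output, and `first_difference(...)` points to the first line worth inspecting when the digests differ.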
Hi,
>>Is there any other env variable o "trick" we can play to get the same results on the two systems?
You may try setting MKL_DYNAMIC and OMP_DYNAMIC to FALSE and verify the results. In case you need bitwise reproducible results, try using strict CNR mode by setting MKL_CBWR=AVX2,STRICT. Please refer to the below documentation link for more information.
Please let us know if you face any issues.
Regards
Rajesh.
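The settings suggested above can be collected in one place so that every run uses exactly the same environment. This is a sketch only: `cnr_env` and `run_app` are hypothetical helper names, the variable names are the ones quoted in this thread, and whether the `STRICT` suffix is honored depends on the MKL release in use, which has not been verified here for 2019.0.117.

```python
import os
import subprocess


def cnr_env(threads, isa="AVX2", strict=False, base=None):
    """Build an environment with a fixed thread count and a pinned MKL code branch."""
    env = dict(os.environ if base is None else base)
    env.update({
        "MKL_NUM_THREADS": str(threads),
        "OMP_NUM_THREADS": str(threads),
        "MKL_DYNAMIC": "FALSE",   # do not let MKL change the thread count at run time
        "OMP_DYNAMIC": "FALSE",   # same for the OpenMP runtime
        "MKL_CBWR": f"{isa},STRICT" if strict else isa,
    })
    return env


def run_app(command, threads, strict=False):
    """Run a command (a list of arguments) under the reproducibility environment."""
    return subprocess.run(command, env=cnr_env(threads, strict=strict), check=True)
```

A call such as `run_app(["./myapp", "-config", "config.ini"], threads=8, strict=True)` (with the full argument list from the earlier post) would then launch the application with these settings applied.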
Hi Rajesh,
thank you for your feedback and your help, which I really appreciate. Since the 8-thread and 1-thread runs gave you the same results, could you please share with us the files you obtained by running the application:
- img8.nii
- intensities8.nii.gz and intensities8.txt
- iout8.nii
All the best,
Mariuccio
Hi,
Could you please let us know whether your issue has been resolved?
In strict CNR mode, Intel® oneAPI Math Kernel Library provides bitwise reproducible results for a limited set of functions and code branches even when the number of threads changes.
For further reference please visit the link below:
Thanks
Rajesh.
Hi Rajesh,
unfortunately the proposed solution didn't work in our case: we still obtain different results on different machines. Here is the lscpu output for each machine:
First CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 94
Model name: Intel(R) Xeon(R) CPU E3-1505M v5 @ 2.80GHz
Stepping: 3
CPU MHz: 1600.126
CPU max MHz: 3700.0000
CPU min MHz: 800.0000
BogoMIPS: 5616.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
Second CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 112
On-line CPU(s) list: 0-111
Thread(s) per core: 2
Core(s) per socket: 28
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6238R CPU @ 2.20GHz
Stepping: 7
CPU MHz: 1015.942
CPU max MHz: 4000.0000
CPU min MHz: 1000.0000
BogoMIPS: 4400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 39424K
NUMA node0 CPU(s): 0-27,56-83
NUMA node1 CPU(s): 28-55,84-111
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities
Regards,
Mariuccio
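The two `Flags:` lines above can be compared mechanically to see which vector instruction sets the machines share. The sketch below is an illustration with made-up helper names (`simd_flags`, `compare_cpus`); it simply filters flag names by prefix and knows nothing about how MKL itself selects a code path.

```python
SIMD_PREFIXES = ("sse", "ssse", "avx", "fma")


def simd_flags(flags_line):
    """Extract SIMD-related flag names from an lscpu 'Flags:' line."""
    # Drop the leading "Flags:" label if present, then split on whitespace.
    text = flags_line.split(":", 1)[-1]
    return {flag for flag in text.split() if flag.startswith(SIMD_PREFIXES)}


def compare_cpus(flags_a, flags_b):
    """Report SIMD flags common to both CPUs and those unique to each."""
    a, b = simd_flags(flags_a), simd_flags(flags_b)
    return {
        "common": sorted(a & b),
        "only_first": sorted(a - b),
        "only_second": sorted(b - a),
    }
```

Applied to the two outputs above, `avx2` appears on both CPUs, while the `avx512*` flags appear only on the Xeon Gold 6238R, which is why pinning the code branch with MKL_CBWR=AVX2 matters when comparing these machines.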
Hi,
As per the Conditional Numerical Reproducibility (CNR) feature of MKL, to ensure that Intel MKL calls return the same results on every Intel CPU supporting AVX2 instructions, make sure that your application uses a fixed number of threads. Using CBWR (conditional bitwise reproducibility) provides bit-to-bit identical results when the number of OpenMP threads is the same from run to run. For this, set the following environment variables:
export MKL_NUM_THREADS=8
export OMP_NUM_THREADS=8
export MKL_DYNAMIC=FALSE
export OMP_DYNAMIC=FALSE
export MKL_ENABLE_INSTRUCTIONS=AVX2
export MKL_CBWR=AVX2
Please let us know if you want any more information.
There is a file attached to this post with the results.
Regards
Rajesh.
Hi Rajesh,
we set the variables as you suggested, but it doesn't work. We do not need more information; thank you for trying to help us.
All the best,
Mariuccio
Hi,
Since you don't need any additional assistance from Intel, we will no longer respond to this thread. Please start a new thread if you need any further information. Any further interaction in this thread will be considered community only.
Have a good day.
Regards
Rajesh
Hi Mariuccio,
I was facing the same problem a few days back, and the following articles helped me.
They are very useful.
https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onemkl.html#gs.5ms0qx
https://www2.cisl.ucar.edu/resources/software/math-kernel-library-mkl
I hope these articles help you.
