Difference between 1 thread and multiple threads for 3-D FFT MKL real->complex transform

vbashkardin · 01-31-2025

I have a simple MKL code that performs an in-place 3-D real-to-complex FFT and then transforms back. The results differ slightly between a run with 1 OpenMP thread and a run with multiple threads. There is no difference between 2 and 3 threads, or between any other multi-thread counts.

I am attaching a simple reproducer that stores the outputs in files and then compares them between 1 thread and multiple threads. For the 1-vs-2-thread comparison, numpy.allclose() fails, and the RMS of the difference is non-zero.

$ icpx -v
Intel(R) oneAPI DPC++/C++ Compiler 2024.1.0 (2024.1.0.20240308)

$ uname -a
Linux hostname 4.18.0-553.5.1.el8_10.x86_64 #1 SMP

$ head /proc/cpuinfo | grep "model name"
model name : Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz

On a non-Intel machine the same 1-vs-2-thread test passes, with no numerical difference between the outputs. The only difference on that machine is the CPU:

$ head /proc/cpuinfo | grep "model name"
model name : AMD EPYC 7302P 16-Core Processor

I would appreciate any guidance on eliminating that difference on the Intel processor, if possible.

Thank you.

Chao_Y_Intel · 02-05-2025

Hi,

What numerical difference did you get there?

I see only a very small difference in the test results here. The RMS looks fine:
RMS of difference: 1.5889534e-10

This computation uses single-precision floating point, whose fraction field is 23 bits, so it carries a precision of only about 1e-7 to 1e-6. A difference this small is within the expected rounding error.
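As a rough illustration of that precision figure, the sketch below derives the single-precision machine epsilon from the 23-bit fraction using only the Python standard library. The helper name to_float32 is made up for this example and is not part of the reproducer.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# With a 23-bit fraction, the gap between 1.0 and the next float32 is 2**-23.
eps32 = 2.0 ** -23
print(f"float32 epsilon: {eps32:.3e}")

# An increment well below epsilon is rounded away in single precision ...
lost = to_float32(1.0 + eps32 / 4) == 1.0
# ... while a full epsilon is still representable.
kept = to_float32(1.0 + eps32) > 1.0
print("quarter-epsilon lost:", lost, "| full epsilon kept:", kept)
```

The epsilon comes out at about 1.19e-7, which matches the lower end of the range quoted above.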

With a minor change to the np.allclose() call (a looser absolute tolerance), the test passes:

print("Allclose: ", np.allclose(array1, array2, atol=1e-07))
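To show why the absolute tolerance matters, here is a minimal pure-Python sketch of the element-wise criterion that numpy.allclose documents, |a - b| <= atol + rtol * |b|, with its default rtol=1e-05 and atol=1e-08. The function name is_close and the sample values are hypothetical and are not taken from the reproducer's data.

```python
def is_close(a: float, b: float, rtol: float = 1e-5, atol: float = 1e-8) -> bool:
    """Element-wise test documented for numpy.allclose / numpy.isclose."""
    return abs(a - b) <= atol + rtol * abs(b)

# Hypothetical pair: the reference value is zero, so the relative term
# rtol * |b| vanishes and only atol decides the outcome.
a, b = 3.0e-8, 0.0

default_result = is_close(a, b)             # atol=1e-8 is tighter than float32 rounding
relaxed_result = is_close(a, b, atol=1e-7)  # atol on the order of float32 epsilon
print("default atol:", default_result)
print("atol=1e-07: ", relaxed_result)
```

For values near zero the relative term contributes almost nothing, so the default atol of 1e-8 asks for more than single precision can deliver, while atol=1e-07 does not.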

thanks,
Chao