Intel Python 2019 has poor FFT2 performance (7980XE, 128GB, Win10). FFT2 performance is about 50% of the Anaconda stock numpy.
Stock numpy takes 560ms to finish the test:
import numpy as np
a = np.random.random([8000, 8000])
Intel python 2019 takes 980ms:
import numpy as np
import mkl_fft as intel
a = np.random.random([8000, 8000])
%timeit b = intel.fft2(a)
7980XE CPU usage is only around 30% when using MKL, while stock numpy can push the 7980XE to 50%.
Please fix this issue. I am giving up on Intel Python for now.
Thanks for taking the time to bring this to our attention.
mkl_fft in the Anaconda distribution is based on the same sources as mkl_fft in the Intel (R) Distribution for Python.
The difference is in the compiler used to build the native extension. Anaconda's binaries are built with the GNU C Compiler, while Intel's are built with the Intel (R) C Compiler.
In your example, the input is a real matrix, and fft2 is computing the full complex FFT.
Under the hood, mkl_fft uses the MKL real-domain FFT, which only produces harmonics up to the Nyquist frequency, so the algebraically dependent harmonics need to be reconstructed by complex conjugation and rearrangement. This reconstruction loop is performed sequentially via an n-tuple iterator, and it appears that GCC produces better-performing native code for it.
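To illustrate the reconstruction step described above, here is a minimal sketch in plain NumPy. It is not mkl_fft's actual C loop; it only shows the same idea using np.fft.rfft2 for the half spectrum and Hermitian symmetry for the rest. The function name fft2_via_rfft2 is hypothetical.

```python
import numpy as np

def fft2_via_rfft2(a):
    """Full 2-D FFT of a real matrix, rebuilt from the real-domain half spectrum.

    Illustrative only: mkl_fft does this in a compiled loop, not with NumPy indexing.
    """
    n0, n1 = a.shape
    half = np.fft.rfft2(a)                      # harmonics up to Nyquist: shape (n0, n1//2 + 1)
    full = np.empty((n0, n1), dtype=half.dtype)
    full[:, : n1 // 2 + 1] = half               # independent harmonics are copied as-is

    # Dependent harmonics follow from Hermitian symmetry of a real input:
    #   X[k0, k1] = conj(X[(-k0) % n0, (-k1) % n1])
    k1 = np.arange(n1 // 2 + 1, n1)             # columns missing from the half spectrum
    k0 = (-np.arange(n0)) % n0                  # mirrored row indices
    full[:, k1] = np.conj(half[k0][:, n1 - k1])
    return full

a = np.random.random([6, 8])
print(np.allclose(fft2_via_rfft2(a), np.fft.fft2(a)))
```

The conjugate-and-rearrange step touches roughly half of the output array, which is why a slow sequential loop there can dominate the total time for a large real input.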
I intend to seek guidance from the compiler team. Meanwhile, the mkl_fft package from the Intel channel can be replaced with the one from either the conda-forge or the defaults channel.
Thank you Oleksandr!
One more observation:
mkl_fft only uses 25~30% of all threads on the 7980XE (fully using 9~10 cores on an 18-core CPU), but the GCC build can use 50~60% (essentially all cores). It seems to me that the possibilities could be:
I hope the above info can further help Intel engineers. The combination of the latest Intel CPU and compiler/software has always given us the very best performance; we are just not used to "Intel product not taking performance crown".