Intel® Distribution for Python*

Low MKL FFT2 performance on 7980XE (Intel Python 2019)



Intel Python 2019 has poor FFT2 performance on my 7980XE (128 GB RAM, Windows 10): FFT2 runs at about 50% of the speed of Anaconda's stock NumPy.

Stock NumPy takes 560 ms to finish the test:

import numpy as np
a = np.random.random([8000, 8000])
%timeit b = np.fft.fft2(a)

Intel Python 2019 takes 980 ms:

import numpy as np
import mkl_fft as intel
a = np.random.random([8000, 8000])
%timeit b = intel.fft2(a)

7980XE CPU usage is only around 30% when using MKL, while stock NumPy can push the 7980XE to 50%.

Please fix this issue; I am giving up on Intel Python for now.

2 Replies

Thanks for taking the time to bring this to our attention.

mkl_fft in the Anaconda distribution is built from the same sources as mkl_fft in the Intel(R) Distribution for Python.

The difference is the compiler used to build the native extension: Anaconda's binaries are built with the GNU C Compiler, while Intel's are built with the Intel(R) C Compiler.

In your example, the input is a real matrix, yet fft2 is asked to compute the full complex FFT.

Under the hood, mkl_fft uses the MKL real-domain FFT, which only produces harmonics up to the Nyquist frequency; the algebraically dependent harmonics must then be reconstructed by complex conjugation and rearrangement. This reconstruction loop runs sequentially over an n-tuple iterator, and GCC evidently produces better-performing native code for it.
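To illustrate the idea in pure NumPy (a sketch of the symmetry argument, not mkl_fft's actual implementation): `rfft2` returns only the non-redundant half of the spectrum for real input, and the remaining harmonics follow from Hermitian symmetry. The per-column reconstruction loop below is the kind of scalar-heavy code whose speed is compiler-sensitive.

```python
import numpy as np

# For real input, the real-domain FFT returns only harmonics up to
# Nyquist along the last axis; the rest follow from Hermitian symmetry:
#   F[k1, k2] = conj(F[-k1 mod M, -k2 mod N])
a = np.random.random((64, 48))
M, N = a.shape

half = np.fft.rfft2(a)                 # shape (M, N // 2 + 1)

full = np.empty((M, N), dtype=complex)
full[:, :N // 2 + 1] = half
rows = (-np.arange(M)) % M             # index of the conjugate row
for k2 in range(N // 2 + 1, N):        # the reconstruction loop
    full[:, k2] = np.conj(half[rows, N - k2])

# Matches the full complex FFT of the same input
assert np.allclose(full, np.fft.fft2(a))
```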

I intend to seek guidance from the compiler team; meanwhile, the mkl_fft package from the Intel channel can be swapped for the one from either the conda-forge or the defaults channels.
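For example, the channel swap might look like this (channel names as of this writing; the exact commands depend on your environment):

```shell
# Replace the Intel-channel build of mkl_fft with the defaults-channel build
conda remove mkl_fft
conda install -c defaults mkl_fft

# or use the conda-forge build instead:
# conda install -c conda-forge mkl_fft
```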


Thank you Oleksandr!

One more observation:

mkl_fft only uses 25~30% of all threads on the 7980XE (fully using 9~10 of its 18 cores), while the GCC build can use 50~60% (essentially all cores). It seems to me the possibilities could be:

  • mkl_fft may be RAM-bandwidth limited by quad-channel DDR4
  • mkl_fft's thread dispatcher cannot fully use all 18 cores
  • Low performance may relate to AVX-512:
    • I have also tried MATLAB 2018b; its fft2 performance is likewise about 2× that of Intel Python, and it can push 7980XE usage to 50~60%, like the Anaconda Python. MATLAB 2018b uses an older version of MKL that does not support AVX-512.
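One quick way to probe the thread-dispatch hypothesis (a sketch; it assumes your NumPy/mkl_fft links against MKL, which reads `MKL_NUM_THREADS` at load time, so the variable must be set before the import):

```python
import os
import time

# MKL reads MKL_NUM_THREADS when the library is first loaded,
# so set it before importing numpy / mkl_fft.
# Compare separate runs with e.g. "9" vs "18" to see whether the
# transform actually scales past half the cores.
os.environ["MKL_NUM_THREADS"] = "18"

import numpy as np

a = np.random.random((2000, 2000))
t0 = time.perf_counter()
b = np.fft.fft2(a)
elapsed_ms = (time.perf_counter() - t0) * 1e3
print(f"fft2 on {a.shape}: {elapsed_ms:.1f} ms")
```

If the 9-thread and 18-thread runs take about the same time, the bottleneck is more likely memory bandwidth than the thread dispatcher.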

I hope the above info can further help Intel's engineers. The combination of the latest Intel CPU and compiler/software has always given us the very best performance; we are just not used to an Intel product not taking the performance crown.


