MKL FFT inside OpenMP loop (MKL 2018)

AndrewC — Sun, 25 Mar 2018 20:27:04 GMT

I have an OpenMP loop:

#pragma omp parallel for
for (int i = 0; i < n; i++) {
    // routine that calls MKL FFT
}

The threaded performance is pretty abysmal: on an 8-core machine, just over 1 core is being used.

What is surprising is that Intel VTune Amplifier shows the time being spent in DftiCommitDescriptor, not in the actual computation.

Function / Call Stack   CPU Time   Module       Function (Full)        Source File   Start Address
DftiCommitDescriptor    83.7%      mkl_rt.dll   DftiCommitDescriptor   [Unknown]     0x180a45b68
.....
DftiComputeForward      0.5%       mkl_rt.dll   DftiComputeForward     [Unknown]     0x180a45f10

Are there any suggested best practices here? Typically the FFT function will be called with the same data length, say 10K-20K.

Ying_H_Intel — Mon, 26 Mar 2018 06:31:03 GMT

Hi vasci_

How do you link MKL, and is the FFT 1D or 2D? With the Intel compiler and OpenMP, the MKL code inside the parallel loop is supposed to run serially.

Since you say "typically the FFT function will be called with the same data length", you may try moving DftiCommitDescriptor out of the OpenMP for loop and see if there is any improvement.
If needed, please submit a reproducer to the Online Service Center: https://supporttickets.intel.com/?lang=en-US

Moreover, the MKL user guides include several sample codes that use FFT inside OpenMP parallel regions, for your reference:
https://software.intel.com/en-us/mkl-developer-reference-c-examples-of-using-openmp-threading-for-fft-computation

Best Regards,

Ying

AndrewC — Mon, 26 Mar 2018 14:00:30 GMT

This is a 1D computation. After changing the code to serial, DftiCommitDescriptor was still the bottleneck. Clearly, moving DftiCommitDescriptor outside of the loop would help; it is just a surprising result that DftiCommitDescriptor is so 'expensive'.

AndrewC — Fri, 13 Apr 2018 20:13:00 GMT

Related to this, I have found that after updating to MKL 2018 Update 2, calling a 1-D FFT inside an OpenMP parallel for loop produces a memory access exception.

The crash is deep inside mkl_avx.dll.

Removing the OpenMP directives stops the issue.

AndrewC — Mon, 23 Apr 2018 02:38:59 GMT

Following up on my previous post: this is a typical crash that occurs in Update 2 but not in Update 1.

Basically I have to remove all FFT calls within OpenMP parallel regions to avoid these crashes.

CS = 0033 FS = 0053 GS = 002b

Stack Trace (from fault):
[ 0] 0x000007fed1e21b2a mkl_avx.dll+09181994 mkl_dft_avx_dft_zdscal+00000842
[ 1] 0x000007fed1fbcd9f mkl_avx.dll+10866079 mkl_sparse_d_csr_ctd_sv_ker_i8_avx+00578415
[ 2] 0x000007fed1e234c8 mkl_avx.dll+09188552 mkl_dft_avx_dfti_create_node+00000488
[ 3] 0x000007fed1e23af9 mkl_avx.dll+09190137 mkl_dft_avx_dfti_create_sr1d+00000073
[ 4] 0x000007fee03d75d2 mkl_rt.dll+10909138 fftwf_sprint_plan+00001134
[ 5] 0x000007fee03bfe9a mkl_rt.dll+10813082 DftiCreateDescriptor_s_1d+00000366
....
[ 8] 0x000007fee5330ecc libiomp5md.dll+00593612 _kmp_invoke_microtask+00000140
[ 9] 0x000007fee52fc37d libiomp5md.dll+00377725 _kmp_acquire_nested_drdpa_lock+00037421
[ 10] 0x000007fee52fb494 libiomp5md.dll+00373908 _kmp_acquire_nested_drdpa_lock+00033604
[ 11] 0x000007fee5332e87 libiomp5md.dll+00601735 _kmp_launch_worker+00000407
[ 12] 0x00000000773859cd C:\Windows\system32\kernel32.dll+00088525 BaseThreadInitThunk+00000013
[ 13] 0x00000000775ba561 C:\Windows\SYSTEM32\ntdll.dll+00173409 RtlUserThreadStart+00000033