Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

IMKL AVX2 DFT slower in 2019.0.4 than in 2017.0.3

nicpac22
Beginner

Hi,

I've recently upgraded from IMKL 2017.0.3 (compiler: icpc 2017u4) to IMKL 2019.0.4 (compiler: icpc 2019u4) and noticed that one of my programs takes ~50% longer to run.  Using callgrind and perf top, I've traced the issue to the amount of time/cycles spent in DftiComputeBackward on a complex DFT.  The DFT size is small (8192), and in both cases the inverse DFT is called ~24 million times; however, in IMKL 2019 a significant amount of CPU time is spent in the following functions:

compute_colbatch_bwd

mkl_dft_avx2_coDFTColTwid_Compact_Bwd_v_16_s

mkl_dft_avx2_coDFTColBatch_Compact_Bwd_v_32_s

 

Whereas in IMKL 2017 the inverse DFT time is spent in:

mkl_dft_avx2_compute_bwd_s_c2c_1d_o

mkl_dft_avx2_xipps_inv_rev_32fc

mkl_dft_avx2_ippsDFTOutOrdInv_CToC_32fc

mkl_dft_avx2_ippsFFTInv_CToC_32fc

 

The program I'm running these DFTs in is relatively large; however, >70% of the program's cycles are spent on these calls.  I have attempted to reproduce the problem in a simple test program that just calls DftiComputeBackward repeatedly in a loop with similar parameters, but I am unable to reproduce the issue.  I was wondering if someone could shed some light on the differences in the underlying DFT functions and why the 2019 version would be spending so much time in compute_colbatch_bwd while this function does not even appear in the 2017 profile.  Any help would be appreciated.  For what it's worth, I am compiling on an AVX2 platform with -xHost and -O3.  My program uses ~1000 DFTI descriptors to compute the DFTs repeatedly for ~1000 different data channels.
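For reference, here is a minimal sketch of the kind of setup I mean, assuming out-of-place single-precision 1-D transforms and a simple round-robin over channels; the real program's placement, strides and scaling may differ:

#include <vector>
#include <complex>
#include "mkl_dfti.h"

// Reduced stand-in for the real application: one committed descriptor per
// channel, 1-D single-precision complex transforms of length 8192, and
// DftiComputeBackward called repeatedly.
int main()
{
    const MKL_LONG n = 8192;
    const int num_channels = 1000;

    std::vector<DFTI_DESCRIPTOR_HANDLE> handles(num_channels);  // value-initialized to NULL
    for (int c = 0; c < num_channels; ++c) {
        DftiCreateDescriptor(&handles[c], DFTI_SINGLE, DFTI_COMPLEX, 1, n);
        DftiSetValue(handles[c], DFTI_PLACEMENT, DFTI_NOT_INPLACE);   // assumption
        DftiCommitDescriptor(handles[c]);
    }

    std::vector<std::complex<float> > in(n), out(n);
    for (long iter = 0; iter < 240000; ++iter) {                      // stand-in for ~24M calls
        int c = iter % num_channels;
        DftiComputeBackward(handles[c], &in[0], &out[0]);
    }

    for (int c = 0; c < num_channels; ++c)
        DftiFreeDescriptor(&handles[c]);
    return 0;
}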

 

Thanks,

Nick

Gennady_F_Intel
Moderator

Nick, yes, we slightly redesigned the internal code, and that is why you may see a different call stack. We are actually not aware of the kind of regression you reported between versions 2017 and 2019. How could we reproduce the problem?

What CPU type are you running this code on?

How did you link this application?

Could you give us all the input parameters? (You can extract this data if you switch on MKL_VERBOSE mode; see the sketch after these questions.)

Could you show us the pipeline you use to call the FFT routine and how you measure the performance?
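As a sketch of the information that would help (assuming you can rebuild the application; mkl_get_version_string and mkl_verbose are MKL service routines from mkl.h, and the descriptor and buffers here are placeholders for whatever your program uses):

#include <cstdio>
#include "mkl.h"

// Report the exact MKL build in use and log the parameters of a single
// FFT call via MKL verbose mode.
void report_one_call(DFTI_DESCRIPTOR_HANDLE descriptor, void *in, void *out)
{
    char version[256];
    mkl_get_version_string(version, (int)sizeof(version));
    std::printf("%s\n", version);

    mkl_verbose(1);                              // enable verbose logging
    DftiComputeBackward(descriptor, in, out);    // one logged call is enough
    mkl_verbose(0);                              // disable it again
}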

 

nicpac22
Beginner

Hi Gennady,

 

Thanks for the reply.  I'm on an Intel E5-2697 v3 CPU, and my link flags are -liomp -lpthread -lm -lmkl_core -lmkl_intel_lp64 -lmkl_intel_thread -lsvml -limf.

 

The tricky thing is that I can only reproduce the issue in a large, complex program that runs in a software-defined radio framework similar to gnuradio.  I'm still trying to come up with a short, contrived program that simply calls the IMKL DFT routines in a loop and reproduces the issue.  My first attempt at this resulted in the 2019 and 2017 versions calling almost identical libraries and running at similar speeds.  I will keep trying to reproduce it so I can post a simple program that replicates the issue.  I have not yet found which input parameters or settings cause DftiComputeBackward to call mkl_dft_avx2_coDFTColTwid_Compact_Bwd_v_16_s, so I was hoping to find a way to force this behavior.
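In the meantime, a rough sketch of how I could dump the committed descriptor configuration and diff it between the 2017 and 2019 builds to rule out accidental differences (the parameters queried here are just a guess at the ones that matter):

#include <cstdio>
#include "mkl_dfti.h"

// Print the committed configuration of one descriptor so the two runs
// can be compared for differences in strides, batch size, placement, etc.
void dump_descriptor(DFTI_DESCRIPTOR_HANDLE h)
{
    MKL_LONG length = 0, batch = 0, distance = 0;
    MKL_LONG strides[2] = {0, 0};                 // 1-D case: {offset, stride}
    enum DFTI_CONFIG_VALUE precision, placement;

    DftiGetValue(h, DFTI_LENGTHS, &length);
    DftiGetValue(h, DFTI_PRECISION, &precision);
    DftiGetValue(h, DFTI_PLACEMENT, &placement);
    DftiGetValue(h, DFTI_NUMBER_OF_TRANSFORMS, &batch);
    DftiGetValue(h, DFTI_INPUT_DISTANCE, &distance);
    DftiGetValue(h, DFTI_INPUT_STRIDES, strides);

    std::printf("len=%ld prec=%d inplace=%d batch=%ld dist=%ld strides={%ld,%ld}\n",
                (long)length, (int)precision, (int)(placement == DFTI_INPLACE),
                (long)batch, (long)distance, (long)strides[0], (long)strides[1]);
}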

 

I'm measuring the performance by counting the program execution time in both CPU time and wall time.  I've also tried running my program on a continuous data stream at a set rate and monitoring the amount of CPU used.  In both cases, the 2019 version took more time/CPU to process the same amount of data.
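A rough sketch of that kind of measurement, in case it clarifies what I'm comparing (wall time via MKL's dsecnd(), CPU time via std::clock(); process_all_channels() is a placeholder for the real work loop):

#include <cstdio>
#include <ctime>
#include "mkl.h"

void process_all_channels();   // placeholder for the application's processing loop

// Measure one processing pass in both wall-clock time and CPU time.
void time_processing()
{
    double wall0 = dsecnd();
    std::clock_t cpu0 = std::clock();

    process_all_channels();

    double wall = dsecnd() - wall0;
    double cpu  = double(std::clock() - cpu0) / CLOCKS_PER_SEC;
    std::printf("wall: %.3f s, cpu: %.3f s\n", wall, cpu);
}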

 

Nick

Gennady_F_Intel
Moderator

Hi Nick,

When you are counting the execution time of the FFT, do you measure the execution time of the DftiCompute[Forward/Backward] call only, or of the whole FFT pipeline?

I am trying to understand how you measured the performance so that we can reproduce the problem on our side:

Version #1 (only the compute call is timed):

    DftiCreateDescriptor(...);
    DftiCommitDescriptor(...);
    start1 = dsecnd();
    DftiComputeForward(...);
    texec1 = dsecnd() - start1;
    DftiFreeDescriptor(...);

Version #2 (the whole pipeline is timed):

    start2 = dsecnd();
    DftiCreateDescriptor(...);
    DftiCommitDescriptor(...);
    DftiComputeForward(...);
    DftiFreeDescriptor(...);
    texec2 = dsecnd() - start2;
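For instance, a self-contained sketch of version #1 under assumptions that may not match your case (1-D single-precision complex transform of length 8192, out-of-place, averaged over many repetitions so the timer resolution does not matter):

#include <cstdio>
#include <vector>
#include <complex>
#include "mkl.h"

// Only DftiComputeForward is inside the timed region; descriptor creation,
// commit and a warm-up call are excluded.
int main()
{
    const MKL_LONG n = 8192;
    const int repetitions = 10000;

    DFTI_DESCRIPTOR_HANDLE h = NULL;
    DftiCreateDescriptor(&h, DFTI_SINGLE, DFTI_COMPLEX, 1, n);
    DftiSetValue(h, DFTI_PLACEMENT, DFTI_NOT_INPLACE);
    DftiCommitDescriptor(h);

    std::vector<std::complex<float> > in(n, std::complex<float>(1.0f, 0.0f)), out(n);

    DftiComputeForward(h, &in[0], &out[0]);               // warm-up, not timed

    double start = dsecnd();
    for (int i = 0; i < repetitions; ++i)
        DftiComputeForward(h, &in[0], &out[0]);
    double texec1 = (dsecnd() - start) / repetitions;

    std::printf("average DftiComputeForward time: %.3e s\n", texec1);

    DftiFreeDescriptor(&h);
    return 0;
}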

 

Gennady_F_Intel
Moderator

.. and one more tip for capturing the input parameters: MKL verbose mode. Using the MKL_VERBOSE environment variable may produce a huge log file when the FFT routine is called many times from a loop. To mitigate this, you can slightly update your code by inserting mkl_verbose(1) / mkl_verbose(0) calls around a single DftiComputeForward call.

The modified loop may look like this:

for (int i = 0; i < num_fft_calls; i++) {      // num_fft_calls: total number of FFT calls

    if (i == 0) {
        mkl_verbose(1);                        // log the parameters of the first call only
        DftiComputeForward(...);
        mkl_verbose(0);
    } else {
        DftiComputeForward(...);
    }

}

As a result, the verbose log file stays as small as possible, and you can share it with us.
