MKL DFTI problems inside a parallel region - MKL 10.3

butette · ‎07-18-2011

I recently upgraded from MKL 10.2 to 10.3 and noticed that the behavior of a DFTI call that sets up a descriptor inside an OpenMP parallel region has changed compared to MKL 10.2. Previously, if I had code like

!$omp parallel do ....
do ....
   call to Dfti descriptor setup
   call to Dfti compute
enddo

performance was acceptable. Now with MKL 10.3, there seems to be some sort of synchronization inside the descriptor setup phase, and performance drops dramatically, to the point that it behaves as if there were no parallelism at all, especially when the sizes are small (I am doing complex 3D FFTs). If I move the descriptor setup phase outside the OpenMP region, the performance is back to what it was with MKL 10.2, maybe a little better.

Has anybody else noticed this behavior?

IDZ_A_Intel · ‎07-18-2011

There have been a number of optimizations in different updates to MKL 10.3, so please let us know which update you're using.
You may find the following article on calling MKL FFTs from OpenMP-parallelized code useful: http://software.intel.com/en-us/articles/different-parallelization-techniques-and-intel-mkl-fft/

Ying_H_Intel · ‎07-19-2011

Hi Butette,

I recall there were similar reports before. The problem is mainly caused by the descriptor DFTI_HANDLE, which is used in all threads. So either, as you did, move the descriptor setup outside the OpenMP region,
or
try adding the following before DFTICOMMITDESCRIPTOR:
STATUS = DFTISETVALUE(DFTI_HANDLE, DFTI_NUMBER_OF_USER_THREADS, 4) ! if 4 threads are used; this depends on your number of CPUs, whether HT is on or off, etc.
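
For illustration, the shared-descriptor suggestion above could be sketched roughly as follows. This is only a sketch under assumptions: the program name, the transform size (16x16x16), the batch count, and the variable names are made up for the example, and it assumes the MKL_DFTI module (built from MKL's mkl_dfti.f90) is available at compile time. The descriptor is created and committed once, before the parallel region, and every thread then computes with it.

```fortran
program shared_descriptor_sketch
  ! Assumes the MKL_DFTI module (from MKL's mkl_dfti.f90) is on the module path.
  use MKL_DFTI
  use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Hypothetical sizes, chosen only for this example.
  integer, parameter :: n1 = 16, n2 = 16, n3 = 16, nbatch = 64
  type(DFTI_DESCRIPTOR), pointer :: hand
  complex(dp), allocatable :: x(:,:)
  integer :: status, ierr, i, nfail

  allocate(x(n1*n2*n3, nbatch))
  x = (1.0_dp, 0.0_dp)

  ! Descriptor setup happens once, outside the parallel region.
  status = DftiCreateDescriptor(hand, DFTI_DOUBLE, DFTI_COMPLEX, 3, [n1, n2, n3])
  ! Declare how many user threads will share the committed descriptor.
  if (status == 0) status = DftiSetValue(hand, DFTI_NUMBER_OF_USER_THREADS, &
                                         omp_get_max_threads())
  if (status == 0) status = DftiCommitDescriptor(hand)
  if (status /= 0) stop 'descriptor setup failed'

  nfail = 0
  !$omp parallel do private(ierr) reduction(+:nfail)
  do i = 1, nbatch
     ! In-place forward 3D FFT of one batch entry (default placement is in-place).
     ierr = DftiComputeForward(hand, x(:, i))
     if (ierr /= 0) nfail = nfail + 1
  end do
  !$omp end parallel do

  status = DftiFreeDescriptor(hand)
  print *, 'failed transforms: ', nfail
end program shared_descriptor_sketch
```

Whether this restores the 10.2 performance in a given 10.3 update is not something the sketch can show; it only illustrates where the setup calls sit relative to the parallel loop.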

Regards,
Ying H.

butette · ‎07-20-2011

Hi Ying,

Thanks for the feedback... I don't believe that is the problem, though. Something has clearly changed in this behavior between 10.2 and 10.3 (any update, even the latest one, 10.3 Update 4). As it stands, Example 2 in

http://software.intel.com/en-us/articles/different-parallelization-techniques-and-intel-mkl-fft/

will have horrible performance with 10.3 (any update), because that's exactly what I was doing before. The same code linked with 10.2 is probably 20-30 times faster than with 10.3 when the transform sizes are small. I have tried it with DFTI_NUMBER_OF_USER_THREADS set both to 1 and to the number of threads I am using; it doesn't matter.

Clearly there is a bug in the descriptor setup when it is inside a parallel region.

barragan_villanueva_ · ‎07-20-2011

Hi,

In general, DFTI descriptor setup should be done outside of the parallel region. This lets your application benefit from reusing the same descriptor from different threads via DFTI_NUMBER_OF_USER_THREADS.
But could you please share a small reproducer with us so that we can analyze the problem on our side?
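
A reproducer for the pattern described in this thread might look something like the sketch below. It is not the original poster's code: the program name, the 8x8x8 transform size, and the batch count are hypothetical, and it assumes the MKL_DFTI module (built from MKL's mkl_dfti.f90) is available. Each loop iteration creates, commits, uses, and frees its own descriptor inside the OpenMP parallel loop, and the whole loop is timed with omp_get_wtime so that the same source can be linked against different MKL versions and compared.

```fortran
program dfti_setup_in_parallel_loop
  ! Assumes the MKL_DFTI module (from MKL's mkl_dfti.f90) is on the module path.
  use MKL_DFTI
  use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Hypothetical small transform size and batch count for the test.
  integer, parameter :: n = 8, nbatch = 1000
  type(DFTI_DESCRIPTOR), pointer :: hand
  complex(dp), allocatable :: x(:,:)
  integer :: i, status, nfail
  real(dp) :: t0, t1

  allocate(x(n*n*n, nbatch))
  x = (1.0_dp, 0.0_dp)

  nfail = 0
  t0 = omp_get_wtime()
  !$omp parallel do private(hand, status) reduction(+:nfail)
  do i = 1, nbatch
     ! Descriptor setup inside the parallel loop: the pattern under test.
     status = DftiCreateDescriptor(hand, DFTI_DOUBLE, DFTI_COMPLEX, 3, [n, n, n])
     if (status == 0) status = DftiCommitDescriptor(hand)
     if (status == 0) status = DftiComputeForward(hand, x(:, i))
     if (status /= 0) nfail = nfail + 1
     status = DftiFreeDescriptor(hand)
  end do
  !$omp end parallel do
  t1 = omp_get_wtime()

  print *, 'threads:            ', omp_get_max_threads()
  print *, 'failed iterations:  ', nfail
  print *, 'elapsed time (s):   ', t1 - t0
end program dfti_setup_in_parallel_loop
```

Running it with OMP_NUM_THREADS set to 1 and then to the number of cores would show whether the loop scales with the thread count.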