Intel® oneAPI Math Kernel Library

MKL DFTI problems inside a parallel region - MKL 10.3

butette
Beginner
I recently upgraded from MKL 10.2 to 10.3 and noticed that the behavior of a DFTI call
that sets up a descriptor inside an OpenMP parallel region has changed compared to
MKL 10.2. Previously, with code like
!$omp parallel do ....
do ....
   call to Dfti descriptor setup
   call to Dfti compute
enddo
performance was acceptable. Now, with MKL 10.3, there seems to be some sort of synchronization
inside the descriptor setup phase, and performance drops dramatically, to the point that the code behaves
as if there were no parallelism at all, especially when the sizes are small (I am doing complex 3D FFTs). If I move
the descriptor setup phase outside the OpenMP region, performance returns to what it was
with MKL 10.2, maybe a little better.
Has anybody else noticed this behavior?
IDZ_A_Intel
Employee
There have been a number of optimizations in different updates to MKL 10.3 -- please let us know which update you're using.
You may find the following article on calling MKL FFTs from OpenMP-parallelized code useful: http://software.intel.com/en-us/articles/different-parallelization-techniques-and-intel-mkl-fft/
Ying_H_Intel
Employee
Hi Butette,

I recall similar reports before. The problem is mainly the descriptor DFTI_HANDLE, which is used in all threads. So either, as you did,
move the descriptor setup outside the OpenMP region,
or
try adding the following before DFTICOMMITDESCRIPTOR:
STATUS = DFTISETVALUE(DFTI_HANDLE, DFTI_NUMBER_OF_USER_THREADS, 4) ! if 4 threads are used; this depends on your number of CPUs, HT on/off, etc.

Regards,
Ying H.
butette
Beginner
Hi Ying,
Thanks for the feedback... I don't believe that is the problem, though. Something clearly has changed in this behavior between 10.2 and 10.3 (any update, even the latest one, 10.3 Update 4). As it stands, Example 2
will have horrible performance with 10.3 (any update), because that's exactly what I was doing before. The same code linked with 10.2 is probably 20-30 times faster than with 10.3 when the transform sizes are small. I have tried it with DFTI_NUMBER_OF_USER_THREADS set both to 1 and to the number of threads I am using... it doesn't matter.
Clearly there is a bug in the descriptor setup when it is inside a parallel region.
barragan_villanueva_
Valued Contributor I
Hi,

In general, the DFTI descriptor setup should be done outside of the parallel region. This lets your application benefit from reusing the same descriptor from different threads via DFTI_NUMBER_OF_USER_THREADS.
Could you please share a small reproducer so that we can analyze the problem on our side?
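
For illustration, here is a minimal sketch of the pattern suggested above: one descriptor is created and committed outside the parallel region and then shared by all OpenMP threads. It is not the original poster's code. The transform sizes, the batch count, and all variable names are assumptions made for the example, and it assumes the MKL Fortran DFTI module (MKL_DFTI) and OpenMP are available.

```fortran
! Sketch only: assumes the MKL Fortran DFTI module and OpenMP.
! Sizes, batch count, and variable names are illustrative, not from the thread.
program shared_descriptor_fft
  use MKL_DFTI
  use omp_lib
  implicit none

  integer, parameter :: n1 = 16, n2 = 16, n3 = 16  ! small complex 3D transform
  integer, parameter :: nbatch = 64                ! independent data sets
  type(DFTI_DESCRIPTOR), pointer :: handle
  complex(kind(1.0d0)), allocatable :: x(:,:)
  integer :: status, i, nthreads, nfail

  ! Each column of x holds one n1*n2*n3 data set, stored contiguously.
  allocate(x(n1*n2*n3, nbatch))
  x = (1.0d0, 0.0d0)

  nthreads = omp_get_max_threads()

  ! Create, configure, and commit the descriptor once, outside the loop.
  status = DftiCreateDescriptor(handle, DFTI_DOUBLE, DFTI_COMPLEX, &
                                3, (/ n1, n2, n3 /))
  if (status /= DFTI_NO_ERROR) stop 'DftiCreateDescriptor failed'

  ! Tell MKL how many user threads will share this descriptor.
  status = DftiSetValue(handle, DFTI_NUMBER_OF_USER_THREADS, nthreads)
  if (status /= DFTI_NO_ERROR) stop 'DftiSetValue failed'

  status = DftiCommitDescriptor(handle)
  if (status /= DFTI_NO_ERROR) stop 'DftiCommitDescriptor failed'

  ! Only the compute call runs inside the parallel region.
  nfail = 0
  !$omp parallel do private(status) reduction(+:nfail)
  do i = 1, nbatch
     status = DftiComputeForward(handle, x(:, i))  ! in-place transform
     if (status /= DFTI_NO_ERROR) nfail = nfail + 1
  end do
  !$omp end parallel do

  if (nfail > 0) print *, 'transforms that failed: ', nfail

  status = DftiFreeDescriptor(handle)
  deallocate(x)
end program shared_descriptor_fft
```

Timing this version against one that creates and commits the descriptor inside the loop, under both MKL 10.2 and 10.3, would be one way to build the small reproducer requested above.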