butette
Beginner
81 Views

MKL DFTI problems inside a parallel region - MKL 10.3

I recently upgraded from MKL 10.2 to 10.3 and noticed that the behavior of a DFTI
descriptor setup call inside an OpenMP parallel region has changed compared to
MKL 10.2. Previously, with code like
!$omp parallel do ....
do ....
   call to Dfti descriptor setup
   call to Dfti compute
enddo
performance was acceptable. With MKL 10.3, there seems to be some sort of synchronization
inside the descriptor setup phase, and performance drops dramatically, to the point that the code behaves
as if there were no parallelism at all, especially when the transform sizes are small (I am doing complex 3D FFTs). If I move
the descriptor setup phase outside the OpenMP region, performance returns to what it was
with MKL 10.2, maybe a little better.
Has anybody else noticed this behavior?
4 Replies
IDZ_A_Intel
Employee

There have been a number of optimizations in different updates to MKL 10.3 -- please let us know which update you're using.
You may find the following article on calling MKL FFTs from OpenMP-parallelized code useful: http://software.intel.com/en-us/articles/different-parallelization-techniques-and-intel-mkl-fft/
Ying_H_Intel
Employee

Hi Butette,

I recall there were similar reports before. The problem is mainly with the descriptor DFTI_HANDLE, which is used in all threads. So either, as you did,
move the descriptor setup outside the OpenMP region,
or
try adding the following before DFTICOMMITDESCRIPTOR:
STATUS = DFTISETVALUE(DFTI_HANDLE, DFTI_NUMBER_OF_USER_THREADS, 4) !!! if 4 threads are used; this depends on your number of CPUs, whether HT is on or off, etc.

Regards,
Ying H.
butette
Beginner

Hi Ying,
Thanks for the feedback... I don't believe that is the problem, though. Something has clearly changed in this behavior between 10.2 and 10.3 (any update, even the latest one, 10.3 Update 4). As it stands, Example 2 will have horrible performance with 10.3 (any update), because that is exactly what I was doing before. The same code linked with 10.2 is probably 20-30 times faster than with 10.3 when the transform sizes are small. I have tried it with DFTI_NUMBER_OF_USER_THREADS set both to 1 and to the number of threads I am using; it makes no difference.
Clearly there is a bug in the descriptor setup when it is inside a parallel region.
barragan_villanueva_
Valued Contributor I

Hi,

In general, DFTI descriptor setup should be done outside of the parallel region. This lets your application reuse the same descriptor from different threads via DFTI_NUMBER_OF_USER_THREADS, which should improve performance.
But could you please share a small reproducer with us so that we can analyze the problem on our side?
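
For reference, the pattern described above (one descriptor committed outside the parallel region and shared by all threads) might look roughly like the sketch below. It is only an illustration, not the original poster's code: the transform size, the batch count, and the array layout are made-up placeholders, and it assumes the MKL_DFTI module (from mkl_dfti.f90 in the MKL include directory) has been compiled and that the program is linked against MKL with OpenMP enabled.

```fortran
! Sketch only: a single committed DFTI descriptor shared by all OpenMP threads.
! Sizes and names below are hypothetical placeholders.
program shared_descriptor_fft
  use MKL_DFTI
  use omp_lib
  implicit none

  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n1 = 16, n2 = 16, n3 = 16   ! hypothetical 3D transform size
  integer, parameter :: nbatch = 64                  ! hypothetical number of independent FFTs
  type(DFTI_DESCRIPTOR), pointer :: desc
  complex(dp), allocatable :: x(:,:)
  integer :: status, i, nthreads

  ! Each column of x holds one n1*n2*n3 data set, stored contiguously.
  allocate(x(n1*n2*n3, nbatch))
  x = (1.0_dp, 0.0_dp)

  nthreads = omp_get_max_threads()

  ! Create, configure, and commit the descriptor once, outside the parallel region.
  status = DftiCreateDescriptor(desc, DFTI_DOUBLE, DFTI_COMPLEX, 3, (/ n1, n2, n3 /))
  if (status /= DFTI_NO_ERROR) stop 'DftiCreateDescriptor failed'

  ! Tell MKL how many user threads will share this descriptor.
  status = DftiSetValue(desc, DFTI_NUMBER_OF_USER_THREADS, nthreads)
  if (status /= DFTI_NO_ERROR) stop 'DftiSetValue failed'

  status = DftiCommitDescriptor(desc)
  if (status /= DFTI_NO_ERROR) stop 'DftiCommitDescriptor failed'

  ! Only the compute call runs inside the parallel loop (in-place transform by default).
  !$omp parallel do private(i, status)
  do i = 1, nbatch
     status = DftiComputeForward(desc, x(:, i))
  end do
  !$omp end parallel do

  status = DftiFreeDescriptor(desc)
  deallocate(x)
end program shared_descriptor_fft
```

Because the descriptor is created and committed only once, any synchronization inside the setup phase is paid a single time rather than on every loop iteration.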