I recall there were similar reports before. The problem is mainly the descriptor DFTI_HANDLE, which is used in all threads. So either, as you did, move the descriptor setup outside the OpenMP region, or try adding the following before DFTICOMMITDESCRIPTOR:
STATUS = DFTISETVALUE(DFTI_HANDLE, DFTI_NUMBER_OF_USER_THREADS, 4) !!! if 4 threads are used; this depends on your number of CPUs, HT on/off, etc.
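To illustrate the two suggestions together, here is a minimal sketch of creating and committing one descriptor outside the OpenMP region and then sharing it across threads. The transform length, the thread count, and all variable names (`handle`, `x`, `nthreads`) are assumptions made for illustration, not taken from the original code, and the sketch has not been checked against any particular MKL release.

```fortran
! Sketch only: n, nthreads and all variable names are illustrative assumptions.
! The MKL_DFTI module comes from mkl_dfti.f90 in the MKL include directory.
program shared_descriptor
  use MKL_DFTI
  use omp_lib
  implicit none

  integer, parameter :: n = 64          ! transform length (assumed)
  integer, parameter :: nthreads = 4    ! OpenMP threads sharing the descriptor (assumed)
  type(DFTI_DESCRIPTOR), pointer :: handle
  complex(8) :: x(n, nthreads)          ! one column of data per thread
  integer :: status, tstatus, tid

  x = (1.0d0, 0.0d0)

  ! Create, configure and commit the descriptor once, outside the parallel region.
  status = DftiCreateDescriptor(handle, DFTI_DOUBLE, DFTI_COMPLEX, 1, n)
  if (status /= DFTI_NO_ERROR) stop 'DftiCreateDescriptor failed'

  ! Tell MKL how many user threads will call compute on this descriptor.
  status = DftiSetValue(handle, DFTI_NUMBER_OF_USER_THREADS, nthreads)
  if (status /= DFTI_NO_ERROR) stop 'DftiSetValue failed'

  status = DftiCommitDescriptor(handle)
  if (status /= DFTI_NO_ERROR) stop 'DftiCommitDescriptor failed'

  !$omp parallel num_threads(nthreads) private(tid, tstatus)
  tid = omp_get_thread_num() + 1
  ! Each thread transforms its own column in place using the shared descriptor.
  tstatus = DftiComputeForward(handle, x(:, tid))
  if (tstatus /= DFTI_NO_ERROR) print *, 'thread', tid, 'forward FFT failed'
  !$omp end parallel

  status = DftiFreeDescriptor(handle)
end program shared_descriptor
```

The alternative of creating a separate descriptor inside each thread is the pattern the slowdown is being reported for, so the sketch deliberately avoids it.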
Thanks for the feedback... I don't believe that is the problem, though. Something has clearly changed regarding this behavior between 10.2 and 10.3 (any update, even the latest one, 10.3 Update 4). As it stands, Example 2 will have horrible performance with 10.3 (any update), because that's exactly what I was doing before. The same code linked with 10.2 is probably 20-30 times faster than with 10.3 when the transform sizes are small. I have tried it with DFTI_NUMBER_OF_USER_THREADS set both to 1 and to the number of threads I am using... it doesn't matter.
Clearly there is a bug in the descriptor setup when it is inside a parallel region.
In general, DFTI descriptor setup should be done outside of the parallel region. This lets your application benefit from reusing the same descriptor from different threads via DFTI_NUMBER_OF_USER_THREADS. That said, could you please share a small reproducer with us so that we can analyze the problem on our side?