DFTI - number of user threads.

t_clark · 09-17-2011

Hi all,

I'm trying to use DFTs in a threaded application. Many executions (10000+) each with the same size transform. So I'm using a common descriptor to prevent committing 10,000 times.

The use of a common descriptor is described in Intel's article here (case #4)...
http://software.intel.com/en-us/articles/different-parallelization-techniques-and-intel-mkl-fft/

...but I have a question based on the number of threads allowed.

The code (case #4) on the example page is either trivial or lazy - the actual number of OMP threads is the same as the maximum number of OMP threads, which in turn is the same as the number of FFTs executed.

So, my FFTs are contained in an OMP parallel DO loop which executes its contents 10,000 times on (say) an 8 core machine.

Maybe I want to reserve a core or two for other functions. My OMP_MAX_NUM_THREADS environment variable will be 8, my OMP_NUM_THREADS will be 6.

The question is: "what value should DFTI_NUMBER_OF_USER_THREADS parameter take?"

Does it have to be 10,000 (one for each DFT execution), or does it have to be 6 (one for each simultaneously running thread)?

Alternatively, will it still work if I set it to 8 (the maximum physically allowed) whilst the actual number of threads which will execute is 6?

Thanks for any insight you can give!

Kind regards

Tom Clark

VipinKumar_E_Intel · 09-17-2011

Did you have a chance to look at the MKL reference manual on DFT threading? It has a more recent update, and we will be updating the article soon as well.

http://software.intel.com/sites/products/documentation/hpc/mkl/updates/10.3.5/mklman/appendices/mkl_appC_DFTMT.htm#appC-exC-22

http://software.intel.com/sites/products/documentation/hpc/mkl/updates/10.3.5/mklman/fft/fft_NumberOfThreads.htm

--Vipin

t_clark · 09-17-2011

Hi Vipin, thanks for responding.

Yes, I took a close look at both of those sources before posting.

In the first one, the number of OMP threads (integer nth in that code) is the same as the number of executions of the FFT within the DO loop (DO ith = 1,nth). So it can't answer my question.

The second link is mostly parsed from the article I cited earlier (or vice-versa ;) and is ambiguous about whether DFTI_NUMBER_OF_USER_THREADS must be set to the number of threads which can be executed simultaneously, or the total number of threads spawned during a parallel region.

Giving a more basic example... is the following pseudocode flawed?...

Thanks, and kind regards

Tom

nWorkers = omp_get_max_threads() ! =8 for my dual quad core system
nFFTs = 10000 ! typically - not always exactly

[... create descriptors ...]
status = DftiSetValue (descriptorHandle, DFTI_NUMBER_OF_USER_THREADS, nWorkers)
[... commit descriptors ...]

!$OMP PARALLEL DO SHARED(descriptorHandle, nFFTs, nWorkers) PRIVATE(someData)
DO fftCtr = 1,nFFTs ! <------- NOTE --- DIFFERENT TO nWorkers

call getSomeData(someData,fftCtr)

call fftTheData(descriptorHandle,someData)

ENDDO
!$OMP END PARALLEL DO

SUBROUTINE fftTheData(descriptorHandle, someData)
[... declarations]
status = DftiComputeForward (descriptorHandle, someData)
[... do stuff with the data and return]
END SUBROUTINE fftTheData


*Edit: corrected a bug in the pseudocode!

barragan_villanueva_ · 09-17-2011

Hi,

To limit the number of threads for the FFT domain, please use the MKL service function
mkl_domain_set_num_threads(<n>, MKL_FFT)
or set the environment variable accordingly:
MKL_DOMAIN_NUM_THREADS=MKL_FFT=<n>
See the MKL documentation for details.

t_clark · 09-19-2011

Hi,

Thanks again but that still isn't my point - the purpose of that is for setting the number of threads that the MKL libraries use internally (i.e. for each FFT to do, how many threads are used to compute it).

In my case, I'd set it to 1, but I'm linking against the sequential library anyway, so each FFT is forced to stay within its own single thread.

Lacking documentation, I've just been trying things out. For anyone else trying to answer this question, I think the answer is to set DFTI_NUMBER_OF_USER_THREADS to the same value as omp_get_max_threads().

I figure that the descriptors contain data reserved so that at an instant in time, any thread which is running has access to a private area of data. Thus I don't need to set DFTI_NUMBER_OF_USER_THREADS to 10000 (the total number of threads which will execute), but only to 8 (the max number of threads which can execute simultaneously).

However, I'm still really unsure on this, because I don't know what happens at the end (e.g. if I execute 9 FFTs with DFTI_NUMBER_OF_USER_THREADS set to 8, will the 9th one work reliably?).

If anyone knows the answer to this, I'd really appreciate confirmation - at the moment I'm just hoping for the best.

Cheers,

Tom
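
The distinction drawn above between loop iterations and threads can be checked without MKL at all. The following is a minimal sketch, assuming only an OpenMP-capable Fortran compiler; the program name and the used(:) array are illustrative and not from the thread. It runs a 10,000-iteration parallel DO loop and counts how many distinct threads actually took part.

```fortran
program count_threads
  use omp_lib
  implicit none
  integer, parameter :: nFFTs = 10000       ! loop iterations, as in the thread
  logical, allocatable :: used(:)           ! illustrative: one flag per thread id
  integer :: fftCtr, nWorkers

  nWorkers = omp_get_max_threads()          ! upper bound on the team size
  allocate(used(0:nWorkers-1))
  used = .false.

  !$OMP PARALLEL DO SHARED(used)
  do fftCtr = 1, nFFTs
     ! Each thread only ever writes its own slot, so there is no data race.
     used(omp_get_thread_num()) = .true.
  end do
  !$OMP END PARALLEL DO

  print *, 'iterations:            ', nFFTs
  print *, 'max threads:           ', nWorkers
  print *, 'threads that did work: ', count(used)
end program count_threads
```

However many iterations the loop has, the last number printed cannot exceed omp_get_max_threads(), which is the quantity the descriptor setting is concerned with.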

Evgueni_P_Intel · 09-19-2011

Hi t_clark,

If we go back to the original question "what value should DFTI_NUMBER_OF_USER_THREADS parameter take?", DFTI_NUMBER_OF_USER_THREADS should be set to the number of the OMP threads that your application uses to parallelize the OMP DO loop.

Another possibility for you would be to limit the number of threads for MKL FFTs as Victor suggests above, and do so-called multiple FFTs by setting DFTI_NUMBER_OF_TRANSFORMS. MKL will then do the parallelization itself.

Regarding your last question: yes, MKL guarantees correctness of the result if you do 9 FFTs in 8 threads.

E.
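
As a rough illustration of the "multiple FFTs" alternative mentioned above, here is a minimal sketch, assuming double-precision complex in-place 1D transforms stored back to back in one array. The sizes n and m, the program name, and the fill data are made up for the example, and the MKL_DFTI module has to be built from the mkl_dfti.f90 file shipped with MKL.

```fortran
program batched_fft
  use MKL_DFTI                               ! module built from MKL's mkl_dfti.f90
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1024             ! assumed transform length
  integer, parameter :: m = 100              ! assumed number of transforms per call
  type(DFTI_DESCRIPTOR), pointer :: desc
  complex(dp), allocatable :: x(:)
  integer :: status

  allocate(x(n*m))                           ! m signals of length n, contiguous
  x = (1.0_dp, 0.0_dp)                       ! stand-in data

  status = DftiCreateDescriptor(desc, DFTI_DOUBLE, DFTI_COMPLEX, 1, n)
  status = DftiSetValue(desc, DFTI_NUMBER_OF_TRANSFORMS, m)
  status = DftiSetValue(desc, DFTI_INPUT_DISTANCE, n)    ! stride between signals
  status = DftiSetValue(desc, DFTI_OUTPUT_DISTANCE, n)
  status = DftiCommitDescriptor(desc)

  status = DftiComputeForward(desc, x)       ! one call transforms all m signals
  if (status /= 0) print *, 'DftiComputeForward failed, status = ', status

  status = DftiFreeDescriptor(desc)
end program batched_fft
```

Any parallelism here comes from MKL's own threading, so this variant only helps when linking against the threaded library rather than the sequential one.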

t_clark · 09-20-2011

Evgueni,

Thanks, that's answered my question completely. In my case, using OMP rather than setting the number of transforms > 1 is the best bet as it's not just the FFT that I'm parallelising - there's other work within the loop.

Now confident that my code is valid.

Thanks all, and kind regards

Tom
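
For later readers, the pattern confirmed in this thread might look roughly like the sketch below: one shared descriptor, DFTI_NUMBER_OF_USER_THREADS set to the OpenMP team size, and many more loop iterations than threads. The transform length n, the synthetic data that stands in for getSomeData, and the nErrors counter are assumptions made for the example and are not from the original pseudocode.

```fortran
program shared_descriptor_fft
  use MKL_DFTI                               ! module built from MKL's mkl_dfti.f90
  use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1024             ! assumed transform length
  integer, parameter :: nFFTs = 10000        ! far more iterations than threads
  type(DFTI_DESCRIPTOR), pointer :: descriptorHandle
  complex(dp) :: someData(n)
  integer :: status, nWorkers, fftCtr, nErrors

  nWorkers = omp_get_max_threads()           ! threads that will share the descriptor

  status = DftiCreateDescriptor(descriptorHandle, DFTI_DOUBLE, DFTI_COMPLEX, 1, n)
  status = DftiSetValue(descriptorHandle, DFTI_NUMBER_OF_USER_THREADS, nWorkers)
  status = DftiCommitDescriptor(descriptorHandle)    ! committed once, not 10,000 times

  nErrors = 0
  !$OMP PARALLEL DO SHARED(descriptorHandle) PRIVATE(someData, status) REDUCTION(+:nErrors)
  do fftCtr = 1, nFFTs
     ! Stand-in for getSomeData: fill the thread-private buffer.
     someData = cmplx(real(fftCtr, dp), 0.0_dp, kind=dp)
     ! In-place forward transform using the shared, committed descriptor.
     status = DftiComputeForward(descriptorHandle, someData)
     if (status /= 0) nErrors = nErrors + 1
  end do
  !$OMP END PARALLEL DO

  print *, 'transforms reporting an error: ', nErrors
  status = DftiFreeDescriptor(descriptorHandle)
end program shared_descriptor_fft
```

If the loop is run with a NUM_THREADS clause or a smaller OMP_NUM_THREADS, nWorkers should follow the team size actually used for the loop, in line with the answer above.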