...but I have a question based on the number of threads allowed.
The code (case #4) on the example page is either trivial or lazy - the actual number of OMP threads is the same as the maximum number of OMP threads, which in turn is the same as the number of FFTs executed.
So, my FFTs are contained in an OMP parallel DO loop which executes its contents 10,000 times on (say) an 8 core machine.
Maybe I want to reserve a core or two for other functions. My OMP_MAX_NUM_THREADS environment variable will be 8, my OMP_NUM_THREADS will be 6.
The question is: "what value should DFTI_NUMBER_OF_USER_THREADS parameter take?"
Does it have to be 10,000 (one for each DFT execution), or 6 (one for each simultaneously running thread)?
Alternatively, will it still work if I set it to 8 (the maximum physically allowed) whilst the actual number of threads which will execute is 6?
Yes, I took a close look at both of those sources before posting.
In the first one, the number of OMP threads (integer nth in that code) is the same as the number of executions of the FFT within the DO loop (Do ith = 1,nth), so it can't answer my question.
The second link is mostly parsed from the article I cited earlier (or vice-versa ;) and is ambiguous about whether DFTI_NUMBER_OF_USER_THREADS must be set to the number of threads which can be executed simultaneously, or the total number of threads spawned during a parallel region.
Giving a more basic example... is the following pseudocode flawed?...
nWorkers = omp_get_max_threads() ! =8 for my dual quad core system
nFFTs = 10000 ! typically - not always exactly
Thanks, and kind regards
Thanks again but that still isn't my point - the purpose of that is for setting the number of threads that the MKL libraries use internally (i.e. for each FFT to do, how many threads are used to compute it).
In my case, I'd set it to 1, but I'm linking against the sequential library anyway, so each FFT is forced to stay within its own single thread.
Lacking documentation, I've just been trying things out. For anyone else trying to answer this question, I think the answer is to set DFTI_NUMBER_OF_USER_THREADS to the same value as omp_get_max_threads().
I figure that the descriptors contain data reserved so that, at any instant in time, every running thread has access to a private area of data. Thus I don't need to set DFTI_NUMBER_OF_USER_THREADS to 10000 (the total number of FFT executions), but only to 8 (the maximum number of threads which can execute simultaneously).
However, I'm still really unsure on this - because I don't know what happens at the end (e.g. if I execute 9 FFTs with DFTI_NUMBER_OF_USER_THREADS set to 8, will the 9th one work reliably?)
If anyone knows the answer to this, I'd really appreciate confirmation - at the moment I'm just hoping for the best.
If we go back to the original question "what value should DFTI_NUMBER_OF_USER_THREADS parameter take?", DFTI_NUMBER_OF_USER_THREADS should be set to the number of the OMP threads that your application uses to parallelize the OMP DO loop.
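A minimal sketch of that setup, assuming a 1D in-place double-complex transform and a single descriptor shared by all OMP threads. The program name, the sizes n and nFFTs, and the input data are illustrative and not taken from the original code. It needs the MKL_DFTI module (mkl_dfti.f90 from the MKL include directory) and OpenMP enabled at compile time.

```fortran
program shared_descriptor_sketch
  use MKL_DFTI
  implicit none

  integer, parameter :: n = 256         ! transform length (illustrative)
  integer, parameter :: nFFTs = 10000   ! loop iterations, one FFT each
  integer, parameter :: nWorkers = 6    ! OMP threads that run the loop

  type(DFTI_DESCRIPTOR), pointer :: desc
  complex(kind=8), allocatable :: x(:,:)
  integer :: status, i, nfail

  allocate(x(n, nFFTs))
  x = (1.0d0, 0.0d0)                    ! dummy input data
  nfail = 0

  ! One descriptor, created and committed before the parallel region.
  ! A status of 0 means success (DFTI_NO_ERROR).
  status = DftiCreateDescriptor(desc, DFTI_DOUBLE, DFTI_COMPLEX, 1, n)
  if (status /= 0) stop 'DftiCreateDescriptor failed'

  ! Number of threads sharing the descriptor, not the number of FFTs.
  status = DftiSetValue(desc, DFTI_NUMBER_OF_USER_THREADS, nWorkers)
  if (status /= 0) stop 'DftiSetValue failed'

  status = DftiCommitDescriptor(desc)
  if (status /= 0) stop 'DftiCommitDescriptor failed'

  ! 10000 iterations are shared out among nWorkers threads; each
  ! iteration transforms its own column of x in place.
  !$omp parallel do num_threads(nWorkers) private(status) reduction(+:nfail)
  do i = 1, nFFTs
     status = DftiComputeForward(desc, x(:, i))
     if (status /= 0) nfail = nfail + 1
  end do
  !$omp end parallel do

  print *, 'failed transforms: ', nfail

  status = DftiFreeDescriptor(desc)
end program shared_descriptor_sketch
```

The value passed to DFTI_NUMBER_OF_USER_THREADS here matches the num_threads clause on the loop, which is the team size rather than the iteration count.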
Another possibility for you would be to limit the number of threads for MKL FFTs as Victor suggests above, and do so-called multiple FFTs -- set DFTI_NUMBER_OF_TRANSFORMS. MKL will do parallelization.
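A rough sketch of the multiple-transform variant, assuming the inputs are packed back to back in one 1D array so that consecutive transforms are n elements apart. The program name and sizes are illustrative. For MKL to parallelize the batch internally, the program would have to be linked against the threaded MKL library rather than the sequential one.

```fortran
program batched_fft_sketch
  use MKL_DFTI
  implicit none

  integer, parameter :: n = 256         ! transform length (illustrative)
  integer, parameter :: nFFTs = 10000   ! number of transforms in the batch

  type(DFTI_DESCRIPTOR), pointer :: desc
  complex(kind=8), allocatable :: x(:)
  integer :: status

  allocate(x(n*nFFTs))
  x = (1.0d0, 0.0d0)                    ! dummy input data

  status = DftiCreateDescriptor(desc, DFTI_DOUBLE, DFTI_COMPLEX, 1, n)
  if (status /= 0) stop 'DftiCreateDescriptor failed'

  ! One call computes nFFTs transforms; the distances give the offset
  ! (in elements) between the first elements of consecutive transforms.
  status = DftiSetValue(desc, DFTI_NUMBER_OF_TRANSFORMS, nFFTs)
  status = DftiSetValue(desc, DFTI_INPUT_DISTANCE, n)
  status = DftiSetValue(desc, DFTI_OUTPUT_DISTANCE, n)

  status = DftiCommitDescriptor(desc)
  if (status /= 0) stop 'DftiCommitDescriptor failed'

  ! In-place forward transform of the whole batch, no OMP loop needed.
  status = DftiComputeForward(desc, x)
  if (status /= 0) stop 'DftiComputeForward failed'

  status = DftiFreeDescriptor(desc)
end program batched_fft_sketch
```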
Regarding your last question: yes, MKL guarantees correctness of the result if you do 9 FFTs in 8 threads.
Thanks, that's answered my question completely. In my case, using OMP rather than setting the number of transforms > 1 is the best bet as it's not just the FFT that I'm parallelising - there's other work within the loop.