Dear Intel Forum,
I am trying to run the MKL DFTI functions with OpenMP and have not been able to get the anticipated speed-up on my computer, which has 24 threads. For some reason, performance plateaus
at 10 threads, so adding more threads gives me no additional speed-up even though my problem is extremely large. I have a very long complex vector that contains many thousands
of small groups that need to be FFT'd. The vector looks like,
A = [a1_1 a2_1 a3_1 a4_1 b1_1 b2_1 b3_1 ... a1_2 a2_2 a3_2 a4_2 b1_2 b2_2 b3_2 ...]
The values of, say, a1_1, a2_1, ... differ from a1_2, a2_2, ..., but the group lengths and the separation between groups are constant over the length of the vector A.
I have tried computing the FFTs two ways. My goal is to FFT all the "a" groups, then all the "b" groups, and so on. Case 1 uses all Nthread threads inside each FFT call; Case 2 creates each descriptor with one thread and computes the FFTs in parallel with an OpenMP loop.
The pertinent part of my codes look like the following,
INTEGER, PARAMETER :: M1 = ...  ! number of group sizes
TYPE(ptr__DFTI) :: ptr_fft(M1)  ! derived type wrapping TYPE(DFTI_DESCRIPTOR), POINTER :: ptr_fft
INTEGER :: howmany
INTEGER :: Nelem(M1)
INTEGER :: mm
INTEGER :: idist
INTEGER :: Nthread
INTEGER :: ier
idist   = ...                   ! constant distance between each group of FFTs
howmany = 4000
Nelem   = [3, 5, 11, 20, ...]   ! M1 values
do mm = 1,M1
   ptr_fft(mm)%ptr_fft => null()
   ier = DftiCreateDescriptor(ptr_fft(mm)%ptr_fft, DFTI_DOUBLE, DFTI_COMPLEX, 1, Nelem(mm))
   ier = DftiSetValue(ptr_fft(mm)%ptr_fft, DFTI_NUMBER_OF_TRANSFORMS, howmany)
   ier = DftiSetValue(ptr_fft(mm)%ptr_fft, DFTI_INPUT_DISTANCE, idist)
   ier = DftiSetValue(ptr_fft(mm)%ptr_fft, DFTI_NUMBER_OF_USER_THREADS, Nthread)
   ier = DftiCommitDescriptor(ptr_fft(mm)%ptr_fft)
end do
Case 1: I run the code as follows ("ib" is an integer vector giving the starting location in A of the first group
of values for each transform size, i.e., those with suffix _1):
do i = 1,M1
   ier = DftiComputeForward(ptr_fft(i)%ptr_fft, A(ib(i):))  ! in-place FFT takes a single data argument
end do
Case 2: This is almost identical to Case 1, but Nthread is set to 1 when each Dfti pointer is created, and I
evaluate the FFTs in an OpenMP loop:
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(ithread, i, ier)
!DEC$ IF DEFINED (_OPENMP)
ithread = OMP_get_thread_num() + 1
!DEC$ ELSE
ithread = 1
!DEC$ ENDIF
!$OMP DO SCHEDULE(static)
do i = 1,M1
   ier = DftiComputeForward(ptr_fft(i)%ptr_fft, A(ib(i):))  ! in-place FFT
end do
!$OMP END DO
!$OMP END PARALLEL
Case 1 is a little faster than Case 2, but even then it does not give me the speed-ups I was
anticipating given the extremely large number of FFTs I am trying to compute.
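For reference, these comparisons can be timed with omp_get_wtime (a sketch only; wrap whichever of the two loops is being measured):

```fortran
use omp_lib, only: omp_get_wtime
real(8) :: t0, t1

t0 = omp_get_wtime()
! ... Case 1 or Case 2 loop over DftiComputeForward goes here ...
t1 = omp_get_wtime()
print *, 'elapsed (s):', t1 - t0
```

Wall-clock time (not CPU time) is what matters when comparing thread counts, which is why omp_get_wtime is used rather than cpu_time.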
Have I coded this up correctly? I have read the online documentation but have not found any examples exactly like mine. I have not been able to figure out why I am not getting faster results.
I hope I have been clear on my explanation. Any help would be greatly appreciated.
Thank you very much.
Is there a limit or rule of thumb for optimal Dfti performance given the number of threads, the length of each transform, and the number of transforms? Is it better to set the number of threads to 1 when building the pointer and compute the Dfti in a parallel loop, or to throw all the threads at the Dfti and wrap a non-parallelized loop around the call that computes the FFTs?
Thank you so much for your help.
BTW, I should add that I am running on Windows 10 64-bit, MSVS 2017, Intel Visual Fortran 2020, MKL 2020.
We would ask about the hardware. Does 24 threads mean 12 cores? Is it NUMA? What's the L3 cache size?
I am running two different-sized problems on two different computers and getting the same result on both: as the number of threads is increased, we see no large speed-up. For the small problem, I see about a factor-of-2 speed-up using 6 threads, even though the problem is large enough that more than 6 threads should make a difference. On the large problem running on the big computer, we see no additional speed-up past 10 threads, and just as on the small computer, the problem is large enough that it should be able to use many more than 10 threads.
The data you asked for:
1) small computer: 6 cores, L3 cache: 20 MB, NUMA = no, 12 threads
2) large computer: 24 cores, L3 cache: 60 MB, NUMA = yes, 48 threads
Thank you again. Really looking forward to hearing from you again soon!
- Our general suggestion is to call the batched API (DFTI_NUMBER_OF_TRANSFORMS, as you are already doing) rather than parallelizing multiple FFTs yourself.
- Threading: MKL internally estimates the number of threads that will be most effective, and we don't always use the total number of threads available. In particular, since FFT is memory-bound, running more than one thread per core does not increase performance. If the machine has 6 cores, we aren't running more than 6 threads, so a plateau beyond nthreads = ncores is definitely expected.
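To make that concrete, one way to cap MKL at one thread per physical core is the MKL service routine mkl_set_num_threads (a sketch; the divide-by-2 assumes two hyper-threads per core, which matches both machines described in this thread):

```fortran
program cap_mkl_threads
   use omp_lib, only: omp_get_num_procs
   implicit none
   integer :: nlogical, ncores

   ! omp_get_num_procs() reports logical processors; with two
   ! hyper-threads per core (12 threads / 6 cores, 48 / 24 above),
   ! the physical core count is half of that.
   nlogical = omp_get_num_procs()
   ncores   = nlogical / 2

   ! MKL service routine: cap internal threading at one thread per
   ! physical core, since memory-bound FFTs gain nothing from
   ! hyper-threads.
   call mkl_set_num_threads(ncores)
end program cap_mkl_threads
```

On systems where the hyper-threading factor differs, the core count should be taken from the hardware topology rather than this halving heuristic.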
- L3 cache: We also don't expect perfect scaling for FFT because it's memory-bound. If the total work (FFT size * batch size) is so large that it doesn't fit in L3, a performance profile that starts leveling off before nthreads = ncores isn't surprising: the compute resources aren't the bottleneck.
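To put rough numbers on that, using the howmany = 4000 and Nelem values from the original post (16 bytes per double-complex element; the idist value below is hypothetical, since the real one was not given), the footprint of one batched call is driven by the group spacing, not just the transform length:

```fortran
program working_set
   implicit none
   integer(8) :: howmany, nmax, idist, packed, strided

   howmany = 4000     ! transforms per batch (from the post)
   nmax    = 20       ! largest Nelem shown in the post
   idist   = 100      ! hypothetical group-to-group distance in A

   ! If the 4000 groups were packed contiguously:
   packed  = 16_8 * howmany * nmax    ! 1 280 000 B, ~1.3 MB
   ! With DFTI_INPUT_DISTANCE = idist, one batched call sweeps
   ! the whole span of A that it covers:
   strided = 16_8 * howmany * idist   ! 6 400 000 B, ~6.4 MB

   print *, packed, strided
end program working_set
```

The larger the real idist, the more the strided footprint grows toward (and past) the 20 MB / 60 MB L3 sizes reported above, even though the useful data per batch stays near 1.3 MB.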
Thank you all for the help! So it appears that despite setting DFTI_NUMBER_OF_USER_THREADS equal to the maximum number of threads, MKL decides how many threads to actually use for a given problem. On our small problem, it appears the maximum speed-up was achieved using only 2 of the 6 threads.
I ran a small test problem, and it also showed that our speed drops significantly with a stride > 1 ... the lesson here is to keep the data contiguous!
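For anyone landing on this thread, a sketch of what "keep the data contiguous" can look like here (hypothetical names; it gathers the strided groups of one size into a contiguous work buffer, runs one batched in-place FFT with unit group spacing, then scatters the results back):

```fortran
! Hypothetical sketch: pack, transform, unpack one group size i.
! n = Nelem(i); desc is assumed committed with
! DFTI_NUMBER_OF_TRANSFORMS = howmany and DFTI_INPUT_DISTANCE = n.
subroutine fft_packed(desc, A, ib_i, idist, n, howmany, ier)
   use mkl_dfti
   implicit none
   type(DFTI_DESCRIPTOR), pointer :: desc
   complex(8), intent(inout) :: A(:)
   integer, intent(in)  :: ib_i, idist, n, howmany
   integer, intent(out) :: ier
   complex(8) :: work(n*howmany)
   integer :: k, s

   do k = 1, howmany                     ! gather strided groups
      s = ib_i + (k-1)*idist
      work((k-1)*n+1 : k*n) = A(s : s+n-1)
   end do

   ier = DftiComputeForward(desc, work)  ! contiguous batched FFT

   do k = 1, howmany                     ! scatter results back
      s = ib_i + (k-1)*idist
      A(s : s+n-1) = work((k-1)*n+1 : k*n)
   end do
end subroutine fft_packed
```

Whether the extra pack/unpack copies pay off depends on how large idist is relative to n, so this is worth benchmarking against the strided descriptor rather than assuming it wins.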
Thank you again for all of your help.
Thanks for the update. This issue is being closed and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.