MKL Dfti using OpenMP

david47 · ‎02-01-2021

Dear Intel Forum,

I am trying to run the MKL DFTI functions using OpenMP and have not been able to get the anticipated speed up with my computer that has 24 threads. For some reason, I get a plateau
at 10 threads such that an increase of threads gives me no additional speed up even though my problem is extremely large. I have a very long complex vector which contains many thousands
of small groups that need to be FFT'ed. The vector looks like,

A = [a1_1 a2_1 a3_1 a4_1 b1_1 b2_1 b3_1 ... a1_2 a2_2 a3_2 a4_2 b1_2 b2_2 b3_2 ...]

The values of say, a1_1, a2-1, ... are different than a1_2, a2_2, ... but their lengths and separation are constant over the length of the vector A.

I have tried computing the FFT's two ways. My goal is to FFT all the "a" groups followed by the "b" groups and so on. I first tried using all Nthreads to solve each FFT and then tried using one thread for all groups but using an OpenMP loop to compute the FFT's in parallel.

The pertinent part of my codes look like the following,

Case 1:
TYPE(ptr__DFTI) :: ptr_fft(:) ! created a structure for an array of TYPE(DFTI_DESCRIPTOR), POINTER :: ptr_fft
INTEGER :: M1
INTEGER :: howmany
INTEGER :: Nelem(M1)
INTEGER :: mm
INTEGER :: idist
INTEGER :: Nthread

call OMP_set_num_threads(Nthread)

idist = ! constant distance between each groupd of FFT's
howmany = 4000
Nelem = [3, 5, 11, 20, ...] ! M1 values
do mm = 1,M1
ptr => null()
ier = DftiCreateDescriptor(ptr_fft(mm)%ptr_fft, DFTI_DOUBLE, DFTI_COMPLEX, 1, Nelem(mm))
ier = DftiSetValue(ptr_fft(mm)%ptr_fft, DFTI_NUMBER_OF_TRANSFORMS, howmany)
ier = DftiSetValue(ptr_fft(mm)%ptr_fft, DFTI_INPUT_DISTANCE, idist)
ier = DftiSetValue(ptr_fft(mm)%ptr_fft, DFTI_NUMBER_OF_USER_THREADS, Nthread)
ier = DftiCommitDescriptor(ptr_fft(mm)%ptr_fft)
end do

I run the code as ("ib" is an integer vector that gives the starting location in A for the first group
of values, i.e., those having _1),

do i = 1,M1
ier = DftiComputeForward(ptr_fft(i)%ptr_fft, A(ib(i):), A(ib(i):)) ! in place FFT
end do

Case 2
This is almost identical to Case 1 but Nthread is set to 1 to create each Dfti pointer but I
evaluate the FFT's in an OpenMP loop,

!$OMP PARALLEL DEFAULT(SHARED) PRIVATE (ithread, i, ier)
!DEC$ IF DEFINED (_OPENMP)
ithread = OMP_get_thread_num()+1
!DEC$ ELSE
ithread = 1
!DEC$ ENDIF
!$OMP DO SCHEDULE(static)
do i = 1,M1
ier = DftiComputeForward(ptr_fft(i)%ptr_fft, ethetaphi_work(ib_dfti(i):))
end do
!$OMP END DO
!$OMP END PARALLEL

Case 1 is a little faster than Case 2 but even then, it does not give me the speed ups I was
anticipating given the extremely large number of FFT's I was trying to compute.

Have I coded this up correctly? I have read the online documentation but have not found any examples exactly like mine. I have not been able to figure out why I am not getting faster results.

I hope I have been clear on my explanation. Any help would be greatly appreciated.
Thank you very much.

David

AbhishekD_Intel · ‎02-02-2021

Hi David,

Thanks for reaching out to us.

We are forwarding your issue to the SME.

Warm Regards,

Abhishek

Gennady_F_Intel · ‎02-02-2021

It seems you call batch mode ( size of the batch is 4000).

What is the max size of Nelem(mm)?

It looks like the memory bandwidth problem on this specifiec system.

david47 · ‎02-02-2021

Dear Gennady_F_Intel,

Thank you so much for getting back to me!

My problem is a series of calls to Dfti where I start with a large number of small vectors to FFT (ie., Nelem(i) is a small number, perhaps somewhere in [3 100] but how many can be many thousands, perhaps up to say 10000. For the small problem I am running now (on a smaller computer having 6 threads), I am doing 2304, 25 length FFT's up to 136, 51 length FFT's.

Is there a limit or "rule-of-thumb" for the optimal performance of Dfti given the number of threads, length of each transform, and the number of transforms? Is it better to set the number of threads to 1 when building the pointer and compute the Dfti in a parallel loop or throw all the threads at the Dfti and wrap a non-parallelized loop around the call to compute the FFT's?

Thanks you so much for your help.

BTW, I should add that I am running on Windows 10 64-bit, MSVS 2017, Intel Visual Fortran 2020, MKL 2020.

Sincerely,

David

Gennady_F_Intel · ‎02-05-2021

David, I asked FFT developers to address these questions and hope they will help you.

david47 · ‎02-05-2021

Hi Gennady_F_Intel!

Thank you for your help. I hope they can help soon as we are in need of a big FFT speed up using our large multi-threaded machine to solve a current problem.

Take good care.

Sincerely,

David

Gennady_F_Intel · ‎02-08-2021

We would ask about the hardware. Do 24 threads mean 12 cores? Is it NUMA? What’s the L3 cache size?

david47 · ‎02-08-2021

Hi Gennady_F_Intel!

I am running 2 different sized problems on 2 different computers but getting the same result on both, i.e., as the number of threads is increased we see no large speed up in the answer. For the small problem, I see about a factor of 2 speed up using 6 threads. The problem size is large enough that it should be able to use even more than 6 threads to make a difference. On the large problem running on the big computer, we see no additional speed up past 10 threads and just as on the small computer, the large problem is large enough that it should be able to use much more than 10 threads.

The data you asked for:

1) small computer: 6 cores, L3 cache: 20 MB, NUMA = no, 12 threads

2) large computer: 24 cores, L3 cache: 60 MB, NUMA = yes, 48 threads

Thank you again. really looking forward to hearing from you again soon!

Sincerely,

David

Gennady_F_Intel · ‎02-10-2021

The general suggestion to call the batched API rather than parallelize multiple FFT’s yourself.
Threading: MKL internally estimates the number of threads that will be most effective, and we don’t always use the total number of threads available. In particular, since FFT is memory bound, it doesn’t increase performance to run more than one thread per core. If his machine is 6 cores, we aren’t running more than 6 threads. Definitely, we would expect a plateau beyond nthreads = ncores.
L3 cache: We also don’t expect perfect scaling for FFT because it’s memory bound. If the total work (FFT size * batch size) is so large that it doesn’t fit in L3, a performance profile that starts leveling off before nthreads = ncores isn’t so surprising. The compute resources aren’t the bottleneck.

david47 · ‎02-10-2021

Thank you all for the help! So, it appears that despite setting DFTI_NUMBER_OF_USER_THREADS to be equal to the maximum number of threads, MKL will make a decision on how many of these threads to use to solve the given problem. On our small problem, it appears that the maximum speed up was achieved using only 2 out of 6 threads.

I ran a small test problem and it also showed that our speed significantly reduced using a stride > 1 ... the lesson here is keep the data contiguous!

Thank you again for all of your help.

Sincerely,

David

Gennady_F_Intel · ‎02-25-2021

Thanks for an update. The issue is closing and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.