Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

MKL Dfti using OpenMP

david47
Beginner
1,281 Views

Dear Intel Forum,

I am trying to run the MKL DFTI functions using OpenMP and have not been able to get the anticipated speed up with my computer that has 24 threads.  For some reason, I get a plateau
at 10 threads such that an increase of threads gives me no additional speed up even though my problem is extremely large.   I have a very long complex vector which contains many thousands
of small groups that need to be FFT'ed. The vector looks like,


A = [a1_1 a2_1 a3_1 a4_1 b1_1 b2_1 b3_1 ... a1_2 a2_2 a3_2 a4_2 b1_2 b2_2 b3_2 ...]


The values of say, a1_1, a2-1, ... are different than a1_2, a2_2, ... but their lengths and separation are constant over the length of the vector A.

I have tried computing the FFT's two ways. My goal is to FFT all the "a" groups followed by the "b" groups and so on. I first tried using all Nthreads to solve each FFT and then tried using one thread for all groups but using an OpenMP loop to compute the FFT's in parallel. 
 
The pertinent part of my codes look like the following,

Case 1:
TYPE(ptr__DFTI) :: ptr_fft(:) ! created a structure for an array of TYPE(DFTI_DESCRIPTOR), POINTER :: ptr_fft
INTEGER :: M1
INTEGER :: howmany
INTEGER :: Nelem(M1)
INTEGER :: mm
INTEGER :: idist
INTEGER :: Nthread

call OMP_set_num_threads(Nthread)

idist = ! constant distance between each groupd of FFT's
howmany = 4000
Nelem = [3, 5, 11, 20, ...] ! M1 values
do mm = 1,M1
ptr => null()
ier = DftiCreateDescriptor(ptr_fft(mm)%ptr_fft, DFTI_DOUBLE, DFTI_COMPLEX, 1, Nelem(mm))
ier = DftiSetValue(ptr_fft(mm)%ptr_fft, DFTI_NUMBER_OF_TRANSFORMS, howmany)
ier = DftiSetValue(ptr_fft(mm)%ptr_fft, DFTI_INPUT_DISTANCE, idist)
ier = DftiSetValue(ptr_fft(mm)%ptr_fft, DFTI_NUMBER_OF_USER_THREADS, Nthread)
ier = DftiCommitDescriptor(ptr_fft(mm)%ptr_fft)
end do


I run the code as ("ib" is an integer vector that gives the starting location in A for the first group
of values, i.e., those having _1),

do i = 1,M1
ier = DftiComputeForward(ptr_fft(i)%ptr_fft, A(ib(i):), A(ib(i):)) ! in place FFT
end do

Case 2
This is almost identical to Case 1 but Nthread is set to 1 to create each Dfti pointer but I
evaluate the FFT's in an OpenMP loop,

!$OMP PARALLEL DEFAULT(SHARED) PRIVATE (ithread, i, ier)
!DEC$ IF DEFINED (_OPENMP)
ithread = OMP_get_thread_num()+1
!DEC$ ELSE
ithread = 1
!DEC$ ENDIF
!$OMP DO SCHEDULE(static)
do i = 1,M1
ier = DftiComputeForward(ptr_fft(i)%ptr_fft, ethetaphi_work(ib_dfti(i):))
end do
!$OMP END DO
!$OMP END PARALLEL

Case 1 is a little faster than Case 2 but even then, it does not give me the speed ups I was
anticipating given the extremely large number of FFT's I was trying to compute.

Have I coded this up correctly? I have read the online documentation but have not found any examples exactly like mine.  I have not been able to figure out why I am not getting faster results.


I hope I have been clear on my explanation. Any help would be greatly appreciated.
Thank you very much.

David

0 Kudos
10 Replies
AbhishekD_Intel
Moderator
1,264 Views

Hi David,


Thanks for reaching out to us.

We are forwarding your issue to the SME.


Warm Regards,

Abhishek


0 Kudos
Gennady_F_Intel
Moderator
1,256 Views

It seems you call batch mode ( size of the batch is 4000).

What is the max size of Nelem(mm)?

It looks like the memory bandwidth problem on this specifiec system.



0 Kudos
david47
Beginner
1,247 Views

Dear Gennady_F_Intel,

 

Thank you so much for getting back to me!
 
My problem is a series of calls to Dfti  where I start with a large number of small vectors to FFT (ie., Nelem(i) is a small number, perhaps somewhere in [3 100] but how many can be many thousands, perhaps up to say 10000.  For the small problem I am running now (on a smaller computer having 6 threads), I am doing 2304, 25 length FFT's up to 136, 51 length FFT's.

Is there a limit or "rule-of-thumb" for the optimal performance of Dfti given the number of threads, length of each transform, and the number of transforms?  Is it better to set the number of threads to 1 when building the pointer and compute the Dfti in a parallel loop or throw all the threads at the Dfti and wrap a non-parallelized loop around the call to compute the FFT's?

Thanks you so much for your help.

BTW, I should add that I am running on Windows 10 64-bit, MSVS 2017, Intel Visual Fortran 2020, MKL 2020.

Sincerely,

David

 

0 Kudos
Gennady_F_Intel
Moderator
1,179 Views

David, I asked FFT developers to address these questions and hope they will help you.


0 Kudos
david47
Beginner
1,172 Views

Hi Gennady_F_Intel!

Thank you for your help.  I hope they can help soon as we are in need of a big FFT speed up using our large multi-threaded machine to solve a current problem.

Take good care.

Sincerely,

David

0 Kudos
Gennady_F_Intel
Moderator
1,156 Views

We would ask about the hardware. Do 24 threads mean 12 cores? Is it NUMA? What’s the L3 cache size?

0 Kudos
david47
Beginner
1,152 Views

Hi Gennady_F_Intel!

I am running 2 different sized problems on 2 different computers but getting the same result on both, i.e., as the number of threads is increased we see no large speed up in the answer.  For the small problem, I see about a factor of 2 speed up using 6 threads.  The problem size is large enough that it should be able to use even more than 6 threads to make a difference.  On the large problem running on the big computer, we see no additional speed up past 10 threads and just as on the small computer, the large problem is large enough that it should be able to use much more than 10 threads.

The data you asked for:

1) small computer: 6 cores, L3 cache: 20 MB, NUMA = no, 12 threads

2) large computer: 24 cores, L3 cache: 60 MB, NUMA = yes, 48 threads

Thank you again.  really looking forward to hearing from you again soon!

Sincerely,

David

0 Kudos
Gennady_F_Intel
Moderator
1,138 Views
  • The general suggestion to call the batched API rather than parallelize multiple FFT’s yourself.
  • Threading: MKL internally estimates the number of threads that will be most effective, and we don’t always use the total number of threads available. In particular,  since FFT is memory bound, it doesn’t increase performance to run more than one thread per core. If his machine is 6 cores, we aren’t running more than 6 threads. Definitely, we would expect a plateau beyond nthreads = ncores.
  • L3 cache: We also don’t expect perfect scaling for FFT because it’s memory bound. If the total work (FFT size * batch size) is so large that it doesn’t fit in L3, a performance profile that starts leveling off before nthreads = ncores isn’t so surprising. The compute resources aren’t the bottleneck.

0 Kudos
david47
Beginner
1,133 Views

Thank you all for the help!  So, it appears that despite setting DFTI_NUMBER_OF_USER_THREADS to be equal to the maximum number of threads, MKL will make a decision on how many of these threads to use to solve the given problem. On our small problem, it appears that the maximum speed up was achieved using only 2 out of 6 threads. 

I ran a small test problem and it also showed that our speed significantly reduced using a stride  > 1 ... the lesson here is keep the data contiguous! 

Thank you again for all of your help.

Sincerely,

David

 

0 Kudos
Gennady_F_Intel
Moderator
1,077 Views

Thanks for an update. The issue is closing and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.



0 Kudos
Reply