- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Dear Intel Forum,

I am trying to run the MKL DFTI functions using OpenMP and have not been able to get the anticipated speed up with my computer that has 24 threads. For some reason, I get a plateau

at 10 threads such that an increase of threads gives me no additional speed up even though my problem is extremely large. I have a very long complex vector which contains many thousands

of small groups that need to be FFT'ed. The vector looks like,

A = [a1_1 a2_1 a3_1 a4_1 b1_1 b2_1 b3_1 ... a1_2 a2_2 a3_2 a4_2 b1_2 b2_2 b3_2 ...]

The values of say, a1_1, a2-1, ... are different than a1_2, a2_2, ... but their lengths and separation are constant over the length of the vector A.

I have tried computing the FFT's two ways. My goal is to FFT all the "a" groups followed by the "b" groups and so on. I first tried using all Nthreads to solve each FFT and then tried using one thread for all groups but using an OpenMP loop to compute the FFT's in parallel.

The pertinent part of my codes look like the following,

Case 1:

TYPE(ptr__DFTI) :: ptr_fft(:) ! created a structure for an array of TYPE(DFTI_DESCRIPTOR), POINTER :: ptr_fft

INTEGER :: M1

INTEGER :: howmany

INTEGER :: Nelem(M1)

INTEGER :: mm

INTEGER :: idist

INTEGER :: Nthread

call OMP_set_num_threads(Nthread)

idist = ! constant distance between each groupd of FFT's

howmany = 4000

Nelem = [3, 5, 11, 20, ...] ! M1 values

do mm = 1,M1

ptr => null()

ier = DftiCreateDescriptor(ptr_fft(mm)%ptr_fft, DFTI_DOUBLE, DFTI_COMPLEX, 1, Nelem(mm))

ier = DftiSetValue(ptr_fft(mm)%ptr_fft, DFTI_NUMBER_OF_TRANSFORMS, howmany)

ier = DftiSetValue(ptr_fft(mm)%ptr_fft, DFTI_INPUT_DISTANCE, idist)

ier = DftiSetValue(ptr_fft(mm)%ptr_fft, DFTI_NUMBER_OF_USER_THREADS, Nthread)

ier = DftiCommitDescriptor(ptr_fft(mm)%ptr_fft)

end do

I run the code as ("ib" is an integer vector that gives the starting location in A for the first group

of values, i.e., those having _1),

do i = 1,M1

ier = DftiComputeForward(ptr_fft(i)%ptr_fft, A(ib(i):), A(ib(i):)) ! in place FFT

end do

Case 2

This is almost identical to Case 1 but Nthread is set to 1 to create each Dfti pointer but I

evaluate the FFT's in an OpenMP loop,

!$OMP PARALLEL DEFAULT(SHARED) PRIVATE (ithread, i, ier)

!DEC$ IF DEFINED (_OPENMP)

ithread = OMP_get_thread_num()+1

!DEC$ ELSE

ithread = 1

!DEC$ ENDIF

!$OMP DO SCHEDULE(static)

do i = 1,M1

ier = DftiComputeForward(ptr_fft(i)%ptr_fft, ethetaphi_work(ib_dfti(i):))

end do

!$OMP END DO

!$OMP END PARALLEL

Case 1 is a little faster than Case 2 but even then, it does not give me the speed ups I was

anticipating given the extremely large number of FFT's I was trying to compute.

Have I coded this up correctly? I have read the online documentation but have not found any examples exactly like mine. I have not been able to figure out why I am not getting faster results.

I hope I have been clear on my explanation. Any help would be greatly appreciated.

Thank you very much.

David

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi David,

Thanks for reaching out to us.

We are forwarding your issue to the SME.

Warm Regards,

Abhishek

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

It seems you call batch mode ( size of the batch is 4000).

What is the max size of Nelem(mm)?

It looks like the memory bandwidth problem on this specifiec system.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Dear Gennady_F_Intel,

Is there a limit or "rule-of-thumb" for the optimal performance of Dfti given the number of threads, length of each transform, and the number of transforms? Is it better to set the number of threads to 1 when building the pointer and compute the Dfti in a parallel loop or throw all the threads at the Dfti and wrap a non-parallelized loop around the call to compute the FFT's?

Thanks you so much for your help.

BTW, I should add that I am running on Windows 10 64-bit, MSVS 2017, Intel Visual Fortran 2020, MKL 2020.

Sincerely,

David

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

David, I asked FFT developers to address these questions and hope they will help you.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi Gennady_F_Intel!

Thank you for your help. I hope they can help soon as we are in need of a big FFT speed up using our large multi-threaded machine to solve a current problem.

Take good care.

Sincerely,

David

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

We would ask about the hardware. Do 24 threads mean 12 cores? Is it NUMA? What’s the L3 cache size?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi Gennady_F_Intel!

I am running 2 different sized problems on 2 different computers but getting the same result on both, i.e., as the number of threads is increased we see no large speed up in the answer. For the small problem, I see about a factor of 2 speed up using 6 threads. The problem size is large enough that it should be able to use even more than 6 threads to make a difference. On the large problem running on the big computer, we see no additional speed up past 10 threads and just as on the small computer, the large problem is large enough that it should be able to use much more than 10 threads.

The data you asked for:

1) small computer: 6 cores, L3 cache: 20 MB, NUMA = no, 12 threads

2) large computer: 24 cores, L3 cache: 60 MB, NUMA = yes, 48 threads

Thank you again. really looking forward to hearing from you again soon!

Sincerely,

David

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

- The general suggestion to call the batched API rather than parallelize multiple FFT’s yourself.
- Threading: MKL internally estimates the number of threads that will be most effective, and we don’t always use the total number of threads available. In particular, since FFT is memory bound, it doesn’t increase performance to run more than one thread per core. If his machine is 6 cores, we aren’t running more than 6 threads. Definitely, we would expect a plateau beyond nthreads = ncores.
- L3 cache: We also don’t expect perfect scaling for FFT because it’s memory bound. If the total work (FFT size * batch size) is so large that it doesn’t fit in L3, a performance profile that starts leveling off before nthreads = ncores isn’t so surprising. The compute resources aren’t the bottleneck.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Thank you all for the help! So, it appears that despite setting DFTI_NUMBER_OF_USER_THREADS to be equal to the maximum number of threads, MKL will make a decision on how many of these threads to use to solve the given problem. On our small problem, it appears that the maximum speed up was achieved using only 2 out of 6 threads.

I ran a small test problem and it also showed that our speed significantly reduced using a stride > 1 ... the lesson here is keep the data contiguous!

Thank you again for all of your help.

Sincerely,

David

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Thanks for an update. The issue is closing and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page