topic FFT parallelization comparison in Intel® oneAPI Math Kernel Library

FFT parallelization comparison

Marshall__Michael_B — Wed, 03 Mar 2010 14:35:15 GMT

Has anyone done a comparison of the different parallelization techniques for 1D FFTs?

I'd like to know which method is generally faster: internal threading or user threading?

Currently my application is single-threaded but improving the FFT time would warrant the added complexity of handling our own threads for this app.
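
To make the two options concrete: "user threading" is read here as the application creating its own threads, each running a complete sequential transform on its own data, whereas "internal threading" means the library splits a single transform across threads. The sketch below only illustrates the user-threading structure. The `fft` function is a plain radix-2 stand-in and `transform_batch` is a hypothetical helper; neither is MKL code. In CPython the global interpreter lock prevents a real speedup, so this shows how the work is divided, not the performance.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor


def fft(x):
    """Recursive radix-2 forward FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def transform_batch(signals, workers=4):
    """User threading: each worker runs whole, independent transforms."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves the input order of the signals.
        return list(pool.map(fft, signals))


if __name__ == "__main__":
    batch = [[1, 0, 0, 0], [1, 1, 1, 1]]
    for spectrum in transform_batch(batch, workers=2):
        print([round(abs(v), 6) for v in spectrum])
```

This layout only pays off when there are several independent transforms to run. A single large transform can only be sped up by threading inside the transform itself.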

Gennady_F_Intel — Wed, 03 Mar 2010 15:53:17 GMT

First of all, please look at the list of threaded functions in the version of MKL that you are using. You can find this list in the MKL User's Guide (see chapter 6, "Using Intel MKL Parallelism").

There are some restrictions on this functionality, e.g., for the latest MKL 10.2 Update 4:

1D real-to-complex and complex-to-real transforms are not threaded.

1D complex-to-complex transforms using split-complex layout are not threaded.

Prime-size complex-to-complex 1D transforms are not threaded.

etc.
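
The three restrictions quoted above can be written as a quick pre-check. This is a minimal sketch, assuming only the items listed for MKL 10.2 Update 4. The function and parameter names are hypothetical and not part of any MKL API. Because the list ends with "etc.", a `False` result does not guarantee that the transform is threaded.

```python
def is_prime(n):
    """Trial-division primality test, adequate for transform sizes."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def listed_as_unthreaded(n, domain="complex", split_complex=False):
    """True if a 1D transform matches one of the three quoted restrictions.

    domain: "real" covers both real-to-complex and complex-to-real;
            "complex" means complex-to-complex.
    split_complex: True for the split-complex layout.
    """
    if domain == "real":
        return True   # 1D real-to-complex / complex-to-real
    if split_complex:
        return True   # 1D complex-to-complex, split-complex layout
    return is_prime(n)  # prime-size complex-to-complex


if __name__ == "__main__":
    print(listed_as_unthreaded(1024))                 # power of two
    print(listed_as_unthreaded(1021))                 # prime size
    print(listed_as_unthreaded(1024, domain="real"))  # real transform
```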

--Gennady

Marshall__Michael_B — Wed, 03 Mar 2010 17:48:23 GMT

I am using MKL 10.2.3; my specific use case is a one-dimensional complex-to-complex transform.

At the beginning of Chapter 6, the user guide says "FFT" is threaded. It does not mention any of the restrictions listed in your reply.

In previous versions the memory array given to the FFT needed to be a factor of 128 for best performance. Is this still the case?

Does running the transform using out-of-place versus in-place memory make a difference?

barragan_villanueva_ — Thu, 11 Mar 2010 10:02:42 GMT

Hi,

Gennady quoted that fragment with the limitations from the MKL 10.2.4 User's Guide.

As for memory alignment: 16, 128, or even page alignment should provide better performance, because the DFT kernels use vectorized code and because of compact page migrations (I mean DTLB misses).
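
As an illustration of what those alignments mean for a buffer address, here is a minimal sketch that over-allocates a buffer and picks the first suitably aligned offset inside it. It assumes the figures 16 and 128 are byte counts, which the post does not state. `aligned_buffer` and `is_aligned` are hypothetical helper names. The sketch only shows the address arithmetic and does not measure any performance effect.

```python
import ctypes
import mmap


def aligned_buffer(nbytes, alignment):
    """Return (raw, view): view is nbytes long and starts on an aligned address.

    raw is the over-allocated backing store that view points into.
    """
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    raw = ctypes.create_string_buffer(nbytes + alignment)  # over-allocate
    base = ctypes.addressof(raw)
    offset = (-base) % alignment  # bytes to skip to reach an aligned address
    view = (ctypes.c_char * nbytes).from_buffer(raw, offset)
    return raw, view


def is_aligned(obj, alignment):
    """True if the ctypes object starts on a multiple of alignment."""
    return ctypes.addressof(obj) % alignment == 0


if __name__ == "__main__":
    # Room for 1024 double-precision complex values (2 doubles each).
    nbytes = 1024 * 2 * ctypes.sizeof(ctypes.c_double)
    for alignment in (16, 128, mmap.PAGESIZE):
        raw, view = aligned_buffer(nbytes, alignment)
        print(alignment, is_aligned(view, alignment))
```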

As for out-of-place versus in-place 1D transforms: for small sizes the difference is insignificant (see the Gflops figures below for 1 thread):

Forward_DFT_C,     x210,    8.432,1th,1D,in-place
Forward_DFT_C,     x210,    8.502,1th,1D,out-of-place

Forward_DFT_C,     x504,    9.896,1th,1D,in-place
Forward_DFT_C,     x504,    9.885,1th,1D,out-of-place

but for large sizes (larger than the cache size) the difference will be significant:

Forward_DFT_C,     x3211264,    4.196,1th,1D,in-place
Forward_DFT_C,     x3211264,    4.312,1th,1D,out-of-place

Forward_DFT_C,     x6250000,    3.763,1th,1D,in-place
Forward_DFT_C,     x6250000,    3.837,1th,1D,out-of-place
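
To put numbers on the figures above, the following sketch parses the listed lines and computes the relative difference between out-of-place and in-place for each size. The field interpretation (second field is the transform size, third is the Gflops rate, last is the placement) is an assumption based on the surrounding text; the values are copied from the post unchanged.

```python
LOG = """\
Forward_DFT_C,     x210,    8.432,1th,1D,in-place
Forward_DFT_C,     x210,    8.502,1th,1D,out-of-place
Forward_DFT_C,     x504,    9.896,1th,1D,in-place
Forward_DFT_C,     x504,    9.885,1th,1D,out-of-place
Forward_DFT_C,     x3211264,    4.196,1th,1D,in-place
Forward_DFT_C,     x3211264,    4.312,1th,1D,out-of-place
Forward_DFT_C,     x6250000,    3.763,1th,1D,in-place
Forward_DFT_C,     x6250000,    3.837,1th,1D,out-of-place
"""


def parse(log):
    """Map size -> {placement: rate} from the comma-separated lines."""
    results = {}
    for line in log.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        size = int(fields[1].lstrip("x"))
        rate = float(fields[2])
        placement = fields[5]
        results.setdefault(size, {})[placement] = rate
    return results


def out_of_place_gain_percent(results):
    """Percent by which out-of-place exceeds in-place, per size."""
    return {
        size: (r["out-of-place"] - r["in-place"]) / r["in-place"] * 100.0
        for size, r in results.items()
    }


if __name__ == "__main__":
    for size, gain in out_of_place_gain_percent(parse(LOG)).items():
        print(f"{size:>8}: {gain:+.2f}%")
```

On these figures, out-of-place is about 0.8% faster at size 210 and about 0.1% slower at size 504, while at the two large sizes it is about 2.8% and 2.0% faster.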