Relative performance of real and complex FFTs in MKL

Van_Veen__Lennaert · ‎03-19-2024

I am writing a code that uses many FFTs (billions) of fairly long sequences (2^p for p in the range 17-20, i.e. 100,000 - 1,00,000). It is crucial to find the fastest way to execute those. I read a technical report:

https://epubs.stfc.ac.uk/manifestation/45434584/RAL-TR-2020-003.pdf

that claims that the execution time for complex-to-complex FFTs with MKL is about twice as LOW as that of real-to-real FFTs of the same length (in particular, for a length of around 1,000,000).

This seems counter-intuitive, as the complex-to-complex transform should require about twice the number of operations. In fact, one can process two real-to-real transforms in one complex-to-complex call. I can imagine the execution time being roughly equal if complex arithmetic is performed in parallel in the micro-architecture, but it would puzzle me if it is lower.
My first question is simply if this is true.
My second question is what settings this result depends on (e.g. mkl_num_threads, compiler optimization, storage scheme, flags to encourage SIMD processing, et cetera).

If this is true for -Ofast compiling and when allowing MKL to grab any number of threads (up to the number of physical cores) then I will commit to using complex-to-complex transforms.

Postscript: I found the following remark:

community.intel.com/t5/Software-Archive/How-to-get-peak-performnace-in-FFT/m-p/975037#M24838

from an Intel programmer, stating that more energy has been invested in optimizing the complex-to-complex FFTs, which might be a clue. However, it does not explain how or state that complex-to-complex is faster.

Relative performance of real and complex FFTs in MKL

Performance