- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I am writing a code that uses many FFTs (billions) of fairly long sequences (2^p for p in the range 17-20, i.e. 100,000 - 1,00,000). It is crucial to find the fastest way to execute those. I read a technical report:
https://epubs.stfc.ac.uk/manifestation/45434584/RAL-TR-2020-003.pdf
that claims that the execution time for complex-to-complex FFTs with MKL is about twice as LOW as that of real-to-real FFTs of the same length (in particular, for a length of around 1,000,000).
This seems counter-intuitive, as the complex-to-complex transform should require about twice the number of operations. In fact, one can process two real-to-real transforms in one complex-to-complex call. I can imagine the execution time being roughly equal if complex arithmetic is performed in parallel in the micro-architecture, but it would puzzle me if it is lower.
My first question is simply if this is true.
My second question is what settings this result depends on (e.g. mkl_num_threads, compiler optimization, storage scheme, flags to encourage SIMD processing, et cetera).
If this is true for -Ofast compiling and when allowing MKL to grab any number of threads (up to the number of physical cores) then I will commit to using complex-to-complex transforms.
Postscript: I found the following remark:
community.intel.com/t5/Software-Archive/How-to-get-peak-performnace-in-FFT/m-p/975037#M24838
from an Intel programmer, stating that more energy has been invested in optimizing the complex-to-complex FFTs, which might be a clue. However, it does not explain how or state that complex-to-complex is faster.
Link Copied
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page