<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Relative performance of real and complex FFTs in MKL in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Relative-performance-of-real-and-complex-FFTs-in-MKL/m-p/1581786#M35933</link>
    <description>&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;I am writing code that performs billions of FFTs of fairly long sequences (length 2^p for p in the range 17-20, i.e. roughly 100,000 to 1,000,000 points). It is crucial to find the fastest way to execute them. I read a technical report:&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;A class="" href="https://epubs.stfc.ac.uk/manifestation/45434584/RAL-TR-2020-003.pdf" target="_blank" rel="noopener nofollow noreferrer"&gt;&lt;SPAN class=""&gt;https://epubs.stfc.ac.uk/manifestation/45434584/RAL-TR-2020-003.pdf&lt;/SPAN&gt;&lt;/A&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;which claims that the execution time for complex-to-complex FFTs with MKL is about half that of real-to-real FFTs of the same length (in particular, for a length of around 1,000,000).&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;This seems counter-intuitive, as the complex-to-complex transform should require about twice the number of operations. In fact, two real-to-real transforms can be processed in a single complex-to-complex call. I can imagine the execution times being roughly equal if complex arithmetic is performed in parallel in the micro-architecture, but it would puzzle me if the complex transform were actually faster.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class=""&gt;My first question is simply whether this is true.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class=""&gt;My second question is which settings this result depends on (e.g. 
mkl_num_threads, compiler optimization, storage scheme, flags to encourage SIMD processing, et cetera).&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;If this holds when compiling with -Ofast and allowing MKL to use any number of threads (up to the number of physical cores), then I will commit to using complex-to-complex transforms.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;Postscript: I found the following remark:&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;community.intel.com/t5/Software-Archive/How-to-get-peak-performnace-in-FFT/m-p/975037#M24838&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;from an Intel programmer, stating that more effort has been invested in optimizing the complex-to-complex FFTs, which might be a clue. However, it neither explains how nor states that complex-to-complex is faster.&lt;/SPAN&gt;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
    <pubDate>Tue, 19 Mar 2024 14:30:26 GMT</pubDate>
    <dc:creator>Van_Veen__Lennaert</dc:creator>
    <dc:date>2024-03-19T14:30:26Z</dc:date>
    <item>
      <title>Relative performance of real and complex FFTs in MKL</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Relative-performance-of-real-and-complex-FFTs-in-MKL/m-p/1581786#M35933</link>
      <description>&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;I am writing code that performs billions of FFTs of fairly long sequences (length 2^p for p in the range 17-20, i.e. roughly 100,000 to 1,000,000 points). It is crucial to find the fastest way to execute them. I read a technical report:&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;A class="" href="https://epubs.stfc.ac.uk/manifestation/45434584/RAL-TR-2020-003.pdf" target="_blank" rel="noopener nofollow noreferrer"&gt;&lt;SPAN class=""&gt;https://epubs.stfc.ac.uk/manifestation/45434584/RAL-TR-2020-003.pdf&lt;/SPAN&gt;&lt;/A&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;which claims that the execution time for complex-to-complex FFTs with MKL is about half that of real-to-real FFTs of the same length (in particular, for a length of around 1,000,000).&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;This seems counter-intuitive, as the complex-to-complex transform should require about twice the number of operations. In fact, two real-to-real transforms can be processed in a single complex-to-complex call. I can imagine the execution times being roughly equal if complex arithmetic is performed in parallel in the micro-architecture, but it would puzzle me if the complex transform were actually faster.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class=""&gt;My first question is simply whether this is true.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class=""&gt;My second question is which settings this result depends on (e.g. 
mkl_num_threads, compiler optimization, storage scheme, flags to encourage SIMD processing, et cetera).&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;If this holds when compiling with -Ofast and allowing MKL to use any number of threads (up to the number of physical cores), then I will commit to using complex-to-complex transforms.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;Postscript: I found the following remark:&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;community.intel.com/t5/Software-Archive/How-to-get-peak-performnace-in-FFT/m-p/975037#M24838&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;from an Intel programmer, stating that more effort has been invested in optimizing the complex-to-complex FFTs, which might be a clue. However, it neither explains how nor states that complex-to-complex is faster.&lt;/SPAN&gt;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Tue, 19 Mar 2024 14:30:26 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Relative-performance-of-real-and-complex-FFTs-in-MKL/m-p/1581786#M35933</guid>
      <dc:creator>Van_Veen__Lennaert</dc:creator>
      <dc:date>2024-03-19T14:30:26Z</dc:date>
    </item>
  </channel>
</rss>

