<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Topics in Intel® oneAPI Math Kernel Library: openmp FFT performance</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/openmp-FFT-performance/m-p/793627#M2462</link>
    <description>The drop from 3.12x at 512K points to 1.5x at 1M points mainly corresponds to the cache boundary (8 MB for the E5530 CPU).&lt;BR /&gt;&lt;BR /&gt;The rest of your data just seems to correlate with the number of floating point operations per point performed by MKL.&lt;BR /&gt;&lt;BR /&gt;In particular, the jump from 1.6x at 2M points to 2.3x at 4M points corresponds to the fact that at 4M points the lengths of your DFTs start to be divisible by 16 -- I take your M as 1000000.&lt;BR /&gt;&lt;BR /&gt;The number of cache misses is rather a consequence of piping more and more data through the cache -- it isn't the root cause of the behavior that you see.&lt;BR /&gt;</description>
    <pubDate>Mon, 25 Jul 2011 13:00:43 GMT</pubDate>
    <dc:creator>Evgueni_P_Intel</dc:creator>
    <dc:date>2011-07-25T13:00:43Z</dc:date>
    <item>
      <title>openmp FFT performance</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/openmp-FFT-performance/m-p/793622#M2457</link>
      <description>&lt;P&gt;Hi,&lt;BR /&gt;&lt;BR /&gt;I'm trying to parallelize the computation of multiple FFTs by using OpenMP to divide the vectors among threads. I used the link advisor to link with the sequential library and all other libraries, which is what &lt;A href="http://software.intel.com/en-us/articles/different-parallelization-techniques-and-intel-mkl-fft/"&gt;this link&lt;/A&gt; says to do when parallelizing descriptor creation and FFT computation. The performance I am seeing is not what I expected. With 2 threads, I see a speedup of 1.83. With 4 threads, I see a speedup of 2.6. &lt;BR /&gt;&lt;BR /&gt;I also inserted some code to time different parts of the program, including the average time to compute one FFT. Using 4 threads actually increases that average time. With 1 thread on a vector size of 614000, the average time to compute the transform is 0.647 seconds. With 2 threads, it is not too bad at 0.699 seconds. But with 4 threads, the average time to compute one transform is 0.983 seconds. &lt;BR /&gt;&lt;BR /&gt;I'm using the PGI Fortran compiler, and these are the libraries that I link with: mkl_intel_ilp64.lib, mkl_sequential.lib, mkl_core.lib, and mkl_solver_ilp64_sequential.lib.&lt;BR /&gt;&lt;BR /&gt;I don't know whether this has something to do with the libraries that I'm using, or whether this speedup is normal and I'm being naive. I am a beginner, so I would appreciate any help that I could get. I can provide the timing info and some code snippets if needed.&lt;BR /&gt;&lt;BR /&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Wed, 06 Jul 2011 14:58:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/openmp-FFT-performance/m-p/793622#M2457</guid>
      <dc:creator>hwilliams11</dc:creator>
      <dc:date>2011-07-06T14:58:07Z</dc:date>
    </item>
    <item>
      <title>openmp FFT performance</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/openmp-FFT-performance/m-p/793623#M2458</link>
      <description>There are many unknowns in your post: the CPU and number of CPUs in your system, total number of vectors, precision, domain, input/output placement, and MKL version (some update to 10.2?).&lt;BR /&gt;One explanation could be that 4 (?) vectors of 614000 points plus internal MKL buffers don't fit in the last-level cache on your system, so MKL FFTs have to access main memory, which is much slower than accessing the cache.&lt;BR /&gt;If that is the case, you may see better speedups with shorter vectors.</description>
      <pubDate>Wed, 06 Jul 2011 18:27:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/openmp-FFT-performance/m-p/793623#M2458</guid>
      <dc:creator>Evgueni_P_Intel</dc:creator>
      <dc:date>2011-07-06T18:27:56Z</dc:date>
    </item>
    <item>
      <title>openmp FFT performance</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/openmp-FFT-performance/m-p/793624#M2459</link>
      <description>Sorry. I'm using a quad-core Intel Xeon E5530 at 2.39 GHz. I'm running 16 vectors of 614000 points each across 1, 2, and 4 threads. I'm using double precision and these are out-of-place transforms. The MKL version is 10.2 update 1.</description>
      <pubDate>Thu, 07 Jul 2011 18:08:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/openmp-FFT-performance/m-p/793624#M2459</guid>
      <dc:creator>hwilliams11</dc:creator>
      <dc:date>2011-07-07T18:08:44Z</dc:date>
    </item>
    <item>
      <title>openmp FFT performance</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/openmp-FFT-performance/m-p/793625#M2460</link>
      <description>&lt;P&gt;Since the size of 16 vectors of 614000 points considerably exceeds the size of the last-level cache, the MKL performance that you see is affected by the interconnect between the CPU and main memory.&lt;BR /&gt;The Xeon E5530 CPU can use 2 QPI links to main memory.&lt;BR /&gt;&lt;BR /&gt;When there are only 2 threads, each of them uses its own QPI link.&lt;BR /&gt;So both computation and memory access are sped up ~2x compared to the sequential case.&lt;BR /&gt;&lt;BR /&gt;When there are 4 threads, each link is shared by 2 threads and &lt;STRONG&gt;&lt;EM&gt;only&lt;/EM&gt; &lt;EM&gt;computation is sped up &lt;/EM&gt;&lt;/STRONG&gt;~2x compared to the case with 2 threads, while memory access takes the same time as with 2 threads.&lt;BR /&gt;This is why the speedup for 4 threads is less than one might expect for your FFTs.&lt;/P&gt;</description>
      <pubDate>Fri, 08 Jul 2011 08:40:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/openmp-FFT-performance/m-p/793625#M2460</guid>
      <dc:creator>Evgueni_P_Intel</dc:creator>
      <dc:date>2011-07-08T08:40:37Z</dc:date>
    </item>
    <item>
      <title>openmp FFT performance</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/openmp-FFT-performance/m-p/793626#M2461</link>
      <description>Thanks for that info. I meant to say that I'm using 16 vectors of 61400 points. But I did understand your point that the number of points is a big factor. I've tried the program with different data sizes, from 64,000 total points up to 32 million total points, to see if my results were consistent. Below I've pasted the speedup when using 4 threads. &lt;TABLE cellpadding="0" cellspacing="0" border="0"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD style="text-align: center;"&gt;Size (total points)&lt;/TD&gt;&lt;TD style="text-align: center;"&gt;Speedup (4 threads)&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="text-align: center;"&gt;64K&lt;/TD&gt;&lt;TD style="text-align: center;"&gt;3.7091546&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="text-align: center;"&gt;128K&lt;/TD&gt;&lt;TD style="text-align: center;"&gt;3.6893184&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="text-align: center;"&gt;256K&lt;/TD&gt;&lt;TD style="text-align: center;"&gt;3.4644231&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="text-align: center;"&gt;512K&lt;/TD&gt;&lt;TD style="text-align: center;"&gt;3.1200787&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="text-align: center;"&gt;1M&lt;/TD&gt;&lt;TD style="text-align: center;"&gt;1.5317248&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="text-align: center;"&gt;2M&lt;/TD&gt;&lt;TD style="text-align: center;"&gt;1.6484851&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="text-align: center;"&gt;4M&lt;/TD&gt;&lt;TD style="text-align: center;"&gt;2.2831603&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="text-align: center;"&gt;8M&lt;/TD&gt;&lt;TD style="text-align: center;"&gt;2.2191637&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="text-align: center;"&gt;16M&lt;/TD&gt;&lt;TD style="text-align: center;"&gt;2.3427168&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="text-align: center;"&gt;32M&lt;/TD&gt;&lt;TD style="text-align: center;"&gt;2.2812101&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&lt;BR /&gt;&lt;BR /&gt;So at first the speedup is close to 4, but then it starts to decrease. I figured this is because of cache misses. But there is a big decrease in speedup at 1M and 2M, and then the speedup increases again. I worked out the amount of memory that the program uses for each data size. At 1M data samples total, I'm using 32MB of memory, and 64MB at 2M samples. I was just wondering why the speedup is so bad at those two particular data set sizes. At 512K samples, I am using 16MB of memory in the program, so at that point I would already be out of the cache. So why is there such a decrease at 1 and 2 million, and why does the speedup then increase again with larger datasets? &lt;BR /&gt;&lt;BR /&gt;I used a performance profiler to look at cache misses and such, but the profiler does not show a dramatic increase in cache misses for those two data sizes.&lt;/P&gt;&lt;TABLE cellpadding="0" cellspacing="0" border="0"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD style="text-align: center;"&gt;Size (total points)&lt;/TD&gt;&lt;TD style="text-align: center;"&gt;Cache Miss&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="text-align: center;"&gt;64K&lt;/TD&gt;&lt;TD style="text-align: center;"&gt;0.000&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="text-align: center;"&gt;128K&lt;/TD&gt;&lt;TD style="text-align: center;"&gt;0.000&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="text-align: center;"&gt;256K&lt;/TD&gt;&lt;TD style="text-align: center;"&gt;0.000&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="text-align: center;"&gt;512K&lt;/TD&gt;&lt;TD style="text-align: center;"&gt;0.199&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="text-align: center;"&gt;1M&lt;/TD&gt;&lt;TD style="text-align: center;"&gt;0.330&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="text-align: center;"&gt;2M&lt;/TD&gt;&lt;TD style="text-align: center;"&gt;0.378&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="text-align: center;"&gt;4M&lt;/TD&gt;&lt;TD style="text-align: center;"&gt;0.439&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="text-align: center;"&gt;8M&lt;/TD&gt;&lt;TD style="text-align: center;"&gt;0.458&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="text-align: center;"&gt;16M&lt;/TD&gt;&lt;TD style="text-align: center;"&gt;0.428&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="text-align: center;"&gt;32M&lt;/TD&gt;&lt;TD style="text-align: center;"&gt;0.499&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&lt;BR /&gt;Thank you for taking the time to answer my questions. I am a beginner at this topic, so any help is greatly appreciated.&lt;/P&gt;</description>
      <pubDate>Fri, 22 Jul 2011 19:24:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/openmp-FFT-performance/m-p/793626#M2461</guid>
      <dc:creator>hwilliams11</dc:creator>
      <dc:date>2011-07-22T19:24:39Z</dc:date>
    </item>
    <item>
      <title>openmp FFT performance</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/openmp-FFT-performance/m-p/793627#M2462</link>
      <description>The drop from 3.12x at 512K points to 1.5x at 1M points mainly corresponds to the cache boundary (8 MB for the E5530 CPU).&lt;BR /&gt;&lt;BR /&gt;The rest of your data just seems to correlate with the number of floating point operations per point performed by MKL.&lt;BR /&gt;&lt;BR /&gt;In particular, the jump from 1.6x at 2M points to 2.3x at 4M points corresponds to the fact that at 4M points the lengths of your DFTs start to be divisible by 16 -- I take your M as 1000000.&lt;BR /&gt;&lt;BR /&gt;The number of cache misses is rather a consequence of piping more and more data through the cache -- it isn't the root cause of the behavior that you see.&lt;BR /&gt;</description>
      <pubDate>Mon, 25 Jul 2011 13:00:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/openmp-FFT-performance/m-p/793627#M2462</guid>
      <dc:creator>Evgueni_P_Intel</dc:creator>
      <dc:date>2011-07-25T13:00:43Z</dc:date>
    </item>
  </channel>
</rss>

