<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Poor scaling for real-to-real FFT with OpenMP in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Poor-scaling-for-real-to-real-FFT-with-OpenMP/m-p/1105873#M24103</link>
    <description>&lt;P&gt;In the attached file I use MKL to compute a real-to-real FFT using OpenMP for multithreading.&lt;/P&gt;

&lt;P&gt;The code is compiled with&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;icpc -o bench-fft -Wall -O3 -g -march=native -fopenmp bench-fft.cxx -mkl
&lt;/PRE&gt;

&lt;P&gt;The machine has 4 cores.&lt;/P&gt;

&lt;P&gt;It seems that the code does not scale well with the number of threads.&lt;/P&gt;

&lt;P&gt;When run with&lt;/P&gt;

&lt;P&gt;OMP_NUM_THREADS=1 ./bench-fft 4194304&lt;/P&gt;

&lt;P&gt;the total time taken is 0.1640 user, 0.0440 sys while with&lt;/P&gt;

&lt;P&gt;OMP_NUM_THREADS=2 ./bench-fft 4194304&lt;/P&gt;

&lt;P&gt;the total time taken is 0.3000 user, 0.0560 sys. The total CPU time almost doubles, which suggests a large synchronization overhead.&lt;/P&gt;

&lt;P&gt;Is this to be expected, or am I doing something wrong in my code?&lt;/P&gt;</description>
    <pubDate>Sat, 13 May 2017 12:08:07 GMT</pubDate>
    <dc:creator>Jyotirmoy_B_</dc:creator>
    <dc:date>2017-05-13T12:08:07Z</dc:date>
    <item>
      <title>Poor scaling for real-to-real FFT with OpenMP</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Poor-scaling-for-real-to-real-FFT-with-OpenMP/m-p/1105873#M24103</link>
      <description>&lt;P&gt;In the attached file I use MKL to compute a real-to-real FFT using OpenMP for multithreading.&lt;/P&gt;

&lt;P&gt;The code is compiled with&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;icpc -o bench-fft -Wall -O3 -g -march=native -fopenmp bench-fft.cxx -mkl
&lt;/PRE&gt;

&lt;P&gt;The machine has 4 cores.&lt;/P&gt;

&lt;P&gt;It seems that the code does not scale well with the number of threads.&lt;/P&gt;

&lt;P&gt;When run with&lt;/P&gt;

&lt;P&gt;OMP_NUM_THREADS=1 ./bench-fft 4194304&lt;/P&gt;

&lt;P&gt;the total time taken is 0.1640 user, 0.0440 sys while with&lt;/P&gt;

&lt;P&gt;OMP_NUM_THREADS=2 ./bench-fft 4194304&lt;/P&gt;

&lt;P&gt;the total time taken is 0.3000 user, 0.0560 sys. The total CPU time almost doubles, which suggests a large synchronization overhead.&lt;/P&gt;

&lt;P&gt;Is this to be expected, or am I doing something wrong in my code?&lt;/P&gt;</description>
      <pubDate>Sat, 13 May 2017 12:08:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Poor-scaling-for-real-to-real-FFT-with-OpenMP/m-p/1105873#M24103</guid>
      <dc:creator>Jyotirmoy_B_</dc:creator>
      <dc:date>2017-05-13T12:08:07Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...Is this to be expected</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Poor-scaling-for-real-to-real-FFT-with-OpenMP/m-p/1105874#M24104</link>
      <description>&amp;gt;&amp;gt;...Is this to be expected or am I doing something wrong in my code.

Try setting &lt;STRONG&gt;KMP_AFFINITY&lt;/STRONG&gt; to &lt;STRONG&gt;scatter&lt;/STRONG&gt; or &lt;STRONG&gt;compact&lt;/STRONG&gt; and using more OpenMP threads. On Linux, use the &lt;STRONG&gt;htop&lt;/STRONG&gt; utility to verify how threads are pinned to cores.</description>
      <pubDate>Mon, 15 May 2017 18:58:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Poor-scaling-for-real-to-real-FFT-with-OpenMP/m-p/1105874#M24104</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2017-05-15T18:58:01Z</dc:date>
    </item>
  </channel>
</rss>

