<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Topic "Can you set a size other than" in Software Archive</title>
    <link>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048741#M49017</link>
    <description>&lt;P&gt;Can you set a size other than a power of two, say 504x504 or 520x520 (assuming doubles)? Use a multiple of the cache line size, but not a power of 2.&lt;/P&gt;

&lt;P&gt;John D. McCalpin has written good explanations on this forum of why to avoid power-of-two sizes. Search for his name and you should find a link to the posting.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
    <pubDate>Sat, 08 Nov 2014 03:01:29 GMT</pubDate>
    <dc:creator>jimdempseyatthecove</dc:creator>
    <dc:date>2014-11-08T03:01:29Z</dc:date>
    <item>
      <title>Poor FFT mkl performance</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048738#M49014</link>
      <description>&lt;P&gt;I am optimizing a new application (written with Xeon Phi in mind) which performs a lot of FFT transforms.&lt;/P&gt;

&lt;P&gt;The transforms are done on 512x512 arrays separately in each thread. This works quite well on Xeon host. When running on Xeon Phi in native mode the performance is much slower than expected.&lt;/P&gt;

&lt;P&gt;After profiling (screenshot attached) I see that a lot of time is spent in mkl_dft_grasp_user_thread(). Can anyone tell me what this function does (I was not able to find anything on Google) and whether there is any way to mitigate the performance issue?&lt;/P&gt;

&lt;P&gt;thank you very much&lt;/P&gt;

&lt;P&gt;Vladimir Dergachev&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 06 Nov 2014 20:09:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048738#M49014</guid>
      <dc:creator>Vladimir_Dergachev</dc:creator>
      <dc:date>2014-11-06T20:09:06Z</dc:date>
    </item>
    <item>
      <title>Let me state something that</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048739#M49015</link>
      <description>&lt;P&gt;Let me state something that to the uninitiated will seem counter-intuitive.&lt;/P&gt;

&lt;P&gt;When using a multi-threaded program in which each thread calls MKL, you are supposed to link with the &lt;EM&gt;&lt;STRONG&gt;single-threaded&lt;/STRONG&gt;&lt;/EM&gt; version of MKL. Linking with the multi-threaded MKL can cause each calling thread to spawn its own thread pool.&lt;/P&gt;

&lt;P&gt;Let me add a caveat here. This situation has come up often enough that MKL may have been modified to detect it and instantiate a single thread pool. While that may prove satisfactory when each user thread calls MKL intermittently, it may be detrimental when many user threads call MKL concurrently.&lt;/P&gt;

&lt;P&gt;From the name mkl_dft_grasp_user_thread() it is not clear which of the two cases is the cause. However, linking with the single-threaded MKL (in this instance) may produce the results you seek.&lt;/P&gt;

&lt;P&gt;You may want to experiment using 2, 3, and 4 threads per core.&lt;/P&gt;
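
&lt;P&gt;As a sketch (assuming the Intel compiler on Linux for native Xeon Phi builds; exact library names vary by MKL version, so check the link-line advisor), the single-threaded link looks like this:&lt;/P&gt;

&lt;PRE&gt;# single-threaded MKL via the convenience flag:
icc -mmic myapp.c -mkl=sequential

# or with an explicit link line (library names vary by MKL version):
icc -mmic myapp.c -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm&lt;/PRE&gt;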

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Fri, 07 Nov 2014 20:52:11 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048739#M49015</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2014-11-07T20:52:11Z</dc:date>
    </item>
    <item>
      <title>I did try linking by</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048740#M49016</link>
      <description>&lt;P&gt;I did try linking by specifying -mkl=sequential or -mkl=parallel, but I get essentially the same trace. Also, in that particular case the application was using only 30 threads, leaving plenty of room for extra threads if needed.&lt;/P&gt;

&lt;P&gt;best&lt;/P&gt;

&lt;P&gt;Vladimir Dergachev&lt;/P&gt;</description>
      <pubDate>Fri, 07 Nov 2014 21:17:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048740#M49016</guid>
      <dc:creator>Vladimir_Dergachev</dc:creator>
      <dc:date>2014-11-07T21:17:37Z</dc:date>
    </item>
    <item>
      <title>Can you set a size other than</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048741#M49017</link>
      <description>&lt;P&gt;Can you set a size other than a power of two, say 504x504 or 520x520 (assuming doubles)? Use a multiple of the cache line size, but not a power of 2.&lt;/P&gt;

&lt;P&gt;John D. McCalpin has written good explanations on this forum of why to avoid power-of-two sizes. Search for his name and you should find a link to the posting.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Sat, 08 Nov 2014 03:01:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048741#M49017</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2014-11-08T03:01:29Z</dc:date>
    </item>
    <item>
      <title>You would want each copy of</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048742#M49018</link>
      <description>&lt;P&gt;You would want each copy of MKL to use a small team of threads, with the total number less than 4 times the number of cores. By default, nested parallelism (OMP_NESTED) is off, so you would likely not use enough threads.&lt;/P&gt;</description>
      <pubDate>Sat, 08 Nov 2014 05:57:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048742#M49018</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2014-11-08T05:57:53Z</dc:date>
    </item>
    <item>
      <title>Dear Vladimir,</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048743#M49019</link>
      <description>&lt;P&gt;Dear Vladimir,&lt;/P&gt;

&lt;P&gt;You may find the following article useful.&lt;/P&gt;

&lt;P&gt;Please also consider upgrading to the latest MKL version.&lt;/P&gt;

&lt;P&gt;The FFT performance has been improved since MKL 11.1 was released a year ago.&lt;/P&gt;

&lt;P&gt;Evgueni.&lt;/P&gt;

&lt;P&gt;&lt;A href="https://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-functions-performance-on-intel-xeon-phi-coprocessors"&gt;https://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-functions-performance-on-intel-xeon-phi-coprocessors&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 10 Nov 2014 03:37:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048743#M49019</guid>
      <dc:creator>Evgueni_P_Intel</dc:creator>
      <dc:date>2014-11-10T03:37:07Z</dc:date>
    </item>
    <item>
      <title>For multi-dimensional FFTs</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048744#M49020</link>
      <description>&lt;P&gt;For multi-dimensional FFTs you want to transform vectors with lengths that are powers of 2 for performance, but you also want to pad the data storage so that independent transforms are not accessing vectors that are separated by powers of two.&lt;/P&gt;

&lt;P&gt;This is discussed at &lt;A href="https://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-functions-performance-on-intel-xeon-phi-coprocessors" target="_blank"&gt;https://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-functions-performance-on-intel-xeon-phi-coprocessors&lt;/A&gt;&lt;/P&gt;
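
&lt;P&gt;To make the padding concrete, here is a minimal sketch using the MKL DFTI C API (not taken from the article; the row stride of 520 is an illustrative choice of a cache-line multiple that is not a power of two):&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;/* 512x512 complex double transform stored with a padded row stride of 520,
   so consecutive rows are not separated by a power-of-two number of elements */
DFTI_DESCRIPTOR_HANDLE h;
MKL_LONG sizes[2]   = {512, 512};
MKL_LONG strides[3] = {0, 520, 1};  /* offset, row stride, element stride */
DftiCreateDescriptor(&amp;amp;h, DFTI_DOUBLE, DFTI_COMPLEX, 2, sizes);
DftiSetValue(h, DFTI_INPUT_STRIDES, strides);
DftiCommitDescriptor(h);
DftiComputeForward(h, data);  /* data allocated as 512*520 complex doubles */
DftiFreeDescriptor(&amp;amp;h);&lt;/PRE&gt;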

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 10 Nov 2014 15:36:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048744#M49020</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2014-11-10T15:36:17Z</dc:date>
    </item>
    <item>
      <title>Great, thanks for the</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048745#M49021</link>
      <description>&lt;P&gt;Great, thanks for the suggestions !&lt;/P&gt;

&lt;P&gt;I am going to try using non-power of 2 image.&lt;/P&gt;

&lt;P&gt;However, I would have expected if cache aliasing was a problem I would see a lot of time spent in a function that does computation and a lot of cache misses. But what I see instead is that most of the time is spent in mkl_dft_grasp_user_thread() and it increases sharply with number of threads allocated to the process. Which suggests that the problem is contention of some sort, but why do we need to "grasp" threads even in case of a sequential library ?&lt;/P&gt;

&lt;P&gt;It's too bad the source is not available as it is for fftw.&lt;/P&gt;</description>
      <pubDate>Mon, 10 Nov 2014 18:03:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048745#M49021</guid>
      <dc:creator>Vladimir_Dergachev</dc:creator>
      <dc:date>2014-11-10T18:03:52Z</dc:date>
    </item>
    <item>
      <title>The observed "hot" routine</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048746#M49022</link>
      <description>&lt;P&gt;The observed "hot" routine certainly seems strange for this use case.&lt;/P&gt;

&lt;P&gt;Until someone from Intel comments, it might be useful to look at this a couple of different ways:&lt;/P&gt;

&lt;OL&gt;
	&lt;LI&gt;Can you give us an idea of the absolute performance (e.g., seconds per 512x512 transform) for a few different numbers of independent threads?&lt;/LI&gt;
	&lt;LI&gt;Does the execution time change much when you don't profile with VTune?&lt;/LI&gt;
&lt;/OL&gt;</description>
      <pubDate>Mon, 10 Nov 2014 22:46:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048746#M49022</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2014-11-10T22:46:50Z</dc:date>
    </item>
    <item>
      <title>It would also be helpful to</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048747#M49023</link>
      <description>&lt;P&gt;It would also be helpful to understand what threading model you are using and how the environment is set up. For example, it is important to make sure that the independent threads calling MKL routines don't end up getting bound (in MKL) to the same core.&lt;/P&gt;</description>
      <pubDate>Mon, 10 Nov 2014 22:50:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048747#M49023</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2014-11-10T22:50:01Z</dc:date>
    </item>
    <item>
      <title>Can you check if this covers</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048748#M49024</link>
      <description>&lt;P&gt;Can you check if this covers your use case?&lt;/P&gt;

&lt;P&gt;&lt;A href="https://software.intel.com/en-us/articles/different-parallelization-techniques-and-intel-mkl-fft/" target="_blank"&gt;https://software.intel.com/en-us/articles/different-parallelization-techniques-and-intel-mkl-fft/&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;If the FFT grid is the same for all the FFTs you need, performing multiple FFTs simultaneously is an option.&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;status=DftiSetValue(my_handle,DFTI_NUMBER_OF_TRANSFORMS,howmany);
...
DftiCommitDescriptor(my_handle); //commit the handle

for(int i=0; i&amp;lt;num_fft; i+=howmany)
  fft(handle,data&lt;I&gt;); // data&lt;I&gt;=starting address of the i-th data on a FFT grid&lt;/I&gt;&lt;/I&gt;&lt;/PRE&gt;

&lt;P&gt;For instance, with 240 threads one can use howmany=60, which is equivalent to doing 1 FFT on 1 core / 4 threads. The optimal howmany will depend on the FFT grid size, memory, and speed.&lt;/P&gt;

&lt;P&gt;Alternatively, you can use nested OpenMP which looks like this&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;status=DftiSetValue(my_handle,DFTI_NUMBER_OF_USER_THREADS,num_user_threads);
...
DftiCommitDescriptor(my_handle); //commit the handle

#pragma omp parallel num_threads(num_user_threads);
for(int i=0; i&amp;lt;howmany; ++i)
  fft(my_handle,data&lt;I&gt;); // threaded MKL&lt;/I&gt;&lt;/PRE&gt;

&lt;P&gt;One needs to set the property of the handle so that multiple threads can share the same plan (Case 4 in the URL above). I'm concerned that the performance of nested OpenMP is not going to be great unless the environment variables are set very carefully.&lt;/P&gt;
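
&lt;P&gt;As a rough starting point (illustrative values, to be tuned for your workload), the environment for the nested case might look like:&lt;/P&gt;

&lt;PRE&gt;export OMP_NESTED=true
export OMP_MAX_ACTIVE_LEVELS=2
export MKL_DYNAMIC=false
export OMP_NUM_THREADS=60,4      # outer user threads, inner MKL threads
export KMP_AFFINITY=compact,granularity=fine&lt;/PRE&gt;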

&lt;P&gt;As John mentioned, data alignment and padding can have a big impact on performance. Use the DftiSetValue API to fine-tune these. See&amp;nbsp;https://software.intel.com/en-us/node/521959&lt;/P&gt;</description>
      <pubDate>Mon, 10 Nov 2014 23:18:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048748#M49024</guid>
      <dc:creator>Jeongnim_K_Intel1</dc:creator>
      <dc:date>2014-11-10T23:18:32Z</dc:date>
    </item>
    <item>
      <title>Looks like the problem goes</title>
      <link>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048749#M49025</link>
      <description>&lt;P&gt;Looks like the problem goes away if I create a separate plan for each thread, rather than using the same plan.&lt;/P&gt;

&lt;P&gt;Given that MKL descriptors are lighter-weight than FFTW plans, this is not so bad.&lt;/P&gt;
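
&lt;P&gt;For reference, the per-thread-descriptor pattern looks roughly like this (a sketch assuming OpenMP; data[i] stands in for each transform's array):&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#pragma omp parallel
{
  /* each thread owns its own descriptor, so nothing is shared or contended */
  DFTI_DESCRIPTOR_HANDLE h;
  MKL_LONG sizes[2] = {512, 512};
  DftiCreateDescriptor(&amp;amp;h, DFTI_DOUBLE, DFTI_COMPLEX, 2, sizes);
  DftiCommitDescriptor(h);
  for (int i = omp_get_thread_num(); i &amp;lt; num_fft; i += omp_get_num_threads())
    DftiComputeForward(h, data[i]);
  DftiFreeDescriptor(&amp;amp;h);
}&lt;/PRE&gt;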

&lt;P&gt;Thank you for all the suggestions! Off to optimize it further...&lt;/P&gt;

&lt;P&gt;best&lt;/P&gt;

&lt;P&gt;Vladimir Dergachev&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 11 Nov 2014 01:51:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Poor-FFT-mkl-performance/m-p/1048749#M49025</guid>
      <dc:creator>Vladimir_Dergachev</dc:creator>
      <dc:date>2014-11-11T01:51:30Z</dc:date>
    </item>
  </channel>
</rss>

