<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to get best performance with MKL FFT ? in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-get-best-performance-with-MKL-FFT/m-p/862416#M7591</link>
    <description>&lt;DIV style="margin:0px;"&gt;Thank you for your reply.&lt;BR /&gt;&lt;BR /&gt;
&lt;DIV id="quote_reply" style="margin-top: 5px; width: 130.61%; height: 642px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/93647"&gt;Dmitry Baksheev (Intel)&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;BR /&gt;&lt;EM&gt;Hi,&lt;BR /&gt;&lt;BR /&gt;Thank you for providing all the details in your question. There are several things to comment on.&lt;BR /&gt;&lt;BR /&gt;1D real-to-complex transforms are not threaded in MKL, so you should observe at most 3/24 CPU usage while in DftiComputeForward. What you see as 30% is probably the parallel initialization of array[] in work().&lt;BR /&gt;&lt;/EM&gt;&lt;STRONG&gt;&lt;SPAN style="font-size: x-small;"&gt;How about 1D complex-to-complex FFT?&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;EM&gt;&lt;SPAN style="font-size: x-small;"&gt; &lt;BR /&gt;&lt;/SPAN&gt;&lt;BR /&gt;In work() you create a descriptor and use it on the same thread; that is, the descriptor is not shared between multiple user threads, so it makes no sense to set DFTI_NUMBER_OF_USER_THREADS there. It would make sense if the descriptor were created in test() and then shared by the three invocations of work(), which would use the descriptor concurrently.&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt;&lt;SPAN style="font-size: x-small;"&gt;&lt;STRONG&gt;Do you mean the following three functions should be called in test()?&lt;BR /&gt;&lt;BR /&gt;status = DftiCreateDescriptor(&amp;amp;my_handle, DFTI_SINGLE, DFTI_REAL, 1, size);&lt;BR /&gt;status = DftiSetValue(my_handle, DFTI_NUMBER_OF_USER_THREADS, nThread); // nThread = ?&lt;BR /&gt;status = DftiCommitDescriptor(my_handle);&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;EM&gt;By default, MKL does not go parallel when it is called from an OpenMP parallel region. You should explicitly let it go parallel by setting the environment variable MKL_DYNAMIC=false and allowing nested parallelism by setting OMP_NESTED=true (the latter might be the default setting). Of course, you may do this by calling the respective functions instead of setting environment variables.&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt;&lt;STRONG&gt;&lt;SPAN style="font-size: x-small;"&gt;Should I call MKL_set_dynamic(false) and omp_set_nested(1) in test()?&lt;BR /&gt;How should I call status = DftiSetValue(my_handle, DFTI_NUMBER_OF_USER_THREADS, nThread)?&lt;BR /&gt;Should nThread be 1, or should I not call DftiSetValue() at all?&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;EM&gt;And finally, please note that a real-to-complex transform of size N requires slightly more space than double[N], because the result consists of N/2+1 complex numbers; that is, the size of the arrays should be at least 2*(size/2+1) in your case.&lt;BR /&gt;&lt;BR /&gt;Thanks&lt;BR /&gt;Dima&lt;BR /&gt;&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;BR /&gt;</description>
    <pubDate>Mon, 14 Dec 2009 06:45:20 GMT</pubDate>
    <dc:creator>afd_lml</dc:creator>
    <dc:date>2009-12-14T06:45:20Z</dc:date>
    <item>
      <title>How to get best performance with MKL FFT ?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-get-best-performance-with-MKL-FFT/m-p/862414#M7589</link>
      <description>Hi, all&lt;BR /&gt; My workstation has 24 cores. The source code is something like the following:&lt;BR /&gt;&lt;BR /&gt;void test()&lt;BR /&gt;{&lt;BR /&gt; // define some data&lt;BR /&gt; const int size = 10000;&lt;BR /&gt; double x[size], y[size], z[size];&lt;BR /&gt; double sum[size];&lt;BR /&gt;&lt;BR /&gt; // now we compute x, y, z&lt;BR /&gt; work(x, size);&lt;BR /&gt; work(y, size);&lt;BR /&gt; work(z, size);&lt;BR /&gt;&lt;BR /&gt; // sum x, y, z&lt;BR /&gt; for (int i = 0; i &amp;lt; size; i++)&lt;BR /&gt; sum[i] = x[i] + y[i] + z[i];&lt;BR /&gt;}&lt;BR /&gt;&lt;BR /&gt;void work(double array[], const int size)&lt;BR /&gt;{&lt;BR /&gt; // write data to array&lt;BR /&gt; for (int i = 0; i &amp;lt; size; i++)&lt;BR /&gt; array[i] = .....;&lt;BR /&gt;&lt;BR /&gt; // do FFT with MKL&lt;BR /&gt; DFTI_DESCRIPTOR_HANDLE my_handle;&lt;BR /&gt;&lt;BR /&gt; MKL_LONG status;&lt;BR /&gt;&lt;BR /&gt; status = DftiCreateDescriptor(&amp;my_handle, DFTI_SINGLE, DFTI_REAL, 1, size);&lt;BR /&gt;&lt;BR /&gt; status = DftiCommitDescriptor(my_handle);&lt;BR /&gt;&lt;BR /&gt; status = DftiComputeForward(my_handle, array);&lt;BR /&gt;&lt;BR /&gt; status = DftiFreeDescriptor(&amp;my_handle);&lt;BR /&gt;}&lt;BR /&gt;&lt;BR /&gt;Then I used OpenMP to rewrite the above code as follows:&lt;BR /&gt;&lt;BR /&gt;void test()&lt;BR /&gt;{&lt;BR /&gt; // define some data&lt;BR /&gt; const int size = 10000;&lt;BR /&gt; double x[size], y[size], z[size];&lt;BR /&gt; double sum[size];&lt;BR /&gt;&lt;BR /&gt; // now we compute x, y, z&lt;BR /&gt; #pragma omp parallel&lt;BR /&gt; {&lt;BR /&gt; #pragma omp sections&lt;BR /&gt; {&lt;BR /&gt; #pragma omp section&lt;BR /&gt; work(x, size);&lt;BR /&gt;&lt;BR /&gt; #pragma omp section&lt;BR /&gt; work(y, size);&lt;BR /&gt;&lt;BR /&gt; #pragma omp section&lt;BR /&gt; work(z, size);&lt;BR /&gt; }&lt;BR /&gt;&lt;BR /&gt; // sum x, y, z&lt;BR /&gt; #pragma omp for&lt;BR /&gt; for (int i = 0; i &amp;lt; size; i++)&lt;BR /&gt; sum[i] = x[i] + y[i] + z[i];&lt;BR /&gt; }&lt;BR /&gt;}&lt;BR /&gt;&lt;BR /&gt;void work(double array[], const int size)&lt;BR /&gt;{&lt;BR /&gt; // write data to array&lt;BR /&gt; #pragma omp parallel for&lt;BR /&gt; for (int i = 0; i &amp;lt; size; i++)&lt;BR /&gt; array[i] = .....;&lt;BR /&gt;&lt;BR /&gt; int nThread = 8; // the computer has 24 cores, so I set 24/3=8&lt;BR /&gt;&lt;BR /&gt; mkl_set_num_threads(nThread);&lt;BR /&gt;&lt;BR /&gt; // do FFT with MKL&lt;BR /&gt; DFTI_DESCRIPTOR_HANDLE my_handle;&lt;BR /&gt;&lt;BR /&gt; MKL_LONG status;&lt;BR /&gt;&lt;BR /&gt; status = DftiCreateDescriptor(&amp;my_handle, DFTI_SINGLE, DFTI_REAL, 1, size);&lt;BR /&gt;&lt;BR /&gt; status = DftiSetValue(my_handle, DFTI_NUMBER_OF_USER_THREADS, nThread);&lt;BR /&gt;&lt;BR /&gt; status = DftiCommitDescriptor(my_handle);&lt;BR /&gt;&lt;BR /&gt; status = DftiComputeForward(my_handle, array);&lt;BR /&gt;&lt;BR /&gt; status = DftiFreeDescriptor(&amp;my_handle);&lt;BR /&gt;}&lt;BR /&gt;&lt;BR /&gt;However, Windows Task Manager shows CPU usage of only 30%, not the expected 100%. Why?&lt;BR /&gt;&lt;BR /&gt;Could anyone help? Many thanks.&lt;BR /&gt;</description>
      <pubDate>Mon, 14 Dec 2009 04:26:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-get-best-performance-with-MKL-FFT/m-p/862414#M7589</guid>
      <dc:creator>afd_lml</dc:creator>
      <dc:date>2009-12-14T04:26:51Z</dc:date>
    </item>
    <item>
      <title>Re: How to get best performance with MKL FFT ?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-get-best-performance-with-MKL-FFT/m-p/862415#M7590</link>
      <description>&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;BR /&gt;Hi,&lt;BR /&gt;&lt;BR /&gt;Thank you for providing all the details in your question. There are several things to comment on.&lt;BR /&gt;&lt;BR /&gt;1D real-to-complex transforms are not threaded in MKL, so you should observe at most 3/24 CPU usage while in DftiComputeForward. What you see as 30% is probably the parallel initialization of array[] in work().&lt;BR /&gt;&lt;BR /&gt;In work() you create a descriptor and use it on the same thread; that is, the descriptor is not shared between multiple user threads, so it makes no sense to set DFTI_NUMBER_OF_USER_THREADS there. It would make sense if the descriptor were created in test() and then shared by the three invocations of work(), which would use the descriptor concurrently.&lt;BR /&gt;&lt;BR /&gt;By default, MKL does not go parallel when it is called from an OpenMP parallel region. You should explicitly let it go parallel by setting the environment variable MKL_DYNAMIC=false and allowing nested parallelism by setting OMP_NESTED=true (the latter might be the default setting). Of course, you may do this by calling the respective functions instead of setting environment variables.&lt;BR /&gt;&lt;BR /&gt;And finally, please note that a real-to-complex transform of size N requires slightly more space than double[N], because the result consists of N/2+1 complex numbers; that is, the size of the arrays should be at least 2*(size/2+1) in your case.&lt;BR /&gt;&lt;BR /&gt;Thanks&lt;BR /&gt;Dima&lt;BR /&gt;</description>
      <pubDate>Mon, 14 Dec 2009 05:16:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-get-best-performance-with-MKL-FFT/m-p/862415#M7590</guid>
      <dc:creator>Dmitry_B_Intel</dc:creator>
      <dc:date>2009-12-14T05:16:27Z</dc:date>
    </item>
    <item>
      <title>Re: How to get best performance with MKL FFT ?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-get-best-performance-with-MKL-FFT/m-p/862416#M7591</link>
      <description>&lt;DIV style="margin:0px;"&gt;Thank you for your reply.&lt;BR /&gt;&lt;BR /&gt;
&lt;DIV id="quote_reply" style="margin-top: 5px; width: 130.61%; height: 642px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/93647"&gt;Dmitry Baksheev (Intel)&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;BR /&gt;&lt;EM&gt;Hi,&lt;BR /&gt;&lt;BR /&gt;Thank you for providing all the details in your question. There are several things to comment on.&lt;BR /&gt;&lt;BR /&gt;1D real-to-complex transforms are not threaded in MKL, so you should observe at most 3/24 CPU usage while in DftiComputeForward. What you see as 30% is probably the parallel initialization of array[] in work().&lt;BR /&gt;&lt;/EM&gt;&lt;STRONG&gt;&lt;SPAN style="font-size: x-small;"&gt;How about 1D complex-to-complex FFT?&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;EM&gt;&lt;SPAN style="font-size: x-small;"&gt; &lt;BR /&gt;&lt;/SPAN&gt;&lt;BR /&gt;In work() you create a descriptor and use it on the same thread; that is, the descriptor is not shared between multiple user threads, so it makes no sense to set DFTI_NUMBER_OF_USER_THREADS there. It would make sense if the descriptor were created in test() and then shared by the three invocations of work(), which would use the descriptor concurrently.&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt;&lt;SPAN style="font-size: x-small;"&gt;&lt;STRONG&gt;Do you mean the following three functions should be called in test()?&lt;BR /&gt;&lt;BR /&gt;status = DftiCreateDescriptor(&amp;amp;my_handle, DFTI_SINGLE, DFTI_REAL, 1, size);&lt;BR /&gt;status = DftiSetValue(my_handle, DFTI_NUMBER_OF_USER_THREADS, nThread); // nThread = ?&lt;BR /&gt;status = DftiCommitDescriptor(my_handle);&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;EM&gt;By default, MKL does not go parallel when it is called from an OpenMP parallel region. You should explicitly let it go parallel by setting the environment variable MKL_DYNAMIC=false and allowing nested parallelism by setting OMP_NESTED=true (the latter might be the default setting). Of course, you may do this by calling the respective functions instead of setting environment variables.&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt;&lt;STRONG&gt;&lt;SPAN style="font-size: x-small;"&gt;Should I call MKL_set_dynamic(false) and omp_set_nested(1) in test()?&lt;BR /&gt;How should I call status = DftiSetValue(my_handle, DFTI_NUMBER_OF_USER_THREADS, nThread)?&lt;BR /&gt;Should nThread be 1, or should I not call DftiSetValue() at all?&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;EM&gt;And finally, please note that a real-to-complex transform of size N requires slightly more space than double[N], because the result consists of N/2+1 complex numbers; that is, the size of the arrays should be at least 2*(size/2+1) in your case.&lt;BR /&gt;&lt;BR /&gt;Thanks&lt;BR /&gt;Dima&lt;BR /&gt;&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;BR /&gt;</description>
      <pubDate>Mon, 14 Dec 2009 06:45:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-get-best-performance-with-MKL-FFT/m-p/862416#M7591</guid>
      <dc:creator>afd_lml</dc:creator>
      <dc:date>2009-12-14T06:45:20Z</dc:date>
    </item>
    <item>
      <title>Re: How to get best performance with MKL FFT ?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-get-best-performance-with-MKL-FFT/m-p/862417#M7592</link>
      <description>&lt;BR /&gt;Hi,&lt;BR /&gt;&lt;BR /&gt;Complex-to-complex 1D FFTs of size &amp;gt; 2^16 (or smaller if the size is a power of 2) are threaded.&lt;BR /&gt;&lt;BR /&gt;If you are going to share a descriptor between K threads (K=3 in your example), then it should be done like this:&lt;BR /&gt;&lt;BR /&gt;MKL_set_dynamic(0);&lt;BR /&gt;omp_set_nested(1);&lt;BR /&gt;...&lt;BR /&gt;DftiSetValue(hand, DFTI_NUMBER_OF_USER_THREADS, K);&lt;BR /&gt;DftiCommitDescriptor(hand);&lt;BR /&gt;#pragma omp parallel&lt;BR /&gt;{&lt;BR /&gt; ...&lt;BR /&gt; work(..., hand); // K calls in the parallel team&lt;BR /&gt;}&lt;BR /&gt;&lt;BR /&gt;Thanks&lt;BR /&gt;Dima</description>
      <pubDate>Mon, 14 Dec 2009 07:04:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-get-best-performance-with-MKL-FFT/m-p/862417#M7592</guid>
      <dc:creator>Dmitry_B_Intel</dc:creator>
      <dc:date>2009-12-14T07:04:38Z</dc:date>
    </item>
  </channel>
</rss>

