<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic No problem, happy to help.  I in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Complex-1-D-DFT-not-respecting-DFTI-THREAD-LIMIT/m-p/1037029#M20467</link>
    <description>&lt;P&gt;No problem, happy to help. &amp;nbsp;I had a few follow-up questions. &amp;nbsp;I noticed that, when using a complex DFT in the example provided, if I change DFTI_ORDERED to DFTI_BACKWARD_SCRAMBLED, the thread limit automatically gets set to 1 and only a single thread is used. &amp;nbsp;Is this behavior expected? &amp;nbsp;I couldn't find anything in the documentation about ordering affecting the thread limit.&lt;/P&gt;

&lt;P&gt;Also, if I use two different DFT descriptor handles within the same programming scope and I want each to use a different number of threads (let's say 4 and 2, respectively), is it safe to call mkl_set_num_threads(4), commit the first DFT descriptor, then call mkl_set_num_threads(2), commit the second descriptor, and then use both within a loop? &amp;nbsp;Will this effectively let the first handle use 4 threads and the second use 2 (assuming I have 6+ CPUs to work with on the machine)?&lt;/P&gt;

&lt;P&gt;Cheers,&lt;/P&gt;

&lt;P&gt;Nick&lt;/P&gt;</description>
    <pubDate>Thu, 23 Apr 2015 21:58:54 GMT</pubDate>
    <dc:creator>nicpac22</dc:creator>
    <dc:date>2015-04-23T21:58:54Z</dc:date>
    <item>
      <title>Complex 1-D DFT not respecting DFTI_THREAD_LIMIT</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Complex-1-D-DFT-not-respecting-DFTI-THREAD-LIMIT/m-p/1037026#M20464</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;I recently noticed that when using a threaded 1-dimensional DFT, a DFTI_COMPLEX domain DFT does not appear to respect the DFTI_THREAD_LIMIT and instead always uses the threading value set by mkl_set_num_threads(). &amp;nbsp;Furthermore, it appears that while a REAL domain DFT does obey DFTI_THREAD_LIMIT, its behavior has changed between MKL v11.1 and 11.2.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;In MKL v11.1, calling mkl_set_num_threads(4) followed by DftiSetValue(dft_handle, DFTI_THREAD_LIMIT, 2) caused a complex-valued DFT to spawn and use 4 threads, while a real-valued DFT spawned and used only 2 threads. &amp;nbsp;In MKL 11.2, the complex-valued DFT still spawned and used 4 threads, but the real-valued DFT now also spawned 4 threads while fully utilizing only 2 of them (2 cores were utilized at 90%+ and 2 at ~10%). &amp;nbsp;Below is the example code I've been using to replicate this behavior. &amp;nbsp;I was wondering if I'm doing something wrong or possibly misunderstanding the expected behavior of DFTI_THREAD_LIMIT.&lt;/SPAN&gt;&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;// mkl_thread_test.cpp - Computes large threaded DFTs
//  arg1 = 'C' for complex domain or 'R' for real (optional, default REAL)
//  arg2 = scale factor for DFT (optional, default 1)

#include &amp;lt;iostream&amp;gt;
#include &amp;lt;cstdlib&amp;gt;
#include &amp;lt;vector&amp;gt;
#include &amp;lt;string.h&amp;gt;
#include &amp;lt;mkl.h&amp;gt;
#include &amp;lt;omp.h&amp;gt;

using namespace std;

int main(int argc, char* argv[])
{
  int mklThreads = 4;
  int dftThreads = 2;
  int dftSize = 10000000;
  int loops = 100;
  
  DFTI_CONFIG_VALUE domain = DFTI_REAL;
  int sizeMultiplier = 1;
  if(argc &amp;gt; 1 &amp;amp;&amp;amp; argv[1][0] == 'C') // arg1 is optional, so check argc before reading it
  {
    domain = DFTI_COMPLEX;
    sizeMultiplier = 2;
  }
  float scale = 1.0f;
  if(argc &amp;gt; 2) scale = atof(argv[2]);
  
  // print version number for reference
  char version[DFTI_VERSION_LENGTH];
  DftiGetValue(0, DFTI_VERSION, version);
  cerr&amp;lt;&amp;lt;"MKL Version: "&amp;lt;&amp;lt;version&amp;lt;&amp;lt;endl;
  
  vector&amp;lt;float&amp;gt; vin((dftSize+loops)*sizeMultiplier);
  vector&amp;lt;float&amp;gt; vout(dftSize*sizeMultiplier);
  vector&amp;lt;float&amp;gt; vtmp(loops); // for saving output to avoid compiler optimizing out computation
  for(int i=0; i&amp;lt;vin.size(); ++i)
  {
    vin[i] = float(rand())/float(RAND_MAX)-.5;
  }
  
  MKL_LONG status;
  DFTI_DESCRIPTOR_HANDLE dft;
  mkl_set_num_threads(mklThreads);
  status = DftiCreateDescriptor(&amp;amp;dft, DFTI_SINGLE, domain, 1, dftSize);
  if(status != DFTI_NO_ERROR) cerr&amp;lt;&amp;lt;DftiErrorMessage(status)&amp;lt;&amp;lt;endl;
  
  status = DftiSetValue(dft, DFTI_PLACEMENT, DFTI_INPLACE);
  if(status != DFTI_NO_ERROR) cerr&amp;lt;&amp;lt;DftiErrorMessage(status)&amp;lt;&amp;lt;endl;
  
  status = DftiSetValue(dft, DFTI_PACKED_FORMAT, DFTI_PERM_FORMAT);
  if(status != DFTI_NO_ERROR) cerr&amp;lt;&amp;lt;DftiErrorMessage(status)&amp;lt;&amp;lt;endl;
  
  status = DftiSetValue(dft, DFTI_ORDERING, DFTI_ORDERED);
  if(status != DFTI_NO_ERROR) cerr&amp;lt;&amp;lt;DftiErrorMessage(status)&amp;lt;&amp;lt;endl;
  
  status = DftiSetValue(dft, DFTI_FORWARD_SCALE, scale);
  if(status != DFTI_NO_ERROR) cerr&amp;lt;&amp;lt;DftiErrorMessage(status)&amp;lt;&amp;lt;endl;
  
  status = DftiSetValue(dft, DFTI_THREAD_LIMIT, dftThreads);
  if(status != DFTI_NO_ERROR) cerr&amp;lt;&amp;lt;DftiErrorMessage(status)&amp;lt;&amp;lt;endl;
  
  status = DftiCommitDescriptor(dft);
  if (status != DFTI_NO_ERROR) cerr&amp;lt;&amp;lt;DftiErrorMessage(status)&amp;lt;&amp;lt;endl;
  
  cerr&amp;lt;&amp;lt;"Computing "&amp;lt;&amp;lt;loops;
  if(domain == DFTI_COMPLEX) cerr&amp;lt;&amp;lt;" COMPLEX";
  else cerr&amp;lt;&amp;lt;" REAL";
  cerr&amp;lt;&amp;lt;" DFTs of size "&amp;lt;&amp;lt;dftSize&amp;lt;&amp;lt;" using "&amp;lt;&amp;lt;mklThreads&amp;lt;&amp;lt;" MKL threads but limiting DFT to "&amp;lt;&amp;lt;dftThreads&amp;lt;&amp;lt;" threads"&amp;lt;&amp;lt;endl;
  
  MKL_LONG threadLimit;
  status = DftiGetValue(dft, DFTI_THREAD_LIMIT, &amp;amp;threadLimit);
  if(status != DFTI_NO_ERROR) cerr&amp;lt;&amp;lt;DftiErrorMessage(status)&amp;lt;&amp;lt;endl;
  cerr&amp;lt;&amp;lt;"Thread limit: "&amp;lt;&amp;lt;threadLimit&amp;lt;&amp;lt;endl;
  
  for(int i=0; i&amp;lt;loops; ++i)
  {
    memcpy(&amp;amp;vout[0], &amp;amp;vin[i], dftSize*sizeMultiplier*sizeof(float));
    status = DftiComputeForward(dft, &amp;amp;vout[0]);
    if(status != DFTI_NO_ERROR) cerr&amp;lt;&amp;lt;DftiErrorMessage(status)&amp;lt;&amp;lt;endl;
    vtmp[i] = vout[0]; // save a value to avoid optimizing out the DFT computation
  }
  cerr&amp;lt;&amp;lt;"Finished execution"&amp;lt;&amp;lt;endl;
  status = DftiFreeDescriptor(&amp;amp;dft);
  if(status != DFTI_NO_ERROR) cerr&amp;lt;&amp;lt;DftiErrorMessage(status)&amp;lt;&amp;lt;endl;
  
  return 0;
}
&lt;/PRE&gt;

&lt;P&gt;For reference, I'm running on a quad-core processor with 64-bit Linux (Ubuntu 14.04) and compiling with icpc v14.0.4 (for MKL v11.1) and v15.0.2 (for MKL v11.2). &amp;nbsp;My compile line looks like:&lt;/P&gt;

&lt;P&gt;icpc -O3 -xHost -openmp -I${MKLROOT}/include -o mkl_thread_test mkl_thread_test.cpp -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -lpthread -lm -liomp5&lt;/P&gt;

&lt;P&gt;When I call ./mkl_thread_test C, I see 4 cores spinning at 95%. &amp;nbsp;When I call ./mkl_thread_test R, I see just 2 cores spinning at 95% for MKL v11.1, and 2 cores spinning at 95% plus 2 more cores spinning at 10% for MKL v11.2. &amp;nbsp;My exact versions of MKL are&amp;nbsp;11.1.4 Product Build 20140806 and&amp;nbsp;11.2.2 Product Build 20150120. &amp;nbsp;In both cases, the value returned by DftiGetValue(DFTI_THREAD_LIMIT) is 2, so my expectation is that the DFT should only be using 2 threads regardless of MKL version or real vs. complex DFT.&lt;/P&gt;
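
&lt;P&gt;Stated as a small model, the expectation here is that the effective thread count is the smaller of the global MKL setting and the descriptor's limit. The sketch below only illustrates that expectation: expected_dft_threads is a hypothetical helper, not an MKL function, and treating a limit of 0 as "no limit" is an assumption.&lt;/P&gt;

<![CDATA[
```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper (not part of MKL): models the thread count the poster
// expects a descriptor to use. Assumption: a limit of 0 or less means "no limit".
inline int expected_dft_threads(int mkl_threads, int thread_limit) {
  if (thread_limit <= 0) return mkl_threads;   // no per-descriptor cap
  return std::min(mkl_threads, thread_limit);  // cap at the descriptor's limit
}
```
]]>

&lt;P&gt;With the values from the test program (4 MKL threads, a limit of 2), this model gives 2 threads for both the real and the complex case.&lt;/P&gt;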

&lt;P&gt;Am I doing something wrong or should MKL be respecting the value of DFTI_THREAD_LIMIT?&lt;/P&gt;

&lt;P&gt;--Nick&lt;/P&gt;</description>
      <pubDate>Thu, 23 Apr 2015 00:34:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Complex-1-D-DFT-not-respecting-DFTI-THREAD-LIMIT/m-p/1037026#M20464</guid>
      <dc:creator>nicpac22</dc:creator>
      <dc:date>2015-04-23T00:34:52Z</dc:date>
    </item>
    <item>
      <title>Hi Nick,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Complex-1-D-DFT-not-respecting-DFTI-THREAD-LIMIT/m-p/1037027#M20465</link>
      <description>&lt;P&gt;Hi Nick,&lt;/P&gt;

&lt;P&gt;Thanks for the sample. &amp;nbsp;I had assumed that threading of 1D FFTs was limited in this case, but according to the&amp;nbsp;documentation:&lt;/P&gt;

&lt;P&gt;1D transforms are threaded in many cases.&lt;BR /&gt;
	&lt;BR /&gt;
	(N) &amp;gt; 16, and input/output strides equal 1.&lt;BR /&gt;
	1D complex-to-complex transforms using split-complex layout are not threaded.&lt;BR /&gt;
	Multidimensional transforms&lt;/P&gt;

&lt;P&gt;We&amp;nbsp;will look into this and get back to you.&lt;/P&gt;

&lt;P&gt;Regards,&lt;/P&gt;

&lt;P&gt;Ying&lt;/P&gt;</description>
      <pubDate>Thu, 23 Apr 2015 08:38:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Complex-1-D-DFT-not-respecting-DFTI-THREAD-LIMIT/m-p/1037027#M20465</guid>
      <dc:creator>Ying_H_Intel</dc:creator>
      <dc:date>2015-04-23T08:38:48Z</dc:date>
    </item>
    <item>
      <title>Hi Nick,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Complex-1-D-DFT-not-respecting-DFTI-THREAD-LIMIT/m-p/1037028#M20466</link>
      <description>&lt;P&gt;Hi Nick,&lt;/P&gt;

&lt;P&gt;Thank you for reporting this issue. We will fix it in a future release of Intel MKL.&lt;/P&gt;

&lt;P&gt;You may want to use mkl_set_num_threads_local to set the number of threads for _all_ Intel MKL functions on the _current_ execution thread, or mkl_set_num_threads to set the number of threads for _all_ Intel MKL functions on _all_ execution threads, or mkl_domain_set_num_threads to set the number of threads for _specific_ Intel MKL functions on _all_ execution threads.&lt;/P&gt;
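
&lt;P&gt;One way to apply a per-call thread count is a small scope guard around each compute call. The sketch below is an illustration under stated assumptions, not Intel's recommended implementation: ThreadSetter, ScopedThreadCount and fake_set_num_threads_local are hypothetical names. The guard accepts any function shaped like mkl_set_num_threads_local, which takes the new thread count and returns the previous one. A stand-in setter is used so the sketch compiles without MKL.&lt;/P&gt;

<![CDATA[
```cpp
#include <cassert>

// Hypothetical names: none of these are part of MKL.
// A ThreadSetter installs a new thread count and returns the previous one,
// which is the shape of mkl_set_num_threads_local.
using ThreadSetter = int (*)(int);

// RAII guard: applies `threads` on construction and restores the previous
// value when the scope ends, even on early return or exception.
class ScopedThreadCount {
public:
  ScopedThreadCount(ThreadSetter setter, int threads)
      : setter_(setter), previous_(setter(threads)) {}
  ~ScopedThreadCount() { setter_(previous_); }

  // Non-copyable, so the previous value is restored exactly once.
  ScopedThreadCount(const ScopedThreadCount&) = delete;
  ScopedThreadCount& operator=(const ScopedThreadCount&) = delete;

private:
  ThreadSetter setter_;
  int previous_;
};

// Stand-in for the real setter so the sketch builds without MKL.
// It only records the requested value and returns the old one.
static int g_fake_threads = 0;
inline int fake_set_num_threads_local(int n) {
  int old = g_fake_threads;
  g_fake_threads = n;
  return old;
}
```
]]>

&lt;P&gt;With MKL linked in, one would pass mkl_set_num_threads_local as the setter and open a guard in the scope that calls DftiComputeForward, for example 4 threads around the first descriptor and 2 around the second.&lt;/P&gt;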

&lt;P&gt;Thank you!&lt;/P&gt;

&lt;P&gt;Evgueni.&lt;/P&gt;</description>
      <pubDate>Thu, 23 Apr 2015 09:41:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Complex-1-D-DFT-not-respecting-DFTI-THREAD-LIMIT/m-p/1037028#M20466</guid>
      <dc:creator>Evgueni_P_Intel</dc:creator>
      <dc:date>2015-04-23T09:41:00Z</dc:date>
    </item>
    <item>
      <title>No problem, happy to help.  I</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Complex-1-D-DFT-not-respecting-DFTI-THREAD-LIMIT/m-p/1037029#M20467</link>
      <description>&lt;P&gt;No problem, happy to help. &amp;nbsp;I had a few follow-up questions. &amp;nbsp;I noticed that, when using a complex DFT in the example provided, if I change DFTI_ORDERED to DFTI_BACKWARD_SCRAMBLED, the thread limit automatically gets set to 1 and only a single thread is used. &amp;nbsp;Is this behavior expected? &amp;nbsp;I couldn't find anything in the documentation about ordering affecting the thread limit.&lt;/P&gt;

&lt;P&gt;Also, if I use two different DFT descriptor handles within the same programming scope and I want each to use a different number of threads (let's say 4 and 2, respectively), is it safe to call mkl_set_num_threads(4), commit the first DFT descriptor, then call mkl_set_num_threads(2), commit the second descriptor, and then use both within a loop? &amp;nbsp;Will this effectively let the first handle use 4 threads and the second use 2 (assuming I have 6+ CPUs to work with on the machine)?&lt;/P&gt;

&lt;P&gt;Cheers,&lt;/P&gt;

&lt;P&gt;Nick&lt;/P&gt;</description>
      <pubDate>Thu, 23 Apr 2015 21:58:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Complex-1-D-DFT-not-respecting-DFTI-THREAD-LIMIT/m-p/1037029#M20467</guid>
      <dc:creator>nicpac22</dc:creator>
      <dc:date>2015-04-23T21:58:54Z</dc:date>
    </item>
    <item>
      <title>If DFTI_BACKWARD_SCRAMBLED is</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Complex-1-D-DFT-not-respecting-DFTI-THREAD-LIMIT/m-p/1037030#M20468</link>
      <description>&lt;P&gt;If DFTI_BACKWARD_SCRAMBLED is set, 1D FFTs are not threaded, though they could be and they would possibly be faster than in the DFTI_ORDERED case. If performance of large 1D transforms is critical for your application, please request tuning the scrambled case at premier.intel.com.&lt;/P&gt;

&lt;P&gt;The workaround that you propose (calling mkl_set_num_threads)&amp;nbsp;should work with&amp;nbsp;the existing versions of Intel MKL but may stop working in the future.&lt;/P&gt;

&lt;P&gt;Thank you!&lt;/P&gt;

&lt;P&gt;Evgueni.&lt;/P&gt;</description>
      <pubDate>Fri, 24 Apr 2015 14:08:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Complex-1-D-DFT-not-respecting-DFTI-THREAD-LIMIT/m-p/1037030#M20468</guid>
      <dc:creator>Evgueni_P_Intel</dc:creator>
      <dc:date>2015-04-24T14:08:46Z</dc:date>
    </item>
  </channel>
</rss>

