<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic outer loop OpenMP + inner loop vectorization vs MKL in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/outter-loop-openMP-inner-loop-vectorization-vs-MKL/m-p/862809#M7614</link>
    <description>Hi,&lt;BR /&gt;&lt;BR /&gt;
I have written a simple code that implements b=A*x and tested it on a machine with two quad-core CPUs. When compiled, its outer loop is parallelized with OpenMP and its inner loop is vectorized.&lt;BR /&gt;&lt;BR /&gt;
#pragma omp parallel for&lt;BR /&gt;
for (int i=0; i&amp;lt;N; i++) {&lt;BR /&gt;
 for (int j=0; j&amp;lt;N; j++) {&lt;BR /&gt;
 b[i]+=A[i*N+j]*x[j];&lt;BR /&gt;
 }&lt;BR /&gt;
}&lt;BR /&gt;&lt;BR /&gt;
With the number of threads set to 8, N=20000 and &lt;SPAN class="Code"&gt;KMP_AFFINITY=verbose, the output is:&lt;BR /&gt;&lt;BR /&gt;
KMP_AFFINITY: Affinity capable, using global cpuid instr info&lt;BR /&gt;
KMP_AFFINITY: Initial OS proc set respected:&lt;BR /&gt;
{0,1,2,3,4,5,6,7}&lt;BR /&gt;
KMP_AFFINITY: 8 available OS procs - Uniform topology of&lt;BR /&gt;
KMP_AFFINITY: 2 packages x 4 cores/pkg x 1 threads/core (8 total cores)&lt;BR /&gt;
KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1,2,3,4,5,6,7}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 1 bound to OS proc set {0,1,2,3,4,5,6,7}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 2 bound to OS proc set {0,1,2,3,4,5,6,7}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 4 bound to OS proc set {0,1,2,3,4,5,6,7}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 5 bound to OS proc set {0,1,2,3,4,5,6,7}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 6 bound to OS proc set {0,1,2,3,4,5,6,7}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 7 bound to OS proc set {0,1,2,3,4,5,6,7}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 3 bound to OS proc set {0,1,2,3,4,5,6,7}&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;
vec mv time: 90.882004 ms&lt;BR /&gt;&lt;BR /&gt;
which is a little faster than MKL sgemv. With 4 threads it takes ~102 ms.
&lt;B&gt;Why is there only a slight improvement when the number of threads is doubled?&lt;/B&gt;&lt;BR /&gt;&lt;BR /&gt;
However, when I set &lt;/SPAN&gt;&lt;SPAN class="Code"&gt;KMP_AFFINITY=verbose,compact (in this case MKL sgemv has its best performance, ~92 ms), the timing of the above code changes a lot:&lt;BR /&gt;&lt;BR /&gt;
KMP_AFFINITY: Affinity capable, using global cpuid instr info&lt;BR /&gt;
KMP_AFFINITY: Initial OS proc set respected:&lt;BR /&gt;
{0,1,2,3,4,5,6,7}&lt;BR /&gt;
KMP_AFFINITY: 8 available OS procs - Uniform topology of&lt;BR /&gt;
KMP_AFFINITY: 2 packages x 4 cores/pkg x 1 threads/core (8 total cores)&lt;BR /&gt;
KMP_AFFINITY: OS proc to physical thread map ([] =&amp;gt; level not in map):&lt;BR /&gt;
KMP_AFFINITY: OS proc 0 maps to package 0 core 0 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 1 maps to package 0 core 1 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 2 maps to package 0 core 2 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 3 maps to package 0 core 3 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 4 maps to package 1 core 0 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 5 maps to package 1 core 1 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 6 maps to package 1 core 2 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 7 maps to package 1 core 3 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 1 bound to OS proc set {1}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 2 bound to OS proc set {2}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 3 bound to OS proc set {3}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 4 bound to OS proc set {4}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 5 bound to OS proc set {5}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 6 bound to OS proc set {6}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 7 bound to OS proc set {7}&lt;BR /&gt;&lt;BR /&gt;
vec mv time: 284.753998&lt;BR /&gt;&lt;BR /&gt;
With KMP_AFFINITY=verbose,scatter the time improves from 284 ms to 137 ms, but that is still much worse than 90 ms!&lt;BR /&gt;&lt;BR /&gt;
KMP_AFFINITY: Affinity capable, using global cpuid instr info&lt;BR /&gt;
KMP_AFFINITY: Initial OS proc set respected:&lt;BR /&gt;
{0,1,2,3,4,5,6,7}&lt;BR /&gt;
KMP_AFFINITY: 8 available OS procs - Uniform topology of&lt;BR /&gt;
KMP_AFFINITY: 2 packages x 4 cores/pkg x 1 threads/core (8 total cores)&lt;BR /&gt;
KMP_AFFINITY: OS proc to physical thread map ([] =&amp;gt; level not in map):&lt;BR /&gt;
KMP_AFFINITY: OS proc 0 maps to package 0 core 0 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 1 maps to package 0 core 1 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 2 maps to package 0 core 2 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 3 maps to package 0 core 3 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 4 maps to package 1 core 0 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 5 maps to package 1 core 1 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 6 maps to package 1 core 2 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 7 maps to package 1 core 3 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 1 bound to OS proc set {4}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 2 bound to OS proc set {1}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 3 bound to OS proc set {5}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 4 bound to OS proc set {2}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 5 bound to OS proc set {6}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 6 bound to OS proc set {3}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 7 bound to OS proc set {7}&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;
vec mv time: 137.539993&lt;BR /&gt;&lt;BR /&gt;
&lt;/SPAN&gt;&lt;SPAN class="Code"&gt;&lt;B&gt;Why does controlling how the threads are distributed among the cores give such poor performance?&lt;/B&gt;&lt;BR /&gt;&lt;BR /&gt;
With KMP_AFFINITY set to scatter or compact, it seems that OpenMP + vectorization performs WORSE than vectorization of the inner loop alone!&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;</description>
    <pubDate>Sun, 13 Dec 2009 03:51:32 GMT</pubDate>
    <dc:creator>pilot117</dc:creator>
    <dc:date>2009-12-13T03:51:32Z</dc:date>
    <item>
      <title>outer loop OpenMP + inner loop vectorization vs MKL</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/outter-loop-openMP-inner-loop-vectorization-vs-MKL/m-p/862809#M7614</link>
      <description>Hi,&lt;BR /&gt;&lt;BR /&gt;
I have written a simple code that implements b=A*x and tested it on a machine with two quad-core CPUs. When compiled, its outer loop is parallelized with OpenMP and its inner loop is vectorized.&lt;BR /&gt;&lt;BR /&gt;
#pragma omp parallel for&lt;BR /&gt;
for (int i=0; i&amp;lt;N; i++) {&lt;BR /&gt;
 for (int j=0; j&amp;lt;N; j++) {&lt;BR /&gt;
 b[i]+=A[i*N+j]*x[j];&lt;BR /&gt;
 }&lt;BR /&gt;
}&lt;BR /&gt;&lt;BR /&gt;
With the number of threads set to 8, N=20000 and &lt;SPAN class="Code"&gt;KMP_AFFINITY=verbose, the output is:&lt;BR /&gt;&lt;BR /&gt;
KMP_AFFINITY: Affinity capable, using global cpuid instr info&lt;BR /&gt;
KMP_AFFINITY: Initial OS proc set respected:&lt;BR /&gt;
{0,1,2,3,4,5,6,7}&lt;BR /&gt;
KMP_AFFINITY: 8 available OS procs - Uniform topology of&lt;BR /&gt;
KMP_AFFINITY: 2 packages x 4 cores/pkg x 1 threads/core (8 total cores)&lt;BR /&gt;
KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1,2,3,4,5,6,7}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 1 bound to OS proc set {0,1,2,3,4,5,6,7}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 2 bound to OS proc set {0,1,2,3,4,5,6,7}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 4 bound to OS proc set {0,1,2,3,4,5,6,7}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 5 bound to OS proc set {0,1,2,3,4,5,6,7}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 6 bound to OS proc set {0,1,2,3,4,5,6,7}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 7 bound to OS proc set {0,1,2,3,4,5,6,7}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 3 bound to OS proc set {0,1,2,3,4,5,6,7}&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;
vec mv time: 90.882004 ms&lt;BR /&gt;&lt;BR /&gt;
which is a little faster than MKL sgemv. With 4 threads it takes ~102 ms.
&lt;B&gt;Why is there only a slight improvement when the number of threads is doubled?&lt;/B&gt;&lt;BR /&gt;&lt;BR /&gt;
However, when I set &lt;/SPAN&gt;&lt;SPAN class="Code"&gt;KMP_AFFINITY=verbose,compact (in this case MKL sgemv has its best performance, ~92 ms), the timing of the above code changes a lot:&lt;BR /&gt;&lt;BR /&gt;
KMP_AFFINITY: Affinity capable, using global cpuid instr info&lt;BR /&gt;
KMP_AFFINITY: Initial OS proc set respected:&lt;BR /&gt;
{0,1,2,3,4,5,6,7}&lt;BR /&gt;
KMP_AFFINITY: 8 available OS procs - Uniform topology of&lt;BR /&gt;
KMP_AFFINITY: 2 packages x 4 cores/pkg x 1 threads/core (8 total cores)&lt;BR /&gt;
KMP_AFFINITY: OS proc to physical thread map ([] =&amp;gt; level not in map):&lt;BR /&gt;
KMP_AFFINITY: OS proc 0 maps to package 0 core 0 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 1 maps to package 0 core 1 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 2 maps to package 0 core 2 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 3 maps to package 0 core 3 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 4 maps to package 1 core 0 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 5 maps to package 1 core 1 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 6 maps to package 1 core 2 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 7 maps to package 1 core 3 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 1 bound to OS proc set {1}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 2 bound to OS proc set {2}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 3 bound to OS proc set {3}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 4 bound to OS proc set {4}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 5 bound to OS proc set {5}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 6 bound to OS proc set {6}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 7 bound to OS proc set {7}&lt;BR /&gt;&lt;BR /&gt;
vec mv time: 284.753998&lt;BR /&gt;&lt;BR /&gt;
With KMP_AFFINITY=verbose,scatter the time improves from 284 ms to 137 ms, but that is still much worse than 90 ms!&lt;BR /&gt;&lt;BR /&gt;
KMP_AFFINITY: Affinity capable, using global cpuid instr info&lt;BR /&gt;
KMP_AFFINITY: Initial OS proc set respected:&lt;BR /&gt;
{0,1,2,3,4,5,6,7}&lt;BR /&gt;
KMP_AFFINITY: 8 available OS procs - Uniform topology of&lt;BR /&gt;
KMP_AFFINITY: 2 packages x 4 cores/pkg x 1 threads/core (8 total cores)&lt;BR /&gt;
KMP_AFFINITY: OS proc to physical thread map ([] =&amp;gt; level not in map):&lt;BR /&gt;
KMP_AFFINITY: OS proc 0 maps to package 0 core 0 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 1 maps to package 0 core 1 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 2 maps to package 0 core 2 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 3 maps to package 0 core 3 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 4 maps to package 1 core 0 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 5 maps to package 1 core 1 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 6 maps to package 1 core 2 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: OS proc 7 maps to package 1 core 3 [thread 0]&lt;BR /&gt;
KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 1 bound to OS proc set {4}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 2 bound to OS proc set {1}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 3 bound to OS proc set {5}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 4 bound to OS proc set {2}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 5 bound to OS proc set {6}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 6 bound to OS proc set {3}&lt;BR /&gt;
KMP_AFFINITY: Internal thread 7 bound to OS proc set {7}&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;
vec mv time: 137.539993&lt;BR /&gt;&lt;BR /&gt;
&lt;/SPAN&gt;&lt;SPAN class="Code"&gt;&lt;B&gt;Why does controlling how the threads are distributed among the cores give such poor performance?&lt;/B&gt;&lt;BR /&gt;&lt;BR /&gt;
With KMP_AFFINITY set to scatter or compact, it seems that OpenMP + vectorization performs WORSE than vectorization of the inner loop alone!&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;</description>
      <pubDate>Sun, 13 Dec 2009 03:51:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/outter-loop-openMP-inner-loop-vectorization-vs-MKL/m-p/862809#M7614</guid>
      <dc:creator>pilot117</dc:creator>
      <dc:date>2009-12-13T03:51:32Z</dc:date>
    </item>
    <item>
      <title>outer loop OpenMP + inner loop vectorization vs MKL</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/outter-loop-openMP-inner-loop-vectorization-vs-MKL/m-p/862810#M7615</link>
      <description>&lt;P&gt;pilot117,&lt;/P&gt;
&lt;P&gt;First, please let me apologize that nobody has responded to your post sooner.&lt;/P&gt;
&lt;P&gt;Although it is difficult to be sure, I would guess that you are timing the entire program, including the time it takes to bind the threads to processors using KMP_AFFINITY. If you put an empty parallel region before your code (or perhaps one that just prints the result of omp_get_thread_num(), so that the compiler does not remove it), the binding will happen in that first (dummy) parallel region. Then, by putting timing calls immediately before and after the parallel region that does the actual work, you should see much better times, because you will no longer be including the time it takes to bind each thread to its processor.&lt;/P&gt;
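&lt;P&gt;A minimal sketch of the warm-up approach described above, assuming a C99 compiler with OpenMP enabled (for example gcc with -fopenmp). The helper names timed_matvec_ms, matvec and now_seconds are hypothetical, and the all-ones test data is only there so that the result can be checked; treat it as an illustration rather than a reconstruction of your program.&lt;/P&gt;
<![CDATA[
```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock seconds; falls back to CPU time via clock() when built without OpenMP. */
static double now_seconds(void) {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* b = A*x for a row-major n-by-n matrix; rows are split across threads. */
static void matvec(int n, const float *A, const float *x, float *b) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        float sum = 0.0f;
        for (int j = 0; j < n; j++)
            sum += A[(size_t)i * n + j] * x[j];
        b[i] = sum;
    }
}

/* Hypothetical helper: returns the kernel time in ms, excluding thread start-up
   and binding, or -1.0 on allocation failure or a wrong result. */
double timed_matvec_ms(int n) {
    assert(n > 0);
    float *A = malloc((size_t)n * n * sizeof *A);
    float *x = malloc((size_t)n * sizeof *x);
    float *b = malloc((size_t)n * sizeof *b);
    if (!A || !x || !b) { free(A); free(x); free(b); return -1.0; }
    for (size_t k = 0; k < (size_t)n * n; k++) A[k] = 1.0f;
    for (int j = 0; j < n; j++) x[j] = 1.0f;

    /* Dummy parallel region: the runtime creates and binds its threads here.
       The reduction result is used below so the region is not optimized away. */
    int threads_seen = 0;
    #pragma omp parallel reduction(+:threads_seen)
    threads_seen += 1;

    /* Timing calls placed directly around the region that does the real work. */
    double t0 = now_seconds();
    matvec(n, A, x, b);
    double t1 = now_seconds();

    /* With all-ones inputs every entry of b must equal n. */
    int ok = (threads_seen >= 1);
    for (int i = 0; i < n; i++)
        if (b[i] != (float)n) ok = 0;
    free(A); free(x); free(b);
    return ok ? (t1 - t0) * 1e3 : -1.0;
}
```
]]>
&lt;P&gt;Calling timed_matvec_ms(20000) would correspond to the problem size in the original post, provided roughly 1.6 GB of memory is available for A. Running it under each KMP_AFFINITY setting should then compare the computation alone.&lt;/P&gt;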
&lt;P&gt;Another option is to make the b=A*x arrays much larger to amortize the time it takes to bind the threads to processors. I think this should give you a better idea of the effect that the thread binding has on the computation, rather than of the time it takes to do the binding itself.&lt;/P&gt;
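&lt;P&gt;A rough sketch of a related way to amortize the binding cost, under stated assumptions: instead of enlarging the arrays, it repeats the same product reps times and reports the average, so the one-time cost is paid once and spread over all runs. Note that this swaps repetition in for array growth. The helper names avg_matvec_ms and now_seconds are hypothetical, and the all-ones data is only there so the result can be checked.&lt;/P&gt;
<![CDATA[
```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock seconds; falls back to CPU time via clock() when built without OpenMP. */
static double now_seconds(void) {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Hypothetical helper: average time in ms of one b = A*x over `reps`
   back-to-back runs, with no warm-up region, so any one-time thread
   start-up and binding cost is included once and divided by reps.
   Returns -1.0 on allocation failure or a wrong result. */
double avg_matvec_ms(int n, int reps) {
    assert(n > 0 && reps > 0);
    float *A = malloc((size_t)n * n * sizeof *A);
    float *x = malloc((size_t)n * sizeof *x);
    float *b = malloc((size_t)n * sizeof *b);
    if (!A || !x || !b) { free(A); free(x); free(b); return -1.0; }
    for (size_t k = 0; k < (size_t)n * n; k++) A[k] = 1.0f;
    for (int j = 0; j < n; j++) x[j] = 1.0f;

    double t0 = now_seconds();
    for (int r = 0; r < reps; r++) {
        /* Outer loop split across threads, inner loop left to the vectorizer. */
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            float sum = 0.0f;
            for (int j = 0; j < n; j++)
                sum += A[(size_t)i * n + j] * x[j];
            b[i] = sum;
        }
    }
    double t1 = now_seconds();

    /* With all-ones inputs every entry of b must equal n. */
    int ok = 1;
    for (int i = 0; i < n; i++)
        if (b[i] != (float)n) ok = 0;
    free(A); free(x); free(b);
    return ok ? (t1 - t0) * 1e3 / reps : -1.0;
}
```
]]>
&lt;P&gt;Comparing avg_matvec_ms(n, 1) with avg_matvec_ms(n, 20) for the same n should give a feel for how much of a single-run timing is one-time overhead.&lt;/P&gt;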
&lt;P&gt;Hope this helps,&lt;/P&gt;
&lt;P&gt;- Grant&lt;/P&gt;</description>
      <pubDate>Fri, 12 Feb 2010 16:26:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/outter-loop-openMP-inner-loop-vectorization-vs-MKL/m-p/862810#M7615</guid>
      <dc:creator>Grant_H_Intel</dc:creator>
      <dc:date>2010-02-12T16:26:07Z</dc:date>
    </item>
  </channel>
</rss>

