<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Intel MKL performance drop OpenMP vs TBB in Software Archive</title>
    <link>https://community.intel.com/t5/Software-Archive/Intel-MKL-performance-drop-OpenMP-vs-TBB/m-p/1139800#M78211</link>
    <description>&lt;P&gt;change line 21 to:&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; for(int iRep=0; iRep&amp;lt;3; ++iRep) {&lt;/P&gt;

&lt;P&gt;and see what happens.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
    <pubDate>Wed, 05 Jul 2017 14:56:27 GMT</pubDate>
    <dc:creator>jimdempseyatthecove</dc:creator>
    <dc:date>2017-07-05T14:56:27Z</dc:date>
    <item>
      <title>Intel MKL performance drop OpenMP vs TBB</title>
      <link>https://community.intel.com/t5/Software-Archive/Intel-MKL-performance-drop-OpenMP-vs-TBB/m-p/1139799#M78210</link>
      <description>&lt;P&gt;Hi everyone,&lt;/P&gt;

&lt;P&gt;I tried the example program below on KNL and I am puzzled by the huge performance difference. It computes a small matrix-matrix product using MKL. In this (naive) example there is a 1000x performance difference when switching from OpenMP to TBB. The file was compiled with&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt; icc -std=c++11 -O3 -xmic-avx512 -mkl -qopenmp tbb_vs_omp.cpp -o omp
 icc -std=c++11 -O3 -xmic-avx512 -mkl -tbb tbb_vs_omp.cpp -o tbb&lt;/PRE&gt;

&lt;P&gt;I tried a few things, e.g. using tbb::task_scheduler_init or OpenMP environment variables, but nothing seems to make the TBB version nearly as fast as the OpenMP version, or the OpenMP version as slow. Does anyone know what the problem might be and how to fix it, i.e. how to configure TBB? The gap gets smaller as the problem size increases (only 10x for N=1024).&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include &amp;lt;iostream&amp;gt;

#include &amp;lt;mkl.h&amp;gt;

constexpr size_t N    = 64;
constexpr size_t RUNS = 20;

int main() {
  double* A = (double*)_mm_malloc(N * N * sizeof(double), 64);
  double* B = (double*)_mm_malloc(N * N * sizeof(double), 64);
  double* C = (double*)_mm_malloc(N * N * sizeof(double), 64);

  VSLStreamStatePtr stream;
  vslNewStream(&amp;amp;stream, VSL_BRNG_SFMT19937, 1337);
  vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, N * N, A, -10, 10);
  vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, N * N, B, -10, 10);
  vslDeleteStream(&amp;amp;stream);

  std::cout &amp;lt;&amp;lt; "Created matrices, N = " &amp;lt;&amp;lt; N &amp;lt;&amp;lt; ".\n";

  {
    double total = 0.0;
    cblas_dgemm(CBLAS_LAYOUT::CblasColMajor, CBLAS_TRANSPOSE::CblasTrans,
                CBLAS_TRANSPOSE::CblasNoTrans, N, N, N, 1.0, A, N /* lda */, B,
                N /* ldb */, 0.0, C, N /* ldc */);
    for (size_t i = 0; i &amp;lt; RUNS; ++i) {
      // A[0] = i;
      double start = dsecnd();
      cblas_dgemm(CBLAS_LAYOUT::CblasColMajor, CBLAS_TRANSPOSE::CblasTrans,
                  CBLAS_TRANSPOSE::CblasNoTrans, N, N, N, 1.0, A, N /* lda */,
                  B, N /* ldb */, 0.0, C, N /* ldc */);
      total += dsecnd() - start;
    }
    std::cout &amp;lt;&amp;lt; "Time needed " &amp;lt;&amp;lt; total &amp;lt;&amp;lt; ", ";
  }

  std::cout &amp;lt;&amp;lt; C[0] &amp;lt;&amp;lt; '\n';

  _mm_free(A);
  _mm_free(B);
  _mm_free(C);
  return 0;
}
&lt;/PRE&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 04 Jul 2017 18:21:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Intel-MKL-performance-drop-OpenMP-vs-TBB/m-p/1139799#M78210</guid>
      <dc:creator>Jannik</dc:creator>
      <dc:date>2017-07-04T18:21:28Z</dc:date>
    </item>
    <item>
      <title>change line 21 to:</title>
      <link>https://community.intel.com/t5/Software-Archive/Intel-MKL-performance-drop-OpenMP-vs-TBB/m-p/1139800#M78211</link>
      <description>&lt;P&gt;change line 21 to:&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; for(int iRep=0; iRep&amp;lt;3; ++iRep) {&lt;/P&gt;

&lt;P&gt;and see what happens.&lt;/P&gt;
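In isolation, the suggested repetition loop looks like the sketch below. A plain C++ stand-in workload (a hypothetical `work()` function) replaces the MKL call here, since only the warm-up structure is the point: the first pass absorbs one-time costs, and later passes show steady-state timing.

```cpp
#include <cassert>
#include <chrono>
#include <iostream>
#include <vector>

// Hypothetical stand-in for the timed cblas_dgemm call; the repetition
// structure, not the workload, is what the suggestion is about.
static double work() {
    double s = 0.0;
    for (int i = 0; i < 1000000; ++i) s += i * 1e-9;
    return s;
}

// Time the workload n times. The first pass absorbs one-time costs
// (thread-pool creation, page faults); later passes reflect
// steady-state performance.
std::vector<double> time_reps(int n) {
    std::vector<double> timings;
    for (int iRep = 0; iRep < n; ++iRep) {  // the suggested outer loop
        auto start = std::chrono::steady_clock::now();
        volatile double r = work();
        (void)r;  // keep the workload from being optimized away
        auto stop = std::chrono::steady_clock::now();
        double secs = std::chrono::duration<double>(stop - start).count();
        std::cout << "rep " << iRep << ": " << secs << " s\n";
        timings.push_back(secs);
    }
    return timings;
}
```

Calling `time_reps(3)` prints one timing line per repetition; in the original program the same outer loop would wrap the timed dgemm section.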

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Wed, 05 Jul 2017 14:56:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Intel-MKL-performance-drop-OpenMP-vs-TBB/m-p/1139800#M78211</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2017-07-05T14:56:27Z</dc:date>
    </item>
    <item>
      <title>thank you, I tried this and</title>
      <link>https://community.intel.com/t5/Software-Archive/Intel-MKL-performance-drop-OpenMP-vs-TBB/m-p/1139801#M78212</link>
      <description>&lt;P&gt;thank you, I tried this and the next calls are faster, but there is still a huge difference. Some numbers (all examples ran on KNL):&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;TBB: First loop 0.2s, next loops around 0.015s&lt;/LI&gt;
	&lt;LI&gt;OMP around: 0.00055s each time&lt;/LI&gt;
&lt;/UL&gt;

&lt;P&gt;Could you give me a hint why the next runs are faster? I expected the first call to be slow because of thread creation, but why does it take so many calls? On a i5 the two versions take about the same time.&lt;/P&gt;

&lt;P&gt;edit: same results using gcc.&lt;/P&gt;</description>
      <pubDate>Thu, 06 Jul 2017 12:18:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Intel-MKL-performance-drop-OpenMP-vs-TBB/m-p/1139801#M78212</guid>
      <dc:creator>Jannik</dc:creator>
      <dc:date>2017-07-06T12:18:00Z</dc:date>
    </item>
    <item>
      <title>See: https://software.intel</title>
      <link>https://community.intel.com/t5/Software-Archive/Intel-MKL-performance-drop-OpenMP-vs-TBB/m-p/1139802#M78213</link>
      <description>&lt;P&gt;See: &lt;A href="https://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/281761"&gt;https://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/281761&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;TBB may incorporate functionality similar to KMP_BLOCKTIME, in which case the TBB threads may still be consuming processing time after the parallel work ends. To verify, add (to the iRep version) a timed wait of 2 seconds before your timed section. This should ensure that all non-master threads have suspended. It will not, however, ensure that the (threaded version of the) MKL thread pool has been initialized: the first MKL call will still incur the overhead of starting the MKL thread pool.&lt;/P&gt;

&lt;P&gt;Lastly:&lt;/P&gt;

&lt;P&gt;Your timed section is too small to be measured effectively. Thread start/stop/barrier times when running with 64 to 256 threads are significant.&lt;/P&gt;

&lt;P&gt;64 x 64 matrices of doubles are relatively small, and may even be too small to use the parallel version of MKL effectively. Ensure that the sequential version of MKL is used for this test program&amp;nbsp;(-mkl=sequential)&lt;/P&gt;
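Applied to the compile lines from the original post, that would look roughly like the following (flag spelling assumed from the Linux-style icc driver used above):

```shell
# Rebuild the TBB variant against sequential MKL, so MKL itself does
# not spin up a second thread pool inside the timed calls.
icc -std=c++11 -O3 -xmic-avx512 -mkl=sequential -tbb tbb_vs_omp.cpp -o tbb_seq
```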

&lt;P&gt;Also note:&lt;/P&gt;

&lt;P&gt;If you predominantly call MKL from &lt;STRONG&gt;multiple threads within TBB &lt;/STRONG&gt;(e.g. parallel_for and/or other concurrent task)...&lt;BR /&gt;
	... then link with the &lt;STRONG&gt;serial version of MKL&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;In other words, ensure that MKL does not spawn a new thread pool for each of the host's threads.&lt;/P&gt;

&lt;P&gt;If you predominantly call MKL from&amp;nbsp;&lt;STRONG&gt;a single&amp;nbsp;thread within TBB &lt;/STRONG&gt;(e.g. main thread or other dedicated thread)...&lt;BR /&gt;
	... then link with the&amp;nbsp;&lt;STRONG&gt;parallel version of MKL&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;While this may seem backwards, it is not. Both versions of MKL are thread-safe. The distinction is whether or not MKL should spawn a thread pool in the context of the calling thread.&lt;/P&gt;
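To illustrate the first case (many application threads, each making library calls), here is a minimal sketch using std::thread as a stand-in for TBB tasks, with a hypothetical `serial_kernel()` in place of a sequential-MKL dgemm call:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical stand-in for a call into the serial MKL library
// (e.g. cblas_dgemm when linked with -mkl=sequential).
std::atomic<int> kernel_calls{0};

double serial_kernel(int seed) {
    double s = 0.0;
    for (int i = 0; i < 100000; ++i) s += (seed + i) * 1e-9;
    ++kernel_calls;
    return s;
}

// Many worker threads each call the serial kernel. Because the kernel
// itself spawns no threads, there is no nested thread pool per worker,
// which is the situation the serial-MKL link line is meant to produce.
void run_from_many_threads(int n) {
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i)
        workers.emplace_back([i] { serial_kernel(i); });
    for (auto& t : workers) t.join();
}
```

With the parallel MKL linked instead, each of those workers would try to drive its own thread pool, multiplying the thread count instead of the throughput.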

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Thu, 06 Jul 2017 15:00:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Intel-MKL-performance-drop-OpenMP-vs-TBB/m-p/1139802#M78213</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2017-07-06T15:00:14Z</dc:date>
    </item>
  </channel>
</rss>

