<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic MKL Performance issue in threaded application in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Performance-issue-in-threaded-application/m-p/1164882#M28174</link>
    <description>&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;Hi&lt;/P&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;We are working on RNN kernel optimization and we are trying to parallel 2 SGEMM on 2 socket SKX6148 server( 20 core per socket).&lt;/P&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;The SGEMM size is M = 20， N = 2400， K = 800.&lt;/P&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;Our target is to map the first SGEMM to socket0 and the other SGEMM to socket1.&lt;/P&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;We measured the GFLOPS with this benchmark(&lt;A href="https://github.com/xhzhao/GemmEfficiency/tree/tbb"&gt;https://github.com/xhzhao/GemmEfficiency/tree/tbb&lt;/A&gt;), and got the following performance data:&lt;/P&gt;

&lt;UL style="color: rgb(96, 96, 96);"&gt;
	&lt;LI&gt;OMP, 1 x 40 cores: 2261 GFLOPS, code:&amp;nbsp;&lt;SPAN style="font-size: 13.008px;"&gt;&lt;A href="https://github.com/xhzhao/GemmEfficiency/blob/tbb/test_omp.cpp#L120" target="_blank"&gt;https://github.com/xhzhao/GemmEfficiency/blob/tbb/test_omp.cpp#L120&lt;/A&gt;&lt;/SPAN&gt;&lt;/LI&gt;
	&lt;LI&gt;Pthread, 2 x 20 cores: 3550 GFLOPS, code:&amp;nbsp;&lt;SPAN style="font-size: 13.008px;"&gt;&lt;A href="https://github.com/xhzhao/GemmEfficiency/blob/tbb/test_omp.cpp#L291" target="_blank"&gt;https://github.com/xhzhao/GemmEfficiency/blob/tbb/test_omp.cpp#L291&lt;/A&gt;&lt;/SPAN&gt;&lt;/LI&gt;
	&lt;LI&gt;OMP nested, 2 x 20 cores: 1068 GFLOPS, code:&amp;nbsp;&lt;SPAN style="font-size: 13.008px;"&gt;&lt;A href="https://github.com/xhzhao/GemmEfficiency/blob/tbb/test_omp.cpp#L336" target="_blank"&gt;https://github.com/xhzhao/GemmEfficiency/blob/tbb/test_omp.cpp#L336&lt;/A&gt;&lt;/SPAN&gt;&lt;/LI&gt;
	&lt;LI&gt;TBB nested, 2 x 20 cores: 752 GFLOPS, code:&amp;nbsp;&lt;SPAN style="font-size: 13.008px;"&gt;&lt;A href="https://github.com/xhzhao/GemmEfficiency/blob/tbb/test_tbb.cpp#L159" target="_blank"&gt;https://github.com/xhzhao/GemmEfficiency/blob/tbb/test_tbb.cpp#L159&lt;/A&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;I found that the performance of OMP+MKL or TBB MKL is not as good as we expect, and i'm not sure if i miss something with MKL in threaded application.&lt;/P&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;BTW, the pthread+MKL solution is not suitable for our real case , as it will double the threads and make the performance even worse.&lt;/P&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;Thanks in advance.&lt;/P&gt;</description>
    <pubDate>Thu, 02 Aug 2018 01:43:16 GMT</pubDate>
    <dc:creator>Xiaohui_Z_Intel</dc:creator>
    <dc:date>2018-08-02T01:43:16Z</dc:date>
    <item>
      <title>MKL Performance issue in threaded application</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Performance-issue-in-threaded-application/m-p/1164882#M28174</link>
      <description>&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;Hi&lt;/P&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;We are working on RNN kernel optimization and we are trying to parallel 2 SGEMM on 2 socket SKX6148 server( 20 core per socket).&lt;/P&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;The SGEMM size is M = 20， N = 2400， K = 800.&lt;/P&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;Our target is to map the first SGEMM to socket0 and the other SGEMM to socket1.&lt;/P&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;We measured the GFLOPS with this benchmark(&lt;A href="https://github.com/xhzhao/GemmEfficiency/tree/tbb"&gt;https://github.com/xhzhao/GemmEfficiency/tree/tbb&lt;/A&gt;), and got the following performance data:&lt;/P&gt;

&lt;UL style="color: rgb(96, 96, 96);"&gt;
	&lt;LI&gt;OMP, 1 x 40 cores: 2261 GFLOPS, code:&amp;nbsp;&lt;SPAN style="font-size: 13.008px;"&gt;&lt;A href="https://github.com/xhzhao/GemmEfficiency/blob/tbb/test_omp.cpp#L120" target="_blank"&gt;https://github.com/xhzhao/GemmEfficiency/blob/tbb/test_omp.cpp#L120&lt;/A&gt;&lt;/SPAN&gt;&lt;/LI&gt;
	&lt;LI&gt;Pthread, 2 x 20 cores: 3550 GFLOPS, code:&amp;nbsp;&lt;SPAN style="font-size: 13.008px;"&gt;&lt;A href="https://github.com/xhzhao/GemmEfficiency/blob/tbb/test_omp.cpp#L291" target="_blank"&gt;https://github.com/xhzhao/GemmEfficiency/blob/tbb/test_omp.cpp#L291&lt;/A&gt;&lt;/SPAN&gt;&lt;/LI&gt;
	&lt;LI&gt;OMP nested, 2 x 20 cores: 1068 GFLOPS, code:&amp;nbsp;&lt;SPAN style="font-size: 13.008px;"&gt;&lt;A href="https://github.com/xhzhao/GemmEfficiency/blob/tbb/test_omp.cpp#L336" target="_blank"&gt;https://github.com/xhzhao/GemmEfficiency/blob/tbb/test_omp.cpp#L336&lt;/A&gt;&lt;/SPAN&gt;&lt;/LI&gt;
	&lt;LI&gt;TBB nested, 2 x 20 cores: 752 GFLOPS, code:&amp;nbsp;&lt;SPAN style="font-size: 13.008px;"&gt;&lt;A href="https://github.com/xhzhao/GemmEfficiency/blob/tbb/test_tbb.cpp#L159" target="_blank"&gt;https://github.com/xhzhao/GemmEfficiency/blob/tbb/test_tbb.cpp#L159&lt;/A&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
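
&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;For reference, a minimal sketch of how a GFLOPS figure is conventionally derived for this shape, assuming the usual 2*M*N*K operation count per GEMM. The linked benchmark's own formula and timing loop are not reproduced here; gemm_flops, gemm_gflops and elapsed_seconds are hypothetical names used only for illustration.&lt;/P&gt;

&lt;PRE&gt;
```python
# Conventional FLOP count for one GEMM: each of the M*N outputs needs K multiply-adds.
# This is an assumption about the metric, not taken from the benchmark source.
M, N, K = 20, 2400, 800  # SGEMM shape from the post


def gemm_flops(m, n, k):
    """Floating-point operations for one m x n x k GEMM (2 per multiply-add)."""
    return 2 * m * n * k


def gemm_gflops(m, n, k, elapsed_seconds, calls=1):
    """GFLOPS for `calls` GEMMs that finish in `elapsed_seconds` (hypothetical helper)."""
    return calls * gemm_flops(m, n, k) / elapsed_seconds / 1e9


flops_per_call = gemm_flops(M, N, K)  # 76_800_000

# Time per call implied by a reported rate, here the 2261 GFLOPS entry from the list.
implied_seconds = flops_per_call / (2261 * 1e9)
print(f"{flops_per_call} FLOP per call, {implied_seconds * 1e6:.1f} us per call at 2261 GFLOPS")
```
&lt;/PRE&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;Under that assumption each call lasts only tens of microseconds at the reported rates, so per-call thread startup and synchronization costs may account for a visible share of the measured time.&lt;/P&gt;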

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;I found that the performance of OMP+MKL or TBB MKL is not as good as we expect, and i'm not sure if i miss something with MKL in threaded application.&lt;/P&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;BTW, the pthread+MKL solution is not suitable for our real case , as it will double the threads and make the performance even worse.&lt;/P&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;Thanks in advance.&lt;/P&gt;</description>
      <pubDate>Thu, 02 Aug 2018 01:43:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Performance-issue-in-threaded-application/m-p/1164882#M28174</guid>
      <dc:creator>Xiaohui_Z_Intel</dc:creator>
      <dc:date>2018-08-02T01:43:16Z</dc:date>
    </item>
  </channel>
</rss>

