<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic what is the problem size? in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/performance-problem-of-MKL-in-multithreading-application/m-p/1125873#M25237</link>
    <description>&lt;P&gt;What is the problem size?&lt;/P&gt;

&lt;P&gt;Could you share an example so we can reproduce this on our side?&lt;/P&gt;

&lt;P&gt;Which version of MKL are you using?&lt;/P&gt;

&lt;P&gt;How did you link?&lt;/P&gt;</description>
    <pubDate>Sat, 16 Sep 2017 04:01:14 GMT</pubDate>
    <dc:creator>Gennady_F_Intel</dc:creator>
    <dc:date>2017-09-16T04:01:14Z</dc:date>
    <item>
      <title>performance problem of MKL in multithreading application</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/performance-problem-of-MKL-in-multithreading-application/m-p/1125872#M25236</link>
      <description>&lt;P&gt;Hi there,&lt;/P&gt;

&lt;P&gt;We are using MKL on a Red Hat Linux network server with a Xeon processor that has 32 physical cores (64 logical cores). The application uses a thread pool to handle network requests in parallel. Each request is handled independently. The performance improves with more threads:&lt;/P&gt;

&lt;P&gt;4 threads &amp;nbsp; : 45 seconds&amp;nbsp;&lt;BR /&gt;
	8 threads &amp;nbsp; : 23 seconds&lt;BR /&gt;
	16 threads : 15 seconds&lt;BR /&gt;
	24 threads : 14 seconds&lt;BR /&gt;
	32 threads : 15 seconds&lt;/P&gt;

&lt;P&gt;However, the performance always caps at 16 threads and drops slightly with 32 threads. When I replace the MKL cblas_sgemm function with ATLAS, the performance keeps improving linearly from 1 thread to 32 threads.&lt;/P&gt;
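&lt;P&gt;To put numbers on the plateau, here is a minimal Python sketch that derives speedup and parallel efficiency from the timings listed above. It takes the 4-thread run as the baseline, which is an assumption, since no single-thread MKL time is given. The function name is illustrative.&lt;/P&gt;

```python
# Wall-clock times reported in the post, keyed by thread count (seconds).
timings = {4: 45, 8: 23, 16: 15, 24: 14, 32: 15}

base_threads = 4  # smallest measured configuration, used as the baseline
base_time = timings[base_threads]

def scaling(threads):
    """Return (speedup, efficiency) relative to the 4-thread run."""
    speedup = base_time / timings[threads]
    ideal = threads / base_threads  # speedup under perfect linear scaling
    return speedup, speedup / ideal

for n in sorted(timings):
    s, e = scaling(n)
    print("{:2d} threads: speedup {:.2f}x, efficiency {:.0%}".format(n, s, e))
```

&lt;P&gt;Against that baseline, 16 threads give a 3.0x speedup where 4x would be ideal (75% efficiency), and 32 threads also give 3.0x where 8x would be ideal (37.5%).&lt;/P&gt;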

&lt;P&gt;Limiting the MKL thread count, either by calling mkl_set_num_threads(1) at the beginning of the main function or by setting the environment variable to 1, does not help either; we get the same result. Surprisingly, a multiprocess solution has the same problem. In another experiment, sleeping for a short time before each call to MKL cblas_sgemm gives linear, though not ideal, scaling. It looks like there is some resource contention inside the MKL cblas_sgemm implementation. Or are we missing something here?&lt;/P&gt;

&lt;P&gt;Any comments or suggestions are highly appreciated! Thanks very much in advance!&lt;/P&gt;

&lt;P class="p1"&gt;Thanks,&lt;/P&gt;

&lt;P class="p1"&gt;Yu&lt;/P&gt;</description>
      <pubDate>Fri, 15 Sep 2017 17:23:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/performance-problem-of-MKL-in-multithreading-application/m-p/1125872#M25236</guid>
      <dc:creator>Yu_L_1</dc:creator>
      <dc:date>2017-09-15T17:23:05Z</dc:date>
    </item>
    <item>
      <title>what is the problem size?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/performance-problem-of-MKL-in-multithreading-application/m-p/1125873#M25237</link>
      <description>&lt;P&gt;What is the problem size?&lt;/P&gt;

&lt;P&gt;Could you share an example so we can reproduce this on our side?&lt;/P&gt;

&lt;P&gt;Which version of MKL are you using?&lt;/P&gt;

&lt;P&gt;How did you link?&lt;/P&gt;</description>
      <pubDate>Sat, 16 Sep 2017 04:01:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/performance-problem-of-MKL-in-multithreading-application/m-p/1125873#M25237</guid>
      <dc:creator>Gennady_F_Intel</dc:creator>
      <dc:date>2017-09-16T04:01:14Z</dc:date>
    </item>
    <item>
      <title>Hi Gennady, </title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/performance-problem-of-MKL-in-multithreading-application/m-p/1125874#M25238</link>
      <description>&lt;P&gt;Hi Gennady,&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Thanks for the quick response! Sorry, I can't share the code because of company policy. The problematic part is the first convolution layer, which takes a 256*256*3 image as input. We are using MKL 11.1 and linked with -l/3rdparty/libmkl.a&lt;/P&gt;</description>
      <pubDate>Sat, 16 Sep 2017 13:46:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/performance-problem-of-MKL-in-multithreading-application/m-p/1125874#M25238</guid>
      <dc:creator>Yu_L_1</dc:creator>
      <dc:date>2017-09-16T13:46:18Z</dc:date>
    </item>
    <item>
      <title>Hi Yu, 1/ I am not asking you</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/performance-problem-of-MKL-in-multithreading-application/m-p/1125875#M25239</link>
      <description>&lt;P&gt;Hi Yu,&lt;/P&gt;

&lt;P&gt;1/ I am not asking you to share the private code, but you could create a minimal sgemm example that shows the problem.&lt;/P&gt;

&lt;P&gt;2/ Do I understand correctly that the problem sizes in your case are ~256 x 256?&lt;/P&gt;

&lt;P&gt;3/ Version 11.1 of MKL is 5 years old. Could you try the latest MKL 2017 Update 3 or the newest 2018 release? You can download these binaries for free.&lt;/P&gt;</description>
      <pubDate>Sun, 17 Sep 2017 04:44:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/performance-problem-of-MKL-in-multithreading-application/m-p/1125875#M25239</guid>
      <dc:creator>Gennady_F_Intel</dc:creator>
      <dc:date>2017-09-17T04:44:56Z</dc:date>
    </item>
    <item>
      <title>Hi Gennady, </title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/performance-problem-of-MKL-in-multithreading-application/m-p/1125876#M25240</link>
      <description>&lt;P&gt;Hi Gennady,&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Thanks for the comments! And yes, the input image for the convolution layer is 256*256 with 3 channels, but the actual problem size might be much bigger, depending on the neural network algorithm. I tried MKL 2018 on the machine and got the same result. This time, though, VTune gave us a very clear report about the memory bandwidth. After some more digging, we think the bottleneck is the memory bandwidth. MKL does an excellent job of optimization and utilizes the memory bandwidth more fully than ATLAS does. That's probably why the CPU usage always caps at 16. It also explains why the multiprocess solution has the same problem. Thanks again :)&lt;/P&gt;
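&lt;P&gt;One rough way to see why a GEMM can end up limited by memory bandwidth is its arithmetic intensity, the number of floating-point operations per byte moved. The Python sketch below is only an illustration. It assumes the convolution is lowered to a single sgemm call (im2col style), and the layer shape (64 filters of 3x3x3) is hypothetical, since the thread does not give the real sizes.&lt;/P&gt;

```python
def sgemm_intensity(m, n, k):
    """FLOPs per byte for C = A*B in single precision, counting each matrix once."""
    flops = 2 * m * n * k                      # one multiply and one add per term
    bytes_moved = 4 * (m * k + k * n + m * n)  # read A and B, write C; 4 bytes per float
    return flops / bytes_moved

# Hypothetical first conv layer lowered to one GEMM (im2col): 64 filters of
# 3x3x3 over a 256x256 output. These shapes are assumptions, not from the thread.
conv_like = sgemm_intensity(64, 256 * 256, 3 * 3 * 3)

# A large square GEMM for comparison.
square = sgemm_intensity(4096, 4096, 4096)

print("conv-like GEMM: {:.1f} FLOPs/byte".format(conv_like))
print("square GEMM:    {:.1f} FLOPs/byte".format(square))
```

&lt;P&gt;With these assumed shapes, the conv-like GEMM performs about 9.5 FLOPs per byte, against roughly 683 for the square case. Low-intensity calls of this kind are the ones most likely to saturate memory bandwidth as more threads are added.&lt;/P&gt;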

&lt;P&gt;Thanks,&lt;BR /&gt;
Yu&lt;/P&gt;</description>
      <pubDate>Mon, 18 Sep 2017 08:15:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/performance-problem-of-MKL-in-multithreading-application/m-p/1125876#M25240</guid>
      <dc:creator>Yu_L_1</dc:creator>
      <dc:date>2017-09-18T08:15:50Z</dc:date>
    </item>
  </channel>
</rss>

