<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Warming up strategy for MIC dgemm call in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Warming-up-strategy-for-MIC-dgemm-call/m-p/1004306#M18805</link>
    <description>&lt;P&gt;In my computation, I manually offload some work to the MIC using offload pragmas. The offloaded work includes a call to MKL's double-precision general matrix-matrix multiplication (dgemm). Work is divided between the host CPU and the MIC according to a performance model. The model relies on DGEMM performance (in GFlop/s), which is recorded offline by running a microbenchmark over various operand sizes (m, n and k).&lt;/P&gt;

&lt;P&gt;Before the actual computation starts, I run a warm-up dgemm call on the largest operand sizes I will encounter in the computation (in my case m = n ~ 10000 and k ~ 200). Even after the warm-up call, I observe unexpectedly low performance for some dgemm calls.&lt;/P&gt;

&lt;P&gt;k0 =2, m 2405 n 903 ,k 192, flop rate 67.2766&lt;BR /&gt;
	k0 =2, m 2405 n 903 ,k 192, flop rate 440.115&lt;BR /&gt;
	k0 =17, m 2422 n 1066 ,k 192, flop rate 67.5244&lt;BR /&gt;
	k0 =17, m 2422 n 1066 ,k 192, flop rate 599.45&lt;BR /&gt;
	k0 =346, m 2812 n 1280 ,k 2, flop rate 1.49697&lt;BR /&gt;
	k0 =346, m 2812 n 1280 ,k 2, flop rate 15.2189&lt;/P&gt;

&lt;P&gt;Above are some of the anomalous results I observed. m, n and k are the dimensions of the dgemm call (k0 is the iteration number, which is irrelevant to this discussion). Note that I run each call twice, and the second time the measured flop rate agrees well with the estimated value. In the real computation, however, I may not have the option of running each dgemm twice.&lt;/P&gt;

&lt;P&gt;I am trying to understand what might cause this behaviour. Can the anomaly be mitigated by warming up dgemm for different sizes? If so, what sizes should I run for the warm-up, and what is the minimum number of calls required? (I am presently proceeding by trial and error, assuming the anomaly can be mitigated by a series of warm-up calls of suitable sizes.)&lt;/P&gt;

&lt;P&gt;(The computation is iterative, so a large number of offloads are performed. If I incorrectly estimate the time taken by the computation on the MIC, this may cause a load imbalance between the host CPU and the MIC, which may have a cascading effect on subsequent iterations due to the nature of the computation.)&lt;/P&gt;</description>
    <pubDate>Thu, 25 Sep 2014 10:40:09 GMT</pubDate>
    <dc:creator>piyush_s_</dc:creator>
    <dc:date>2014-09-25T10:40:09Z</dc:date>
    <item>
      <title>Warming up strategy for MIC dgemm call</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Warming-up-strategy-for-MIC-dgemm-call/m-p/1004306#M18805</link>
      <description>&lt;P&gt;In my computation, I manually offload some work to the MIC using offload pragmas. The offloaded work includes a call to MKL's double-precision general matrix-matrix multiplication (dgemm). Work is divided between the host CPU and the MIC according to a performance model. The model relies on DGEMM performance (in GFlop/s), which is recorded offline by running a microbenchmark over various operand sizes (m, n and k).&lt;/P&gt;

&lt;P&gt;Before the actual computation starts, I run a warm-up dgemm call on the largest operand sizes I will encounter in the computation (in my case m = n ~ 10000 and k ~ 200). Even after the warm-up call, I observe unexpectedly low performance for some dgemm calls.&lt;/P&gt;

&lt;P&gt;k0 =2, m 2405 n 903 ,k 192, flop rate 67.2766&lt;BR /&gt;
	k0 =2, m 2405 n 903 ,k 192, flop rate 440.115&lt;BR /&gt;
	k0 =17, m 2422 n 1066 ,k 192, flop rate 67.5244&lt;BR /&gt;
	k0 =17, m 2422 n 1066 ,k 192, flop rate 599.45&lt;BR /&gt;
	k0 =346, m 2812 n 1280 ,k 2, flop rate 1.49697&lt;BR /&gt;
	k0 =346, m 2812 n 1280 ,k 2, flop rate 15.2189&lt;/P&gt;

&lt;P&gt;Above are some of the anomalous results I observed. m, n and k are the dimensions of the dgemm call (k0 is the iteration number, which is irrelevant to this discussion). Note that I run each call twice, and the second time the measured flop rate agrees well with the estimated value. In the real computation, however, I may not have the option of running each dgemm twice.&lt;/P&gt;
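
&lt;P&gt;One way to read the log above is to convert each reported rate back into an elapsed time. The sketch below does that under two assumptions that are not stated explicitly in the post: the rates are in GFlop/s, and the flop count of a dgemm call is taken as 2*m*n*k. The helper names are illustrative only.&lt;/P&gt;

```python
# Sketch only: recover per-call times from the reported flop rates.
# Assumptions: rates are in GFlop/s and a dgemm call counts as 2*m*n*k flops.

def dgemm_flops(m, n, k):
    """Conventional flop count for C = A*B with A (m x k) and B (k x n)."""
    return 2 * m * n * k


def seconds_from_rate(m, n, k, gflops):
    """Invert rate = flops / time to get the elapsed time in seconds."""
    return dgemm_flops(m, n, k) / (gflops * 1e9)


# (m, n, k, first-call rate, second-call rate), copied from the log above.
runs = [
    (2405, 903, 192, 67.2766, 440.115),
    (2422, 1066, 192, 67.5244, 599.45),
    (2812, 1280, 2, 1.49697, 15.2189),
]


def first_call_penalties_ms(rows):
    """Extra time of the first call relative to the second, in milliseconds."""
    penalties = []
    for m, n, k, first_rate, second_rate in rows:
        first_s = seconds_from_rate(m, n, k, first_rate)
        second_s = seconds_from_rate(m, n, k, second_rate)
        penalties.append((first_s - second_s) * 1e3)
    return penalties


penalties_ms = first_call_penalties_ms(runs)
for (m, n, k, _, _), extra in zip(runs, penalties_ms):
    print(f"m={m} n={n} k={k}: first call slower by {extra:.2f} ms")
```

&lt;P&gt;Under these assumptions the extra first-call time is on the order of 10 ms in all three cases, even though the flop counts differ by more than a factor of 50. That would be consistent with a roughly fixed one-off cost per call rather than a slower kernel, but it is only an inference from three data points.&lt;/P&gt;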

&lt;P&gt;I am trying to understand what might cause this behaviour. Can the anomaly be mitigated by warming up dgemm for different sizes? If so, what sizes should I run for the warm-up, and what is the minimum number of calls required? (I am presently proceeding by trial and error, assuming the anomaly can be mitigated by a series of warm-up calls of suitable sizes.)&lt;/P&gt;
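
&lt;P&gt;The trial-and-error search could be organized as a small sweep that times two back-to-back calls per shape and reports the shapes whose first call is clearly slower. This is a minimal sketch, not an MKL facility: gemm is a hypothetical callable that wraps the offloaded dgemm for a given (m, n, k), and the function names and thresholds are made up for illustration.&lt;/P&gt;

```python
import time

# Sketch only: `gemm` is a hypothetical wrapper around the offloaded dgemm call.
# Nothing here is an MKL API; names and thresholds are illustrative assumptions.


def first_call_penalty(gemm, shapes, clock=time.perf_counter):
    """Time two back-to-back calls per shape.

    Returns a list of ((m, n, k), first_seconds, second_seconds).
    """
    results = []
    for m, n, k in shapes:
        t0 = clock()
        gemm(m, n, k)
        t1 = clock()
        gemm(m, n, k)
        t2 = clock()
        results.append(((m, n, k), t1 - t0, t2 - t1))
    return results


def shapes_needing_warmup(results, ratio=2.0, min_penalty_s=1e-3):
    """Keep shapes whose first call is both relatively and absolutely slower."""
    flagged = []
    for shape, first_s, second_s in results:
        slower_by_ratio = first_s >= ratio * second_s
        slower_by_time = (first_s - second_s) >= min_penalty_s
        if slower_by_ratio and slower_by_time:
            flagged.append(shape)
    return flagged
```

&lt;P&gt;Running the sweep once in a fresh process, and again after a candidate warm-up sequence, would show whether that sequence actually removes the first-call penalty for the shapes of interest.&lt;/P&gt;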

&lt;P&gt;(The computation is iterative, so a large number of offloads are performed. If I incorrectly estimate the time taken by the computation on the MIC, this may cause a load imbalance between the host CPU and the MIC, which may have a cascading effect on subsequent iterations due to the nature of the computation.)&lt;/P&gt;</description>
      <pubDate>Thu, 25 Sep 2014 10:40:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Warming-up-strategy-for-MIC-dgemm-call/m-p/1004306#M18805</guid>
      <dc:creator>piyush_s_</dc:creator>
      <dc:date>2014-09-25T10:40:09Z</dc:date>
    </item>
    <item>
      <title>Small values of k definitely</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Warming-up-strategy-for-MIC-dgemm-call/m-p/1004307#M18806</link>
      <description>&lt;P&gt;Small values of k will definitely limit the performance of MIC DGEMM. In a relatively naive implementation, the k value would limit the number of threads. Even though the current MIC DGEMM apparently has a means of using more threads than the value of k, it doesn't seem to be as effective as when k is several times the number of threads.&lt;/P&gt;

&lt;P&gt;The recommended Automatic Offload scheme is supposed to keep the DGEMM on the host when m, n, or k aren't large enough to overcome the overhead of offloading.&lt;/P&gt;

&lt;P&gt;We have observed a warm-up effect in MIC native operation as well. It seemed to be associated with serialization of memory allocation.&lt;/P&gt;</description>
      <pubDate>Thu, 25 Sep 2014 11:15:11 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Warming-up-strategy-for-MIC-dgemm-call/m-p/1004307#M18806</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2014-09-25T11:15:11Z</dc:date>
    </item>
    <item>
      <title>I understand if I get 16 GF/s</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Warming-up-strategy-for-MIC-dgemm-call/m-p/1004308#M18807</link>
      <description>&lt;P&gt;I would understand getting 16 GF/s for k=2 (it is a memory-bandwidth-bound computation and might use only 1/4 of the SIMD width), but not 1.5 GF/s. Coming back to my original question: given that I will encounter many dgemm calls with sizes 0&amp;lt;m,n&amp;lt;10000 and 0&amp;lt;k&amp;lt;200, what can I do to prevent such anomalous performance?&lt;/P&gt;</description>
      <pubDate>Thu, 25 Sep 2014 19:33:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Warming-up-strategy-for-MIC-dgemm-call/m-p/1004308#M18807</guid>
      <dc:creator>piyush_s_</dc:creator>
      <dc:date>2014-09-25T19:33:54Z</dc:date>
    </item>
  </channel>
</rss>

