<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Not knowing whether you are in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Benchmarking-algorithms-on-Intel-Xeon-Gold-DevCloud/m-p/1163355#M7977</link>
    <description>&lt;P&gt;I don't know whether you are looking for something that applies to all Intel CPUs, but yes, we can confirm that the Intel Xeon Phi KNC was particularly slow to set up data structures on first use (easily an extra half second or so). Rather than running a large number of repetitions and averaging the fast and slow iterations, consider the usual tactic of running 1 or 2 warm-up iterations before the timed code. Any CPU is likely to incur more last-level cache misses the first time a data region is touched, so you must decide whether to include those misses in your benchmark timing or, where possible, exclude them.&lt;/P&gt;</description>
    <pubDate>Tue, 03 Apr 2018 03:57:40 GMT</pubDate>
    <dc:creator>TimP</dc:creator>
    <dc:date>2018-04-03T03:57:40Z</dc:date>
    <item>
      <title>Benchmarking algorithms on Intel Xeon Gold (DevCloud)</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Benchmarking-algorithms-on-Intel-Xeon-Gold-DevCloud/m-p/1163354#M7976</link>
      <description>&lt;P style="margin-bottom: 0px; border: 0px; font-size: 14px; font-family: intel-clear, arial, helvetica, &amp;quot;helvetica neue&amp;quot;, verdana, sans-serif; vertical-align: baseline; color: rgb(61, 61, 61);"&gt;This post is regarding benchmarking algorithms on the Intel Xeon processors.&lt;/P&gt;

&lt;P style="margin-bottom: 0px; border: 0px; font-size: 14px; font-family: intel-clear, arial, helvetica, &amp;quot;helvetica neue&amp;quot;, verdana, sans-serif; vertical-align: baseline; color: rgb(61, 61, 61);"&gt;&lt;A class="jive-link-external-small" href="https://communities.intel.com/external-link.jspa?url=https%3A%2F%2Fsoftware.intel.com%2Fen-us%2Farticles%2Fperformance-of-classic-matrix-multiplication-algorithm-on-intel-xeon-phi-processor-system" rel="nofollow" style="padding-right: calc(12px + 0.35ex); border: 0px; font-weight: inherit; font-style: inherit; vertical-align: baseline; color: rgb(0, 113, 197);" target="_blank"&gt;Performance of Classic Matrix Multiplication Algorithm on Intel® Xeon Phi™ Processor System | Intel® Software&lt;/A&gt;&lt;/P&gt;

&lt;P style="margin-bottom: 1.06667em; border: 0px; font-size: 15px; font-family: &amp;quot;Helvetica Neue&amp;quot;, Helvetica, Arial, sans-serif; vertical-align: baseline; color: rgb(85, 85, 85);"&gt;I have been trying to reproduce the benchmarks using the code from the article above, specifically mmatest1.c from the zip file attached to the article. One observation is that there is a considerable warm-up time, which adds a large overhead to the first algorithm benchmarked (in this case, the cblas_sgemm function).&lt;/P&gt;

&lt;P style="margin-bottom: 1.06667em; border: 0px; font-size: 15px; font-family: &amp;quot;Helvetica Neue&amp;quot;, Helvetica, Arial, sans-serif; vertical-align: baseline; color: rgb(85, 85, 85);"&gt;A loop count of 16 is often not enough to amortize the thread 'warm-up' time. I am not sure what the correct term for this is.&lt;/P&gt;

&lt;OL&gt;
	&lt;LI style="margin-bottom: 1.06667em; border: 0px; font-size: 15px; font-family: &amp;quot;Helvetica Neue&amp;quot;, Helvetica, Arial, sans-serif; vertical-align: baseline; color: rgb(85, 85, 85);"&gt;&lt;STRONG&gt;Can anyone confirm this? When benchmarking, is it better to run a 'warm-up' kernel on the threads first?&lt;/STRONG&gt;&lt;/LI&gt;
	&lt;LI style="margin-bottom: 1.06667em; border: 0px; font-size: 15px; font-family: &amp;quot;Helvetica Neue&amp;quot;, Helvetica, Arial, sans-serif; vertical-align: baseline; color: rgb(85, 85, 85);"&gt;&lt;STRONG&gt;Where can I read more about this?&lt;/STRONG&gt;&lt;/LI&gt;
	&lt;LI style="margin-bottom: 1.06667em; border: 0px; font-size: 15px; font-family: &amp;quot;Helvetica Neue&amp;quot;, Helvetica, Arial, sans-serif; vertical-align: baseline; color: rgb(85, 85, 85);"&gt;&lt;STRONG&gt;Can anyone also suggest the best way, algorithm, or function to access submatrices of size (MxM) from a larger matrix?&lt;/STRONG&gt;&lt;/LI&gt;
&lt;/OL&gt;

&lt;P style="margin-bottom: 1.06667em; border: 0px; font-size: 15px; font-family: &amp;quot;Helvetica Neue&amp;quot;, Helvetica, Arial, sans-serif; vertical-align: baseline; color: rgb(85, 85, 85);"&gt;To review my code, kindly refer to:&amp;nbsp;&lt;A class="jive-link-external-small" href="https://communities.intel.com/external-link.jspa?url=https%3A%2F%2Fgithub.com%2Fakhauriyash%2FXNOR-Nets" rel="nofollow" style="padding-right: calc(12px + 0.35ex); border: 0px; font-weight: inherit; font-style: inherit; vertical-align: baseline; color: rgb(0, 113, 197); font-family: intel-clear, arial, helvetica, &amp;quot;helvetica neue&amp;quot;, verdana, sans-serif !important;" target="_blank"&gt;GitHub - akhauriyash/XNOR-Nets: An OpenMP parallelized implementation of XNOR kernels.&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 02 Apr 2018 05:40:11 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Benchmarking-algorithms-on-Intel-Xeon-Gold-DevCloud/m-p/1163354#M7976</guid>
      <dc:creator>YAkha</dc:creator>
      <dc:date>2018-04-02T05:40:11Z</dc:date>
    </item>
    <item>
      <title>Not knowing whether you are</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Benchmarking-algorithms-on-Intel-Xeon-Gold-DevCloud/m-p/1163355#M7977</link>
      <description>&lt;P&gt;I don't know whether you are looking for something that applies to all Intel CPUs, but yes, we can confirm that the Intel Xeon Phi KNC was particularly slow to set up data structures on first use (easily an extra half second or so). Rather than running a large number of repetitions and averaging the fast and slow iterations, consider the usual tactic of running 1 or 2 warm-up iterations before the timed code. Any CPU is likely to incur more last-level cache misses the first time a data region is touched, so you must decide whether to include those misses in your benchmark timing or, where possible, exclude them.&lt;/P&gt;</description>
      <pubDate>Tue, 03 Apr 2018 03:57:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Benchmarking-algorithms-on-Intel-Xeon-Gold-DevCloud/m-p/1163355#M7977</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2018-04-03T03:57:40Z</dc:date>
    </item>
  </channel>
</rss>

