<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Sebastian, in Software Archive</title>
    <link>https://community.intel.com/t5/Software-Archive/OpenCL-and-Bandwidth/m-p/921275#M13391</link>
    <description>&lt;P&gt;Sebastian,&lt;/P&gt;

&lt;P&gt;Thanks for your question.&lt;/P&gt;

&lt;P&gt;I've looked at the code and noticed that the memory accesses are strided with a large stride. Xeon Phi performs best with a consecutive memory access pattern.&lt;/P&gt;

&lt;P&gt;What local and global sizes did you use in your measurements?&lt;/P&gt;

&lt;P&gt;A more efficient approach for Xeon Phi would be:&lt;/P&gt;

&lt;DIV class="line" id="LC93"&gt;&lt;SPAN class="n"&gt;AURA_KERNEL&lt;/SPAN&gt; &lt;SPAN class="kt"&gt;void&lt;/SPAN&gt; &lt;SPAN class="nf"&gt;peak_copy&lt;/SPAN&gt;&lt;SPAN class="p"&gt;(&lt;/SPAN&gt;&lt;SPAN class="n"&gt;AURA_GLOBAL&lt;/SPAN&gt; &lt;SPAN class="kt"&gt;float&lt;/SPAN&gt; &lt;SPAN class="o"&gt;*&lt;/SPAN&gt; &lt;SPAN class="n"&gt;dst&lt;/SPAN&gt;&lt;SPAN class="p"&gt;,&lt;/SPAN&gt; &lt;SPAN class="n"&gt;AURA_GLOBAL&lt;/SPAN&gt; &lt;SPAN class="kt"&gt;float&lt;/SPAN&gt; &lt;SPAN class="o"&gt;*&lt;/SPAN&gt; &lt;SPAN class="n"&gt;src&lt;/SPAN&gt;&lt;SPAN class="p"&gt;)&lt;/SPAN&gt; &lt;SPAN class="p"&gt;{&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;DIV class="line" id="LC94"&gt;&amp;nbsp;&amp;nbsp;&lt;SPAN class="kt"&gt;int&lt;/SPAN&gt; &lt;SPAN class="n"&gt;id&lt;/SPAN&gt; &lt;SPAN class="o"&gt;=&lt;/SPAN&gt; &lt;SPAN class="n"&gt;get_global_id&lt;/SPAN&gt;&lt;SPAN class="p"&gt;(0);&amp;nbsp;&amp;nbsp; //can be extended for multiple dimensions&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;DIV class="line" id="LC95"&gt;&amp;nbsp;&amp;nbsp;&lt;SPAN class="n"&gt;dst&lt;/SPAN&gt;&lt;SPAN class="p"&gt;[&lt;/SPAN&gt;&lt;SPAN class="n"&gt;id&lt;/SPAN&gt;&lt;SPAN class="p"&gt;]&lt;/SPAN&gt; &lt;SPAN class="o"&gt;=&lt;/SPAN&gt; &lt;SPAN class="n"&gt;src&lt;/SPAN&gt;&lt;SPAN class="p"&gt;[&lt;/SPAN&gt;&lt;SPAN class="n"&gt;id&lt;/SPAN&gt;&lt;SPAN class="p"&gt;];&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;DIV class="line" id="LC98"&gt;&amp;nbsp;&lt;SPAN class="p"&gt;}&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;P&gt;Please use a large local size (the maximum supported is 8K). Please make sure to create enough work-groups (at the bare minimum, as many as there are compute units).&lt;/P&gt;

&lt;P&gt;Please update here with your findings.&lt;/P&gt;

&lt;P&gt;Arik&lt;/P&gt;</description>
    <pubDate>Sun, 01 Dec 2013 11:39:57 GMT</pubDate>
    <dc:creator>Arik_N_Intel</dc:creator>
    <dc:date>2013-12-01T11:39:57Z</dc:date>
    <item>
      <title>OpenCL and Bandwidth</title>
      <link>https://community.intel.com/t5/Software-Archive/OpenCL-and-Bandwidth/m-p/921272#M13388</link>
      <description>&lt;P&gt;I'm trying to get maximum (or at least high) memory bandwidth with a STREAM-like benchmark based on OpenCL. The best I am able to achieve on Intel MIC seems to be about 35 GB/s. With the same benchmark on Nvidia Titan and AMD W9000 I get close to the peak performance.&lt;/P&gt;

&lt;P&gt;Has anybody implemented a STREAM-like benchmark for Intel MIC using OpenCL and seen good performance?&lt;/P&gt;

&lt;P&gt;Thanks, Sebastian&lt;/P&gt;</description>
      <pubDate>Wed, 27 Nov 2013 14:17:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/OpenCL-and-Bandwidth/m-p/921272#M13388</guid>
      <dc:creator>Sebastian_S_</dc:creator>
      <dc:date>2013-11-27T14:17:33Z</dc:date>
    </item>
    <item>
      <title>Just as an update, the kernel</title>
      <link>https://community.intel.com/t5/Software-Archive/OpenCL-and-Bandwidth/m-p/921273#M13389</link>
      <description>&lt;P&gt;Just as an update, the kernel code I used can be found here: &lt;A href="https://github.com/sschaetz/aura/blob/a72fbf56470c553794f0d20da1354d31c7a925be/bench/peak.cc" target="_blank"&gt;https://github.com/sschaetz/aura/blob/a72fbf56470c553794f0d20da1354d31c7a925be/bench/peak.cc&lt;/A&gt; (kernels &lt;SPAN class="nf"&gt;peak_copy&lt;/SPAN&gt;, peak_scale etc).&lt;/P&gt;</description>
      <pubDate>Thu, 28 Nov 2013 08:00:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/OpenCL-and-Bandwidth/m-p/921273#M13389</guid>
      <dc:creator>Sebastian_S_</dc:creator>
      <dc:date>2013-11-28T08:00:20Z</dc:date>
    </item>
    <item>
      <title>Sebastian,</title>
      <link>https://community.intel.com/t5/Software-Archive/OpenCL-and-Bandwidth/m-p/921274#M13390</link>
      <description>&lt;P&gt;Sebastian,&lt;/P&gt;

&lt;P&gt;Things are pretty quiet here so I won't be able to get you an answer until next week.&lt;/P&gt;

&lt;P&gt;--&lt;BR /&gt;
	Taylor&lt;BR /&gt;
	&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 30 Nov 2013 01:47:11 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/OpenCL-and-Bandwidth/m-p/921274#M13390</guid>
      <dc:creator>TaylorIoTKidd</dc:creator>
      <dc:date>2013-11-30T01:47:11Z</dc:date>
    </item>
    <item>
      <title>Sebastian,</title>
      <link>https://community.intel.com/t5/Software-Archive/OpenCL-and-Bandwidth/m-p/921275#M13391</link>
      <description>&lt;P&gt;Sebastian,&lt;/P&gt;

&lt;P&gt;Thanks for your question.&lt;/P&gt;

&lt;P&gt;I've looked at the code and noticed that the memory accesses are strided with a large stride. Xeon Phi performs best with a consecutive memory access pattern.&lt;/P&gt;

&lt;P&gt;What local and global sizes did you use in your measurements?&lt;/P&gt;

&lt;P&gt;A more efficient approach for Xeon Phi would be:&lt;/P&gt;

&lt;DIV class="line" id="LC93"&gt;&lt;SPAN class="n"&gt;AURA_KERNEL&lt;/SPAN&gt; &lt;SPAN class="kt"&gt;void&lt;/SPAN&gt; &lt;SPAN class="nf"&gt;peak_copy&lt;/SPAN&gt;&lt;SPAN class="p"&gt;(&lt;/SPAN&gt;&lt;SPAN class="n"&gt;AURA_GLOBAL&lt;/SPAN&gt; &lt;SPAN class="kt"&gt;float&lt;/SPAN&gt; &lt;SPAN class="o"&gt;*&lt;/SPAN&gt; &lt;SPAN class="n"&gt;dst&lt;/SPAN&gt;&lt;SPAN class="p"&gt;,&lt;/SPAN&gt; &lt;SPAN class="n"&gt;AURA_GLOBAL&lt;/SPAN&gt; &lt;SPAN class="kt"&gt;float&lt;/SPAN&gt; &lt;SPAN class="o"&gt;*&lt;/SPAN&gt; &lt;SPAN class="n"&gt;src&lt;/SPAN&gt;&lt;SPAN class="p"&gt;)&lt;/SPAN&gt; &lt;SPAN class="p"&gt;{&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;DIV class="line" id="LC94"&gt;&amp;nbsp;&amp;nbsp;&lt;SPAN class="kt"&gt;int&lt;/SPAN&gt; &lt;SPAN class="n"&gt;id&lt;/SPAN&gt; &lt;SPAN class="o"&gt;=&lt;/SPAN&gt; &lt;SPAN class="n"&gt;get_global_id&lt;/SPAN&gt;&lt;SPAN class="p"&gt;(0);&amp;nbsp;&amp;nbsp; //can be extended for multiple dimensions&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;DIV class="line" id="LC95"&gt;&amp;nbsp;&amp;nbsp;&lt;SPAN class="n"&gt;dst&lt;/SPAN&gt;&lt;SPAN class="p"&gt;[&lt;/SPAN&gt;&lt;SPAN class="n"&gt;id&lt;/SPAN&gt;&lt;SPAN class="p"&gt;]&lt;/SPAN&gt; &lt;SPAN class="o"&gt;=&lt;/SPAN&gt; &lt;SPAN class="n"&gt;src&lt;/SPAN&gt;&lt;SPAN class="p"&gt;[&lt;/SPAN&gt;&lt;SPAN class="n"&gt;id&lt;/SPAN&gt;&lt;SPAN class="p"&gt;];&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;DIV class="line" id="LC98"&gt;&amp;nbsp;&lt;SPAN class="p"&gt;}&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;P&gt;Please use a large local size (the maximum supported is 8K). Please make sure to create enough work-groups (at the bare minimum, as many as there are compute units).&lt;/P&gt;
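
&lt;P&gt;As a rough illustration of the sizing advice above, here is a minimal Python sketch (plain arithmetic, not OpenCL host code) that counts the work-groups of a 1-D launch and checks the count against the number of compute units. The helper names, the 1024 * 1024 global size and the value 236 for the compute units are hypothetical and only serve the example; on a real device the compute-unit count would come from querying CL_DEVICE_MAX_COMPUTE_UNITS, and 8K is assumed here to mean 8192.&lt;/P&gt;

&lt;PRE&gt;
```python
def num_work_groups(global_size, local_size):
    """Number of work-groups in a 1-D launch; sizes must divide evenly."""
    if global_size % local_size != 0:
        raise ValueError("global size must be a multiple of local size")
    return global_size // local_size


def enough_work_groups(global_size, local_size, compute_units):
    """True if there is at least one work-group per compute unit."""
    return num_work_groups(global_size, local_size) >= compute_units


GLOBAL_SIZE = 1024 * 1024   # hypothetical launch size, not from the thread
COMPUTE_UNITS = 236         # hypothetical; query the device for the real value

if __name__ == "__main__":
    # Compare a few candidate local sizes for the assumed launch.
    for local_size in (8192, 1024, 16):
        groups = num_work_groups(GLOBAL_SIZE, local_size)
        ok = enough_work_groups(GLOBAL_SIZE, local_size, COMPUTE_UNITS)
        print(local_size, groups, ok)
```
&lt;/PRE&gt;

&lt;P&gt;With these assumed numbers, a local size of 8192 yields only 128 work-groups, fewer than the assumed 236 compute units, so the two recommendations have to be balanced against the total amount of work.&lt;/P&gt;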

&lt;P&gt;Please update here with your findings.&lt;/P&gt;

&lt;P&gt;Arik&lt;/P&gt;</description>
      <pubDate>Sun, 01 Dec 2013 11:39:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/OpenCL-and-Bandwidth/m-p/921275#M13391</guid>
      <dc:creator>Arik_N_Intel</dc:creator>
      <dc:date>2013-12-01T11:39:57Z</dc:date>
    </item>
    <item>
      <title>Thanks for your answers. I</title>
      <link>https://community.intel.com/t5/Software-Archive/OpenCL-and-Bandwidth/m-p/921276#M13392</link>
      <description>&lt;P&gt;Thanks for your answers. I tested a few new things. I now get about 100 GB/s using the following kernel, which uses a block trick:&lt;/P&gt;

&lt;P&gt;AURA_KERNEL void peak_copy(AURA_GLOBAL float * dst, AURA_GLOBAL float * src) {&lt;BR /&gt;
	&amp;nbsp; const int bsize = 32;&lt;BR /&gt;
	&amp;nbsp; const int mult = 64;&lt;BR /&gt;
	&amp;nbsp; int id = (get_mesh_id() / bsize)*bsize*mult + get_mesh_id() % bsize;&lt;BR /&gt;
	&amp;nbsp; for(int32_t i=0; i&amp;lt;mult; i++) {&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp; dst[id + i * bsize] = src[id + i * bsize];&lt;BR /&gt;
	&amp;nbsp; }&lt;BR /&gt;
	}&lt;/P&gt;

&lt;P&gt;I launch 1024x1024 threads with a work group size between 16 and 1024.&lt;/P&gt;
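
&lt;P&gt;To make the index arithmetic of the blocked kernel easier to follow, here is a small Python sketch that mirrors it on the host. It assumes that get_mesh_id() returns the flat global work-item index and that elements are 4-byte floats; the function names are hypothetical. It shows which elements each work-item copies, checks that a launch touches every element exactly once, and computes the bytes moved by one copy.&lt;/P&gt;

&lt;PRE&gt;
```python
BSIZE = 32   # block width, as in the kernel above
MULT = 64    # elements copied per work-item, as in the kernel above


def base_index(gid, bsize=BSIZE, mult=MULT):
    """First element of work-item gid (mirrors the kernel's id expression)."""
    return (gid // bsize) * bsize * mult + gid % bsize


def touched_indices(gid, bsize=BSIZE, mult=MULT):
    """All element indices that work-item gid copies in its loop."""
    start = base_index(gid, bsize, mult)
    return [start + i * bsize for i in range(mult)]


def covers_exactly_once(num_items, bsize=BSIZE, mult=MULT):
    """True if num_items work-items touch elements 0..num_items*mult-1 once each."""
    seen = sorted(
        idx
        for gid in range(num_items)
        for idx in touched_indices(gid, bsize, mult)
    )
    return seen == list(range(num_items * mult))


def bytes_moved(num_items, mult=MULT, elem_size=4):
    """Bytes read plus bytes written by one copy launch (assumes 4-byte floats)."""
    return 2 * num_items * mult * elem_size


if __name__ == "__main__":
    print(touched_indices(0)[:3])      # first elements copied by work-item 0
    print(base_index(32))              # where the second block of 32 items starts
    print(covers_exactly_once(64))     # small launch: two blocks of 32 items
    print(bytes_moved(1024 * 1024))    # traffic for the 1024x1024 launch
```
&lt;/PRE&gt;

&lt;P&gt;Within a block of 32 work-items, neighbouring work-items start at neighbouring elements, so each loop iteration reads and writes one contiguous run of 32 floats per block. Dividing the bytes moved by the measured kernel time gives the bandwidth figure.&lt;/P&gt;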

&lt;P&gt;Arik Narkis, with your approach I get about 80GB/s. Is there an OpenCL kernel somewhere that gets peak bandwidth? I'd really like to start from something like that, I'm not getting anywhere currently.&lt;/P&gt;</description>
      <pubDate>Sun, 01 Dec 2013 16:42:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/OpenCL-and-Bandwidth/m-p/921276#M13392</guid>
      <dc:creator>Sebastian_S_</dc:creator>
      <dc:date>2013-12-01T16:42:38Z</dc:date>
    </item>
  </channel>
</rss>

