<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Jeffrey, in OpenCL* for CPU</title>
    <link>https://community.intel.com/t5/OpenCL-for-CPU/gpu-cache-friendly/m-p/1088518#M4859</link>
    <description>&lt;P&gt;Jeffrey,&lt;/P&gt;

&lt;P&gt;The only explanation I have for this: the first kernel brings in more data but uses only half of it. The metric that matters here is the LLC-to-L3 cache bandwidth, which is better for the second kernel.&lt;/P&gt;</description>
    <pubDate>Tue, 07 Jun 2016 22:32:37 GMT</pubDate>
    <dc:creator>Robert_I_Intel</dc:creator>
    <dc:date>2016-06-07T22:32:37Z</dc:date>
    <item>
      <title>gpu cache friendly</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/gpu-cache-friendly/m-p/1088515#M4856</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;I would like to understand how the GPU cache works. Specifically, I have two questions:&lt;/P&gt;

&lt;P&gt;- Does the GPU hardware perform prefetching?&lt;/P&gt;

&lt;P&gt;- Of the two kernels below, which one is more cache efficient, or is there not much difference? By cache efficient, I mean a higher transfer rate from the cache to the EUs.&lt;/P&gt;

&lt;P&gt;Kernel 1:&lt;/P&gt;

&lt;P&gt;__kernel void foo1(__global uchar* src, __global uchar* dst)&lt;/P&gt;

&lt;P&gt;{&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; int i = get_global_id(0);&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; dst[i] = src[i] &amp;gt;&amp;gt; 1;&lt;/P&gt;

&lt;P&gt;}&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P style="font-size: 13.008px; line-height: 19.512px;"&gt;Kernel 2:&lt;/P&gt;

&lt;P style="font-size: 13.008px; line-height: 19.512px;"&gt;__kernel void foo2(__global uchar16* src, __global uchar16* dst)&lt;/P&gt;

&lt;P style="font-size: 13.008px; line-height: 19.512px;"&gt;{&lt;/P&gt;

&lt;P style="font-size: 13.008px; line-height: 19.512px;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; int i = get_global_id(0);&lt;/P&gt;

&lt;P style="font-size: 13.008px; line-height: 19.512px;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; dst[i] = src[i] &amp;gt;&amp;gt; (uchar) 1;&lt;/P&gt;

&lt;P style="font-size: 13.008px; line-height: 19.512px;"&gt;}&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 07 Jun 2016 16:00:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/gpu-cache-friendly/m-p/1088515#M4856</guid>
      <dc:creator>Fu_J_Intel</dc:creator>
      <dc:date>2016-06-07T16:00:23Z</dc:date>
    </item>
    <item>
      <title>Jeffrey,</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/gpu-cache-friendly/m-p/1088516#M4857</link>
      <description>&lt;P&gt;Jeffrey,&lt;/P&gt;

&lt;P&gt;The second version is much more efficient at bringing data in and writing it out.&lt;/P&gt;

&lt;P&gt;The first version reads only half a cache line per hardware thread (with SIMD32 compilation, 32 work-items × 1 byte = 32 bytes out of 64 bytes).&lt;/P&gt;

&lt;P&gt;The second version brings in 8 full cache lines per hardware thread (assuming the same SIMD32 compilation, 32 work-items × 16 bytes = 512 bytes), which is much better; in fact, it is the best you can do on our hardware.&lt;/P&gt;</description>
      <pubDate>Tue, 07 Jun 2016 18:50:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/gpu-cache-friendly/m-p/1088516#M4857</guid>
      <dc:creator>Robert_I_Intel</dc:creator>
      <dc:date>2016-06-07T18:50:31Z</dc:date>
    </item>
    <item>
      <title>thanks for nice info, Robert.</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/gpu-cache-friendly/m-p/1088517#M4858</link>
      <description>&lt;P&gt;thanks for nice info, Robert.&lt;/P&gt;

&lt;P&gt;More questions:&lt;/P&gt;

&lt;P&gt;- does GPU do prefetch?&lt;/P&gt;

&lt;P&gt;- in vtune, "L3 shader bandwidth" of kernel 1 is 28 GB/s and for kernel 2 , the bandwidth is 10GB/s. &amp;nbsp; This looks opposite to my understanding.&lt;/P&gt;</description>
      <pubDate>Tue, 07 Jun 2016 20:46:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/gpu-cache-friendly/m-p/1088517#M4858</guid>
      <dc:creator>Fu_J_Intel</dc:creator>
      <dc:date>2016-06-07T20:46:59Z</dc:date>
    </item>
    <item>
      <title>Jeffrey,</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/gpu-cache-friendly/m-p/1088518#M4859</link>
      <description>&lt;P&gt;Jeffrey,&lt;/P&gt;

&lt;P&gt;The only explanation I have for this: the first kernel brings in more data but uses only half of it. The metric that matters here is the LLC-to-L3 cache bandwidth, which is better for the second kernel.&lt;/P&gt;</description>
      <pubDate>Tue, 07 Jun 2016 22:32:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/gpu-cache-friendly/m-p/1088518#M4859</guid>
      <dc:creator>Robert_I_Intel</dc:creator>
      <dc:date>2016-06-07T22:32:37Z</dc:date>
    </item>
  </channel>
</rss>

