<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re:A question about data prefetch in kernel programming in GPU Compute Software</title>
    <link>https://community.intel.com/t5/GPU-Compute-Software/A-question-about-data-prefetch-in-kernel-programming/m-p/1470625#M791</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;Intel® UHD Graphics 630 does not have a prefetch. Our newer GPUs do, and on those the prefetch builtin described &lt;A href="https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_C.html#async-copies" rel="noopener noreferrer" target="_blank"&gt;here&lt;/A&gt; can be used to prefetch.&lt;/P&gt;&lt;P&gt;Have you tried &lt;A href="https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-dpcpp/2023-0/gemm.html" rel="noopener noreferrer" target="_blank"&gt;oneMKL&lt;/A&gt; as a baseline for your matrix multiplication?&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Dunni&lt;/P&gt;&lt;BR /&gt;</description>
    <pubDate>Tue, 28 Mar 2023 09:13:10 GMT</pubDate>
    <dc:creator>Dunni_A_Intel</dc:creator>
    <dc:date>2023-03-28T09:13:10Z</dc:date>
    <item>
      <title>A question about data prefetch in kernel programming</title>
      <link>https://community.intel.com/t5/GPU-Compute-Software/A-question-about-data-prefetch-in-kernel-programming/m-p/1445637#M727</link>
      <description>&lt;P&gt;hi, dear Intel team,&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; I'm working on optimizing 1024 x 1024 matrix multiplication on an Intel Gen9 GPU. Here is my pseudocode:&lt;/P&gt;
&lt;P&gt;numTiles = 1024 /4&lt;/P&gt;
&lt;P&gt;__local Asub, Bsub, C_temp&lt;/P&gt;
&lt;P&gt;for (t = 0; t &amp;lt; numTiles; t++) {&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; Asub[4][4] = load 4X4 SP float data from matrix A (using vload4)&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; Bsub[4][4] = load 4X4 SP float data from matrix B&amp;nbsp;(using vload4)&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; C_temp[4][4] += Asub * Bsub }&lt;/P&gt;
&lt;P&gt;C_sub = C_temp&amp;nbsp;(using vstore4)&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For one work item, Asub and Bsub step through 4 rows of matrix A and 4 columns of matrix B to produce the final 4X4 elements of C_sub.&amp;nbsp;&lt;/P&gt;
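&lt;P&gt;A minimal Python sketch of the per-work-item loop described above, assuming square row-major matrices whose size is a multiple of 4. The names work_item_tile and TILE are hypothetical and are not taken from the attached code. It only checks the tile indexing on the host and says nothing about GPU performance.&lt;/P&gt;
&lt;PRE&gt;
```python
TILE = 4  # each work item owns a TILE x TILE block of C


def work_item_tile(A, B, n, gx, gy):
    """Compute the 4x4 block of C owned by work item (gx, gy).

    A and B are n x n row-major lists of lists. Assumption: n is a
    multiple of TILE.
    """
    num_tiles = n // TILE
    c_temp = [[0.0] * TILE for _ in range(TILE)]
    for t in range(num_tiles):
        # Asub: 4 rows of A owned by this work item, columns of tile t
        a_sub = [A[gy * TILE + r][t * TILE:(t + 1) * TILE] for r in range(TILE)]
        # Bsub: rows of tile t, 4 columns of B owned by this work item
        b_sub = [B[t * TILE + r][gx * TILE:(gx + 1) * TILE] for r in range(TILE)]
        # C_temp += Asub * Bsub
        for r in range(TILE):
            for c in range(TILE):
                for k in range(TILE):
                    c_temp[r][c] += a_sub[r][k] * b_sub[k][c]
    return c_temp
```
&lt;/PRE&gt;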
&lt;P&gt;Using the "dynamic instruction count" analysis in VTune Amplifier, I found that loading the 4X4 data from global memory to local memory accounts for a large share of the instruction count. Could I rewrite my code to prefetch the data and hide the latency of loading from global memory to local memory? Maybe like this:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;numTiles = 1024 /4&lt;/P&gt;
&lt;P&gt;int t = 0&lt;/P&gt;
&lt;P&gt;__local Asub_current, Bsub_current, Asub_new, Bsub_new, C_temp&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; Asub_current[4][4] = load 4X4 SP float data from matrix A (using vload4)&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; Bsub_current[4][4] = load 4X4 SP float data from matrix B&amp;nbsp;(using vload4)&lt;/P&gt;
&lt;P&gt;for (t = 1; t &amp;lt; numTiles; t++) {&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; Asub_new[4][4] = load 4X4 SP float data from matrix A (using vload4)&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; Bsub_new[4][4] = load 4X4 SP float data from matrix B&amp;nbsp;(using vload4)&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; C_temp[4][4] += Asub_current * Bsub_current&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; Asub_current = Asub_new&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; Bsub_current = Bsub_new&amp;nbsp;&amp;nbsp;&lt;SPAN&gt;}&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;C_temp[4][4] += Asub_current * Bsub_current&amp;nbsp;(last tile, so the loop never loads past the end)&lt;/P&gt;
&lt;P&gt;C_sub = C_temp&amp;nbsp;(using vstore4)&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; My understanding is that when the GPU begins to load Asub_new and Bsub_new, it doesn't need to wait until the loads complete and can begin the multiplication immediately. After the multiplication is done, the GPU can copy the new data into the current matrices. Is this possible? If not, how can I program a "prefetch" to hide the data transfer latency?&lt;/P&gt;
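&lt;P&gt;The loop structure for this kind of software double buffering can be sketched in Python as follows. This is only a host-side model: load_tile is a hypothetical helper standing in for the vload4 reads, and the matrices are assumed to be square, row-major, and a multiple of 4 in size. It checks that issuing the next load before the multiply gives the same result as the simple loop. Whether the load really overlaps with the multiply on the GPU depends on the compiler and the hardware, which this sketch does not model.&lt;/P&gt;
&lt;PRE&gt;
```python
TILE = 4


def load_tile(M, row0, col0):
    """Hypothetical stand-in for the vload4 reads: copy a 4x4 block of M."""
    return [M[row0 + r][col0:col0 + TILE] for r in range(TILE)]


def accumulate(c_temp, a_sub, b_sub):
    """c_temp += a_sub * b_sub for 4x4 blocks."""
    for r in range(TILE):
        for c in range(TILE):
            for k in range(TILE):
                c_temp[r][c] += a_sub[r][k] * b_sub[k][c]


def double_buffered_tile(A, B, n, gx, gy):
    """One work item's 4x4 block of C, with the next tile loaded early."""
    num_tiles = n // TILE
    c_temp = [[0.0] * TILE for _ in range(TILE)]
    # Prologue: load tile 0 before entering the loop.
    a_cur = load_tile(A, gy * TILE, 0)
    b_cur = load_tile(B, 0, gx * TILE)
    for t in range(1, num_tiles):
        # Issue the loads for tile t first ...
        a_new = load_tile(A, gy * TILE, t * TILE)
        b_new = load_tile(B, t * TILE, gx * TILE)
        # ... then multiply tile t-1, which does not depend on them.
        accumulate(c_temp, a_cur, b_cur)
        a_cur, b_cur = a_new, b_new
    # Epilogue: the last loaded tile still has to be multiplied.
    accumulate(c_temp, a_cur, b_cur)
    return c_temp
```
&lt;/PRE&gt;
&lt;P&gt;The prologue and epilogue keep the number of loads equal to the number of tiles, so no tile is read beyond the end of A or B.&lt;/P&gt;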
&lt;P&gt;&amp;nbsp; &amp;nbsp; Thanks a lot!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; By the way, by using subgroups and subgroup shuffles, I can achieve 313 GFLOPS on my UHD 630 GPU for 1024x1024 matrix multiplication, which is 66% of its peak performance. The VTune dynamic instruction count analysis shows that subgroup block reads are much more efficient than vloadn. But I want to avoid subgroup functions to keep the code easier to port. Currently I achieve 150 GFLOPS with my first pseudocode. Still working on it.&lt;/P&gt;</description>
      <pubDate>Wed, 11 Jan 2023 15:27:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/GPU-Compute-Software/A-question-about-data-prefetch-in-kernel-programming/m-p/1445637#M727</guid>
      <dc:creator>Scout</dc:creator>
      <dc:date>2023-01-11T15:27:20Z</dc:date>
    </item>
    <item>
      <title>Re:A question about data prefetch in kernel programming</title>
      <link>https://community.intel.com/t5/GPU-Compute-Software/A-question-about-data-prefetch-in-kernel-programming/m-p/1446033#M728</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks for posting in the Intel forums.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Could you please provide us with the complete reproducer code and steps to reproduce the issue at our end?&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Could you also please provide us with the OS details, and version of Intel oneAPI you have been using?&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks &amp;amp; Regards&lt;/P&gt;&lt;P&gt;Shivani&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Thu, 12 Jan 2023 10:30:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/GPU-Compute-Software/A-question-about-data-prefetch-in-kernel-programming/m-p/1446033#M728</guid>
      <dc:creator>ShivaniK_Intel</dc:creator>
      <dc:date>2023-01-12T10:30:12Z</dc:date>
    </item>
    <item>
      <title>Re: Re:A question about data prefetch in kernel programming</title>
      <link>https://community.intel.com/t5/GPU-Compute-Software/A-question-about-data-prefetch-in-kernel-programming/m-p/1446072#M729</link>
      <description>&lt;P&gt;hi,&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; Please see the attached file. Just run make_run.sh.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; The program supports these inputs:&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; -w&amp;nbsp; matrix A width&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; -h&amp;nbsp; matrix A height/matrix B width&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; -s&amp;nbsp; matrix B height&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp;-m&amp;nbsp; global work item number in x dimension&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp;-n&amp;nbsp; global work item number in y dimension&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp;-x&amp;nbsp; local&amp;nbsp;work item number in x dimension&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp;-y&amp;nbsp; local&amp;nbsp;work item number in y dimension&lt;/P&gt;
&lt;P&gt;&amp;nbsp; -d&amp;nbsp; display results or not&lt;/P&gt;
&lt;P&gt;&amp;nbsp; -e&amp;nbsp; calculate matrix C gold result or not. (very slow).&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp;For example, in the make_run.sh, it runs like this:&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp;./hello_world -w 1024 -h 1024 -s 1024 -m 256 -n 256 -x 16 -y 16 -d 0 -e 0&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp;Matrix height/width must be 4 times the global work item number, because each work item computes 4X4 elements of the result matrix C.&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp;VTune shows that vload4 accounts for a large share of the instruction count:&lt;/P&gt;
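&lt;P&gt;The size constraint can be checked before launching the kernel. The Python sketch below uses a hypothetical helper, check_launch, and makes three assumptions: the x dimension maps to the matrix width and y to the height, work-groups are uniform (global size divisible by local size), and the work-group limit is the 256 reported for this device.&lt;/P&gt;
&lt;PRE&gt;
```python
TILE = 4  # each work item computes a TILE x TILE block of C


def check_launch(width, height, global_x, global_y, local_x, local_y,
                 max_work_group_size=256):
    """Return a list of problems with the launch geometry (empty if OK).

    max_work_group_size defaults to the value reported for the device
    in this thread; query the device on other hardware.
    """
    problems = []
    if width != TILE * global_x:
        problems.append("width must equal 4 * global x size")
    if height != TILE * global_y:
        problems.append("height must equal 4 * global y size")
    if global_x % local_x != 0 or global_y % local_y != 0:
        problems.append("global size must be a multiple of local size")
    if local_x * local_y not in range(1, max_work_group_size + 1):
        problems.append("work-group size exceeds the device limit")
    return problems
```
&lt;/PRE&gt;
&lt;P&gt;For the command line from make_run.sh, check_launch(1024, 1024, 256, 256, 16, 16) returns an empty list.&lt;/P&gt;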
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Scout_0-1673528977059.png" style="width: 999px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/37008i017D3A65725DF453/image-size/large/is-moderation-mode/true?v=v2&amp;amp;px=999&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="Scout_0-1673528977059.png" alt="Scout_0-1673528977059.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; If we use 2 variables to see the detail:&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Scout_1-1673529624030.png" style="width: 999px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/37009iEA09D3FC77774D76/image-size/large/is-moderation-mode/true?v=v2&amp;amp;px=999&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="Scout_1-1673529624030.png" alt="Scout_1-1673529624030.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp;OS: ubuntu 18.04.&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp;Kernel: 5.6.15-050615-generic&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp;gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp;Intel oneAPI is not used, since I just use GCC to compile the code.&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; Platform info:&lt;/P&gt;
&lt;P&gt;Platform Host timer resolution 1ns&lt;BR /&gt;Platform Extensions function suffix INTEL&lt;/P&gt;
&lt;P&gt;Platform Name Intel(R) CPU Runtime for OpenCL(TM) Applications&lt;BR /&gt;Platform Vendor Intel(R) Corporation&lt;BR /&gt;Platform Version OpenCL 2.1 LINUX&lt;BR /&gt;Platform Profile FULL_PROFILE&lt;BR /&gt;Platform Host timer resolution 1ns&lt;BR /&gt;Platform Extensions function suffix INTEL&lt;/P&gt;
&lt;P&gt;Platform Name Intel(R) OpenCL HD Graphics&lt;BR /&gt;Number of devices 1&lt;BR /&gt;Device Name Intel(R) UHD Graphics 630 [0x9bc8]&lt;BR /&gt;Device Vendor Intel(R) Corporation&lt;BR /&gt;Device Vendor ID 0x8086&lt;BR /&gt;Device Version OpenCL 3.0 NEO &lt;BR /&gt;Driver Version 21.38.21026&lt;BR /&gt;Device OpenCL C Version OpenCL C 3.0 &lt;BR /&gt;Device Type GPU&lt;BR /&gt;Device Profile FULL_PROFILE&lt;BR /&gt;Device Available Yes&lt;BR /&gt;Compiler Available Yes&lt;BR /&gt;Linker Available Yes&lt;BR /&gt;Max compute units 24&lt;BR /&gt;Max clock frequency 1200MHz&lt;BR /&gt;Device Partition (core)&lt;BR /&gt;Max number of sub-devices 0&lt;BR /&gt;Supported partition types None&lt;BR /&gt;Max work item dimensions 3&lt;BR /&gt;Max work item sizes 256x256x256&lt;BR /&gt;Max work group size 256&lt;BR /&gt;Preferred work group size multiple 32&lt;BR /&gt;Max sub-groups per work group 32&lt;BR /&gt;Sub-group sizes (Intel) 8, 16, 3&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp;One more question: is it possible to overlap compute and transfer to hide the latency of global memory to local memory transfers? I did some research but only found information for Nvidia devices on hiding the latency of transfers from system DDR to the accelerator card's GDDR over PCIe.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp;Thanks a lot!&lt;/P&gt;</description>
      <pubDate>Thu, 12 Jan 2023 13:21:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/GPU-Compute-Software/A-question-about-data-prefetch-in-kernel-programming/m-p/1446072#M729</guid>
      <dc:creator>Scout</dc:creator>
      <dc:date>2023-01-12T13:21:31Z</dc:date>
    </item>
    <item>
      <title>Re:A question about data prefetch in kernel programming</title>
      <link>https://community.intel.com/t5/GPU-Compute-Software/A-question-about-data-prefetch-in-kernel-programming/m-p/1470625#M791</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;Intel® UHD Graphics 630 does not have a prefetch. Our newer GPUs do, and on those the prefetch builtin described &lt;A href="https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_C.html#async-copies" rel="noopener noreferrer" target="_blank"&gt;here&lt;/A&gt; can be used to prefetch.&lt;/P&gt;&lt;P&gt;Have you tried &lt;A href="https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-dpcpp/2023-0/gemm.html" rel="noopener noreferrer" target="_blank"&gt;oneMKL&lt;/A&gt; as a baseline for your matrix multiplication?&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Dunni&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Tue, 28 Mar 2023 09:13:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/GPU-Compute-Software/A-question-about-data-prefetch-in-kernel-programming/m-p/1470625#M791</guid>
      <dc:creator>Dunni_A_Intel</dc:creator>
      <dc:date>2023-03-28T09:13:10Z</dc:date>
    </item>
  </channel>
</rss>

