<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re:Address bits in Xe-LP &amp;quot;send.dc0&amp;quot; instruction in GPU Compute Software</title>
    <link>https://community.intel.com/t5/GPU-Compute-Software/Address-bits-in-Xe-LP-quot-send-dc0-quot-instruction/m-p/1401865#M567</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks for reaching out to us.&lt;/P&gt;&lt;P&gt;We are working on your issue. We will get back to you soon.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks &amp;amp; Regards,&lt;/P&gt;&lt;P&gt;Noorjahan.&lt;/P&gt;&lt;BR /&gt;</description>
    <pubDate>Wed, 20 Jul 2022 09:07:02 GMT</pubDate>
    <dc:creator>NoorjahanSk_Intel</dc:creator>
    <dc:date>2022-07-20T09:07:02Z</dc:date>
    <item>
      <title>Address bits in Xe-LP "send.dc0" instruction</title>
      <link>https://community.intel.com/t5/GPU-Compute-Software/Address-bits-in-Xe-LP-quot-send-dc0-quot-instruction/m-p/1401746#M565</link>
      <description>&lt;P&gt;Consider the following OpenCL kernel:&lt;/P&gt;
&lt;LI-CODE lang="cpp"&gt;#pragma OPENCL EXTENSION cl_intel_subgroups : enable
#define block_read4f(p) as_float4(intel_sub_group_block_read4((__global uint *) p))
#define block_write4f(p, v) intel_sub_group_block_write4((__global uint *) p, as_uint4(v))

__attribute__((intel_reqd_sub_group_size(8)))
__kernel void
myk(__global float* a)
{
    int g_col = 32*get_global_id(0) / 8;
    float4 temp = block_read4f(a + g_col);
    block_write4f(a + g_col, 2*temp);
}
&lt;/LI-CODE&gt;
&lt;P&gt;On my Tigerlake system running 64-bit Linux the kernel compiles to:&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;(W)     mov (8|M0)               r2.0&amp;lt;1&amp;gt;:ud    r0.0&amp;lt;1;1,0&amp;gt;:ud                  
(W)     or (1|M0)                cr0.0&amp;lt;1&amp;gt;:ud   cr0.0&amp;lt;0;1,0&amp;gt;:ud   0x4C0:uw              {@1}
(W)     mul (1|M0)               acc0.0&amp;lt;1&amp;gt;:d   r5.3&amp;lt;0;1,0&amp;gt;:d     r2.2&amp;lt;0;1,0&amp;gt;:uw   {@1}
(W)     mach (1|M0)              r3.0&amp;lt;1&amp;gt;:d     r5.3&amp;lt;0;1,0&amp;gt;:d     r2.1&amp;lt;0;1,0&amp;gt;:d   
(W)     mov (8|M0)               r127.0&amp;lt;1&amp;gt;:f   r2.0&amp;lt;8;8,1&amp;gt;:f                    {Compacted}
        add (8|M0)               acc0.0&amp;lt;1&amp;gt;:d   r3.0&amp;lt;0;1,0&amp;gt;:d     r1.0&amp;lt;8;8,1&amp;gt;:uw   {Compacted,@2}
        add (8|M0)               acc0.0&amp;lt;1&amp;gt;:d   acc0.0&amp;lt;8;8,1&amp;gt;:d   r4.0&amp;lt;0;1,0&amp;gt;:d   
        shl (8|M0)               r6.0&amp;lt;1&amp;gt;:d     acc0.0&amp;lt;8;8,1&amp;gt;:d   4:w              
        add (8|M0)               r7.0&amp;lt;1&amp;gt;:d     r6.0&amp;lt;8;8,1&amp;gt;:d     r5.2&amp;lt;0;1,0&amp;gt;:d    {Compacted,@1}
(W)     mov (1|M0)               r12.2&amp;lt;1&amp;gt;:f    r7.0&amp;lt;0;1,0&amp;gt;:f                    {Compacted,@1}
(W)     shr (1|M0)               r17.2&amp;lt;1&amp;gt;:ud   r7.0&amp;lt;0;1,0&amp;gt;:ud    0x4:uw             
(W)     send.dc0 (16|M0)         r8       r12     null    0x0            0x02484400           {@2,$0} // wr:1h+0, rd:4; oword aligned block read x8
        mul (8|M0)               r13.0&amp;lt;1&amp;gt;:f    r8.0&amp;lt;8;8,1&amp;gt;:f     2.0:f               {Compacted,$0.dst}
        mul (8|M0)               r14.0&amp;lt;1&amp;gt;:f    r9.0&amp;lt;8;8,1&amp;gt;:f     2.0:f               {Compacted}
        mul (8|M0)               r15.0&amp;lt;1&amp;gt;:f    r10.0&amp;lt;8;8,1&amp;gt;:f    2.0:f               {Compacted}
        mul (8|M0)               r16.0&amp;lt;1&amp;gt;:f    r11.0&amp;lt;8;8,1&amp;gt;:f    2.0:f               {Compacted}
(W)     send.dc0 (16|M0)         null     r17     r13     0x100            0x020A0400           {@1,$1} // wr:1h+4, rd:0; oword block write x8
(W)     send.dc0 (8|M0)          r18      r2      null    0x0            0x0219E000           {$2} // wr:1h+0, rd:1; synchronized global fence flushing
(W)     mov (8|M0)               null&amp;lt;1&amp;gt;:ud    r18.0&amp;lt;8;8,1&amp;gt;:ud                  {$2.dst}
(W)     send.ts (8|M0)           null     r127    null    0x0            0x02000010           {EOT,@1} // wr:1+0, rd:0; end of thread&lt;/LI-CODE&gt;
&lt;P&gt;The key point of interest is the send.dc0 instruction which handles the reads and writes.&amp;nbsp; For the first read the address is in r12.&amp;nbsp; From this I have five questions:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;What is the instruction doing with the first two components of r12?&amp;nbsp; Our address appears to land in the third component.&lt;/LI&gt;
&lt;LI&gt;Given the address is 32-bits, what trick am I missing that makes this safe on a 64-bit runtime where I can successfully allocate numerous 4 GiB memory objects?&lt;/LI&gt;
&lt;LI&gt;Why does the write operation shift its address right by four bits?&amp;nbsp; I presume this has something to do with the 16-bit alignment requirement for writes, but am unsure.&lt;/LI&gt;
&lt;LI&gt;What is the significance of the 0x100 immediate in the write?&lt;/LI&gt;
&lt;LI&gt;Going through the compiler it appears as if DC stands for data cache.&amp;nbsp; On some kernels I see some arguments interacted with via dc0 (as above) and others via dc1 (which has an A64 instruction which does take 64-bit addresses).&amp;nbsp; Other than supporting a slightly different set of read/write operations, are there any other major differences between them?&lt;/LI&gt;
&lt;/OL&gt;</description>
      <pubDate>Wed, 20 Jul 2022 01:20:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/GPU-Compute-Software/Address-bits-in-Xe-LP-quot-send-dc0-quot-instruction/m-p/1401746#M565</guid>
      <dc:creator>fdw</dc:creator>
      <dc:date>2022-07-20T01:20:01Z</dc:date>
    </item>
    <item>
      <title>Re:Address bits in Xe-LP "send.dc0" instruction</title>
      <link>https://community.intel.com/t5/GPU-Compute-Software/Address-bits-in-Xe-LP-quot-send-dc0-quot-instruction/m-p/1401865#M567</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks for reaching out to us.&lt;/P&gt;&lt;P&gt;We are working on your issue. We will get back to you soon.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks &amp;amp; Regards,&lt;/P&gt;&lt;P&gt;Noorjahan.&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Wed, 20 Jul 2022 09:07:02 GMT</pubDate>
      <guid>https://community.intel.com/t5/GPU-Compute-Software/Address-bits-in-Xe-LP-quot-send-dc0-quot-instruction/m-p/1401865#M567</guid>
      <dc:creator>NoorjahanSk_Intel</dc:creator>
      <dc:date>2022-07-20T09:07:02Z</dc:date>
    </item>
  </channel>
</rss>

