Address bits in Xe-LP "send.dc0" instruction

fdw · ‎07-19-2022

Consider the following OpenCL kernel:

#pragma OPENCL EXTENSION cl_intel_subgroups : enable
#define block_read4f(p) as_float4(intel_sub_group_block_read4((__global uint *) p))
#define block_write4f(p, v) intel_sub_group_block_write4((__global uint *) p, as_uint4(v))

__attribute__((intel_reqd_sub_group_size(8)))
__kernel void
myk(__global float* a)
{
    int g_col = 32*get_global_id(0) / 8;
    float4 temp = block_read4f(a + g_col);
    block_write4f(a + g_col, 2*temp);
}

On my Tigerlake system running 64-bit Linux the kernel compiles to:

(W)     mov (8|M0)               r2.0<1>:ud    r0.0<1;1,0>:ud                  
(W)     or (1|M0)                cr0.0<1>:ud   cr0.0<0;1,0>:ud   0x4C0:uw              {@1}
(W)     mul (1|M0)               acc0.0<1>:d   r5.3<0;1,0>:d     r2.2<0;1,0>:uw   {@1}
(W)     mach (1|M0)              r3.0<1>:d     r5.3<0;1,0>:d     r2.1<0;1,0>:d   
(W)     mov (8|M0)               r127.0<1>:f   r2.0<8;8,1>:f                    {Compacted}
        add (8|M0)               acc0.0<1>:d   r3.0<0;1,0>:d     r1.0<8;8,1>:uw   {Compacted,@2}
        add (8|M0)               acc0.0<1>:d   acc0.0<8;8,1>:d   r4.0<0;1,0>:d   
        shl (8|M0)               r6.0<1>:d     acc0.0<8;8,1>:d   4:w              
        add (8|M0)               r7.0<1>:d     r6.0<8;8,1>:d     r5.2<0;1,0>:d    {Compacted,@1}
(W)     mov (1|M0)               r12.2<1>:f    r7.0<0;1,0>:f                    {Compacted,@1}
(W)     shr (1|M0)               r17.2<1>:ud   r7.0<0;1,0>:ud    0x4:uw             
(W)     send.dc0 (16|M0)         r8       r12     null    0x0            0x02484400           {@2,$0} // wr:1h+0, rd:4; oword aligned block read x8
        mul (8|M0)               r13.0<1>:f    r8.0<8;8,1>:f     2.0:f               {Compacted,$0.dst}
        mul (8|M0)               r14.0<1>:f    r9.0<8;8,1>:f     2.0:f               {Compacted}
        mul (8|M0)               r15.0<1>:f    r10.0<8;8,1>:f    2.0:f               {Compacted}
        mul (8|M0)               r16.0<1>:f    r11.0<8;8,1>:f    2.0:f               {Compacted}
(W)     send.dc0 (16|M0)         null     r17     r13     0x100            0x020A0400           {@1,$1} // wr:1h+4, rd:0; oword block write x8
(W)     send.dc0 (8|M0)          r18      r2      null    0x0            0x0219E000           {$2} // wr:1h+0, rd:1; synchronized global fence flushing
(W)     mov (8|M0)               null<1>:ud    r18.0<8;8,1>:ud                  {$2.dst}
(W)     send.ts (8|M0)           null     r127    null    0x0            0x02000010           {EOT,@1} // wr:1+0, rd:0; end of thread

The key point of interest is the send.dc0 instruction which handles the reads and writes. For the first read the address is in r12. From this I have five questions:

What is the instruction doing with the first two components of r12? Our address appears to land in the third component.
Given the address is 32-bits, what trick am I missing that makes this safe on a 64-bit runtime where I can successfully allocate numerous 4 GiB memory objects?
Why does the write operation shift its address right by four bits? I presume this has something to do with the 16-bit alignment requirement for writes, but am unsure.
What is the significance of the 0x100 immediate in the write?
Going through the compiler it appears as if DC stands for data cache. On some kernels I see some arguments interacted with via dc0 (as above) and others via dc1 (which has an A64 instruction which does take 64-bit addresses). Other than supporting a slightly different set of read/write operations, are there any other major differences between them?

NoorjahanSk_Intel · ‎07-20-2022

Hi,

Thanks for reaching out to us.

We are working on your issue. We will get back to you soon.

Thanks & Regards,

Noorjahan.

Address bits in Xe-LP "send.dc0" instruction

OpenCL* for GPU