Address bits in Xe-LP "send.dc0" instruction

fdw — Wed, 20 Jul 2022 01:20:01 GMT

Consider the following OpenCL kernel:

#pragma OPENCL EXTENSION cl_intel_subgroups : enable #define block_read4f(p) as_float4(intel_sub_group_block_read4((__global uint *) p)) #define block_write4f(p, v) intel_sub_group_block_write4((__global uint *) p, as_uint4(v)) __attribute__((intel_reqd_sub_group_size(8))) __kernel void myk(__global float* a) { int g_col = 32*get_global_id(0) / 8; float4 temp = block_read4f(a + g_col); block_write4f(a + g_col, 2*temp); }

On my Tigerlake system running 64-bit Linux the kernel compiles to:

(W) mov (8|M0) r2.0<1>:ud r0.0<1;1,0>:ud (W) or (1|M0) cr0.0<1>:ud cr0.0<0;1,0>:ud 0x4C0:uw {@1} (W) mul (1|M0) acc0.0<1>:d r5.3<0;1,0>:d r2.2<0;1,0>:uw {@1} (W) mach (1|M0) r3.0<1>:d r5.3<0;1,0>:d r2.1<0;1,0>:d (W) mov (8|M0) r127.0<1>:f r2.0<8;8,1>:f {Compacted} add (8|M0) acc0.0<1>:d r3.0<0;1,0>:d r1.0<8;8,1>:uw {Compacted,@2} add (8|M0) acc0.0<1>:d acc0.0<8;8,1>:d r4.0<0;1,0>:d shl (8|M0) r6.0<1>:d acc0.0<8;8,1>:d 4:w add (8|M0) r7.0<1>:d r6.0<8;8,1>:d r5.2<0;1,0>:d {Compacted,@1} (W) mov (1|M0) r12.2<1>:f r7.0<0;1,0>:f {Compacted,@1} (W) shr (1|M0) r17.2<1>:ud r7.0<0;1,0>:ud 0x4:uw (W) send.dc0 (16|M0) r8 r12 null 0x0 0x02484400 {@2,$0} // wr:1h+0, rd:4; oword aligned block read x8 mul (8|M0) r13.0<1>:f r8.0<8;8,1>:f 2.0:f {Compacted,$0.dst} mul (8|M0) r14.0<1>:f r9.0<8;8,1>:f 2.0:f {Compacted} mul (8|M0) r15.0<1>:f r10.0<8;8,1>:f 2.0:f {Compacted} mul (8|M0) r16.0<1>:f r11.0<8;8,1>:f 2.0:f {Compacted} (W) send.dc0 (16|M0) null r17 r13 0x100 0x020A0400 {@1,$1} // wr:1h+4, rd:0; oword block write x8 (W) send.dc0 (8|M0) r18 r2 null 0x0 0x0219E000 {$2} // wr:1h+0, rd:1; synchronized global fence flushing (W) mov (8|M0) null<1>:ud r18.0<8;8,1>:ud {$2.dst} (W) send.ts (8|M0) null r127 null 0x0 0x02000010 {EOT,@1} // wr:1+0, rd:0; end of thread

The key point of interest is the send.dc0 instruction which handles the reads and writes. For the first read the address is in r12. From this I have five questions:

What is the instruction doing with the first two components of r12? Our address appears to land in the third component.
Given the address is 32-bits, what trick am I missing that makes this safe on a 64-bit runtime where I can successfully allocate numerous 4 GiB memory objects?
Why does the write operation shift its address right by four bits? I presume this has something to do with the 16-bit alignment requirement for writes, but am unsure.
What is the significance of the 0x100 immediate in the write?
Going through the compiler it appears as if DC stands for data cache. On some kernels I see some arguments interacted with via dc0 (as above) and others via dc1 (which has an A64 instruction which does take 64-bit addresses). Other than supporting a slightly different set of read/write operations, are there any other major differences between them?

Re:Address bits in Xe-LP "send.dc0" instruction

NoorjahanSk_Intel — Wed, 20 Jul 2022 09:07:02 GMT

Hi,

Thanks for reaching out to us.

We are working on your issue. We will get back to you soon.

Thanks & Regards,

Noorjahan.

topic Re:Address bits in Xe-LP "send.dc0" instruction in GPU Compute Software

Address bits in Xe-LP "send.dc0" instruction

Re:Address bits in Xe-LP "send.dc0" instruction