- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Consider the following OpenCL kernel:
#pragma OPENCL EXTENSION cl_intel_subgroups : enable
#define block_read4f(p) as_float4(intel_sub_group_block_read4((__global uint *) p))
#define block_write4f(p, v) intel_sub_group_block_write4((__global uint *) p, as_uint4(v))
__attribute__((intel_reqd_sub_group_size(8)))
__kernel void
myk(__global float* a)
{
int g_col = 32*get_global_id(0) / 8;
float4 temp = block_read4f(a + g_col);
block_write4f(a + g_col, 2*temp);
}
On my Tigerlake system running 64-bit Linux the kernel compiles to:
(W) mov (8|M0) r2.0<1>:ud r0.0<1;1,0>:ud
(W) or (1|M0) cr0.0<1>:ud cr0.0<0;1,0>:ud 0x4C0:uw {@1}
(W) mul (1|M0) acc0.0<1>:d r5.3<0;1,0>:d r2.2<0;1,0>:uw {@1}
(W) mach (1|M0) r3.0<1>:d r5.3<0;1,0>:d r2.1<0;1,0>:d
(W) mov (8|M0) r127.0<1>:f r2.0<8;8,1>:f {Compacted}
add (8|M0) acc0.0<1>:d r3.0<0;1,0>:d r1.0<8;8,1>:uw {Compacted,@2}
add (8|M0) acc0.0<1>:d acc0.0<8;8,1>:d r4.0<0;1,0>:d
shl (8|M0) r6.0<1>:d acc0.0<8;8,1>:d 4:w
add (8|M0) r7.0<1>:d r6.0<8;8,1>:d r5.2<0;1,0>:d {Compacted,@1}
(W) mov (1|M0) r12.2<1>:f r7.0<0;1,0>:f {Compacted,@1}
(W) shr (1|M0) r17.2<1>:ud r7.0<0;1,0>:ud 0x4:uw
(W) send.dc0 (16|M0) r8 r12 null 0x0 0x02484400 {@2,$0} // wr:1h+0, rd:4; oword aligned block read x8
mul (8|M0) r13.0<1>:f r8.0<8;8,1>:f 2.0:f {Compacted,$0.dst}
mul (8|M0) r14.0<1>:f r9.0<8;8,1>:f 2.0:f {Compacted}
mul (8|M0) r15.0<1>:f r10.0<8;8,1>:f 2.0:f {Compacted}
mul (8|M0) r16.0<1>:f r11.0<8;8,1>:f 2.0:f {Compacted}
(W) send.dc0 (16|M0) null r17 r13 0x100 0x020A0400 {@1,$1} // wr:1h+4, rd:0; oword block write x8
(W) send.dc0 (8|M0) r18 r2 null 0x0 0x0219E000 {$2} // wr:1h+0, rd:1; synchronized global fence flushing
(W) mov (8|M0) null<1>:ud r18.0<8;8,1>:ud {$2.dst}
(W) send.ts (8|M0) null r127 null 0x0 0x02000010 {EOT,@1} // wr:1+0, rd:0; end of thread
The key point of interest is the send.dc0 instruction which handles the reads and writes. For the first read the address is in r12. From this I have five questions:
- What is the instruction doing with the first two components of r12? Our address appears to land in the third component.
- Given the address is 32-bits, what trick am I missing that makes this safe on a 64-bit runtime where I can successfully allocate numerous 4 GiB memory objects?
- Why does the write operation shift its address right by four bits? I presume this has something to do with the 16-bit alignment requirement for writes, but am unsure.
- What is the significance of the 0x100 immediate in the write?
- Going through the compiler it appears as if DC stands for data cache. On some kernels I see some arguments interacted with via dc0 (as above) and others via dc1 (which has an A64 instruction which does take 64-bit addresses). Other than supporting a slightly different set of read/write operations, are there any other major differences between them?
Link Copied
1 Reply
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
Thanks for reaching out to us.
We are working on your issue. We will get back to you soon.
Thanks & Regards,
Noorjahan.
Reply
Topic Options
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page