I'm storing 8x64-bit quad-words (SIMD8) to SLM and am trying to understand some curious GEN sequences.
The OpenCL line of code in question is a store to a doubly indexed array in SLM:
shared.m0[local_id] = r1;
Why does this indexed store to SLM result in 4-6 "mov" operations and two sends?
I assume some MOV operations are necessary to prepare a SEND "message"?
But why are there two SEND ops?
send (8|M0) null:ud r27:ud 0xA 0x40F0020 // hdc.dc0 wr:2h, rd:0, wr.scrdwfc: 0x70020
send (8|M0) null:ud r59:ud 0xC 0x6026CFE // hdc.dc1 wr:3, rd:0, wr.usurf msc:44, to SLM
I understand the second SEND but what is the first doing that's necessary? Is it a queue barrier of some sort?
Also, why are there so many MOV operations for this 8x64-bit SIMD8 store?
I took a look at the code generation for SIMD8 x 32-bit global and local loads and stores and it looks nice and compact with typically an ADD (pointer increment), MOV and SEND.
Is this just code generation that needs to improve, or would it be beneficial to load/store the low and high 32-bit words of each 64-bit word separately?
I was assuming that only one SEND operation would be generated for a SIMD8 x 64-bit load/store (64 bytes/clock).
The first send looks to be a scratch DWORD write. This typically happens on a spill (the kernel has run out of registers) or when a private array is accessed with a dynamic index:
ptr[i] = ...; // where i is a variable (not a constant)
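To make the dynamic-index case concrete, here is a minimal sketch of a kernel that writes a private array through a run-time index. It is a hypothetical example and not the poster's code: the kernel name, the arguments idx and out, and the array size 16 are all assumptions made for illustration. Whether a particular compiler version actually emits scratch writes for it may vary.

```opencl
// Hypothetical kernel (illustration only, not the code from this thread).
__kernel void dynamic_private_index(__global const int *idx,
                                    __global float *out)
{
    const size_t gid = get_global_id(0);

    // Private (per-work-item) array, normally kept in registers.
    float ptr[16];
    for (int k = 0; k < 16; ++k)
        ptr[k] = 0.0f;

    // i is only known at run time, so the compiler cannot resolve the
    // element to a fixed register at compile time. It may place the
    // array in scratch memory instead, which would show up as extra
    // scratch read/write messages in the generated code.
    const int i = idx[gid] & 15;   // mask keeps the index in bounds
    ptr[i] = 1.0f;

    out[gid] = ptr[(i + 1) & 15];
}
```

If every index into the array is a compile-time constant (for example after full loop unrolling), the compiler has a better chance of keeping the array in registers.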
Can you query the value of CL_KERNEL_PRIVATE_MEM_SIZE with clGetKernelWorkGroupInfo?
Can you maybe show us the CL code? Or a small reproducer for that code?
Thanks, I just took a look and private memory is reported to be 0 whether building a binary or compiling from kernel source at runtime:
kernel info:
  Maximum work-group size: 256
  Compiler work-group size: (0, 0, 0)
  Local memory size: 32704
  Preferred multiple of work-group size: 8
  Minimum amount of private memory: 0
I'll keep digging and simplifying to see if I can squash this bug.
I'll send a reproducer if I don't see an improvement.
Sounds great. Let me know how it works out.
Can you show us more of the kernel (OpenCL or assembly)? Specifically, I am interested in the structure type for your local memory and how you access it (more than just that line). There are some other cases where we can "spill", but we can almost always tweak the GPU program to fix that.