topic GEN instruction explanation? in OpenCL* for CPU

GEN instruction explanation?

allanmac1 — Tue, 26 Jul 2016 21:53:56 GMT

I'm storing 8x64-bit quad-words (SIMD8) to SLM and am trying to understand some curious GEN sequences.

The OpenCL line of code in question is a store to a doubly indexed array in SLM:

shared.m0[2][local_id] = r1;

Why does this indexed store to SLM result in 4-6 "mov" operations and two sends?

I assume some MOV operations are necessary to prepare a SEND "message"?

But why are there two SEND ops?

send     (8|M0)         null:ud       r27:ud            0xA       0x40F0020 //  hdc.dc0  wr:2h, rd:0, wr.scrdwfc: 0x70020
send     (8|M0)         null:ud       r59:ud            0xC       0x6026CFE //  hdc.dc1  wr:3, rd:0, wr.usurf msc:44, to SLM

I understand the second SEND but what is the first doing that's necessary? Is it a queue barrier of some sort?

Also, why are there so many MOV operations for this 8x64-bit SIMD8 store?

allanmac1 — Thu, 28 Jul 2016 00:49:00 GMT

I took a look at the code generation for SIMD8 x 32-bit global and local loads and stores and it looks nice and compact with typically an ADD (pointer increment), MOV and SEND.

Is this just code generation that needs to improve, or would it be beneficial to load/store the low and high 32-bit words of a 64-bit word separately?

I was assuming that only one SEND operation would be generated for a SIMD8 x 64-bit load/store (64 bytes/clock).
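To make the low/high split concrete, here is a minimal sketch in plain C rather than OpenCL C, so it compiles on its own. The names `store_split`, `load_split`, and `LANES` are hypothetical and do not come from the kernel in question. It only shows the data layout: each 64-bit value is kept as two 32-bit words in separate arrays. Whether this produces more compact GEN code than a single 64-bit store is not established in this thread.

```c
#include <stdint.h>

#define LANES 8 /* SIMD8: one element per lane (illustrative constant) */

/* Store a 64-bit value as two 32-bit words in separate arrays. */
static void store_split(uint32_t lo[LANES], uint32_t hi[LANES],
                        int lane, uint64_t v)
{
    lo[lane] = (uint32_t)(v & 0xFFFFFFFFu); /* low 32 bits  */
    hi[lane] = (uint32_t)(v >> 32);         /* high 32 bits */
}

/* Rebuild the 64-bit value from its two halves. */
static uint64_t load_split(const uint32_t lo[LANES],
                           const uint32_t hi[LANES], int lane)
{
    return ((uint64_t)hi[lane] << 32) | (uint64_t)lo[lane];
}
```

In a kernel the two arrays would live in local memory and `lane` would be the local ID, so each half becomes an ordinary SIMD8 x 32-bit store.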

Timothy_B_Intel — Fri, 05 Aug 2016 00:26:16 GMT

The first send looks to be a scratch DWORD write. This typically happens on a spill (out of registers) or when one accesses a private array with a dynamic index:

ptr[i] = ...; // where i is a variable (not a constant)
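As an illustration of that dynamic-index pattern, here is a small sketch in plain C (not OpenCL C, so it builds standalone). The function names `pick_dynamic` and `pick_select` are made up for this example. The first function indexes a small private array with a run-time value, which is the kind of access that can push the array out to scratch. The second is one possible rewrite using selects on named values. Whether a given compiler then keeps everything in registers is an assumption, not something confirmed here.

```c
/* Private array indexed by a run-time value: the compiler cannot
   pin each element to a fixed register, so it may place the array
   in scratch memory. */
static float pick_dynamic(float a, float b, float c, float d, int i)
{
    float v[4] = { a, b, c, d };
    return v[i & 3];
}

/* Possible rewrite: choose among named values with selects, so no
   array is indexed dynamically. */
static float pick_select(float a, float b, float c, float d, int i)
{
    i &= 3;
    float lo = (i & 1) ? b : a; /* element 0 or 1 */
    float hi = (i & 1) ? d : c; /* element 2 or 3 */
    return (i & 2) ? hi : lo;
}
```

Both functions return the same element for any `i`; only the way the element is selected differs.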

Can you query the value of CL_KERNEL_PRIVATE_MEM_SIZE with clGetKernelWorkGroupInfo()?

Can you maybe show us the CL code? Or a small reproducer for that code?

Regards,

- Tim

allanmac1 — Fri, 05 Aug 2016 18:01:00 GMT

Thanks, I just took a look and private memory is reported to be 0 whether building a binary or compiling from kernel source at runtime:

kernel info:
    Maximum work-group size: 256
    Compiler work-group size: (0, 0, 0)
    Local memory size: 32704
    Preferred multiple of work-group size: 8
    Minimum amount of private memory: 0

I'll keep digging and simplifying to see if I can squash this bug.

I'll send a reproducer if I don't see an improvement.

Timothy_B_Intel — Wed, 10 Aug 2016 16:52:53 GMT

Sounds great. Let me know how it works out.

Can you show us more of the kernel (OpenCL or assembly)? Specifically, I am interested in the structure type for your local memory and how you access it (more than just that line). There are some other cases where we can "spill", but we can almost always tweak the GPU program to fix that.