OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU
This forum covers OpenCL* for CPU only. OpenCL* for GPU questions can be asked in the GPU Compute Software forum. Intel® FPGA SDK for OpenCL™ questions can be asked in the FPGA Intel® High Level Design forum.

GEN instruction explanation?


I'm storing 8x64-bit quad-words (SIMD8) to SLM and am trying to understand some curious GEN sequences.

The OpenCL line of code in question is a store to a doubly indexed array in SLM:

shared.m0[2][local_id] = r1;

Why does this indexed store to SLM result in 4-6 "mov" operations and two sends?

I assume some MOV operations are necessary to prepare a SEND "message"?

But why are there two SEND ops? 

send     (8|M0)         null:ud       r27:ud            0xA       0x40F0020 //  hdc.dc0  wr:2h, rd:0, wr.scrdwfc: 0x70020
send     (8|M0)         null:ud       r59:ud            0xC       0x6026CFE //  hdc.dc1  wr:3, rd:0, wr.usurf msc:44, to SLM

I understand the second SEND but what is the first doing that's necessary?  Is it a queue barrier of some sort?

Also, why are there so many MOV operations for this 8x64-bit SIMD8 store?




I took a look at the code generation for SIMD8 x 32-bit global and local loads and stores, and it looks nice and compact: typically an ADD (pointer increment), a MOV, and a SEND.

Is this just code generation that needs to improve, or would it be beneficial to load/store the low and high 32-bit words of each 64-bit word separately?

I was assuming that only one SEND operation would be generated for a SIMD8 x 64-bit load/store (64 bytes/clock).
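
For reference, here is a minimal sketch of what the low/high split mentioned above could look like. It is written as plain host-side C so it can be compiled anywhere; it is a stand-in for the kernel code, not the kernel itself. The names `split_slm_t`, `store_split` and `load_split` are hypothetical, and whether this layout actually produces fewer MOVs or SENDs on GEN is untested here. In OpenCL C the same reinterpretation could presumably be done with `as_uint2()` on the 64-bit value.

```c
#include <assert.h>
#include <stdint.h>

#define SIMD_WIDTH 8 /* lanes in one SIMD8 hardware thread */

/* Hypothetical layout: two 32-bit planes instead of one array of
   64-bit words, so each store is a SIMD8 x 32-bit access. */
typedef struct {
    uint32_t lo[SIMD_WIDTH]; /* low 32 bits of each lane's value  */
    uint32_t hi[SIMD_WIDTH]; /* high 32 bits of each lane's value */
} split_slm_t;

/* Store one lane's 64-bit value as two 32-bit halves. */
static void store_split(split_slm_t *slm, unsigned lane, uint64_t v)
{
    assert(lane < SIMD_WIDTH);
    slm->lo[lane] = (uint32_t)(v & 0xFFFFFFFFu);
    slm->hi[lane] = (uint32_t)(v >> 32);
}

/* Reassemble one lane's 64-bit value from the two planes. */
static uint64_t load_split(const split_slm_t *slm, unsigned lane)
{
    assert(lane < SIMD_WIDTH);
    return ((uint64_t)slm->hi[lane] << 32) | (uint64_t)slm->lo[lane];
}
```

In a kernel, `lane` would correspond to `local_id`, and the two planes would live in the `__local` struct in place of the 64-bit array. The trade-off is two 32-bit stores per value instead of one 64-bit store, so it only pays off if the 64-bit path really is generating extra work.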


The first send looks to be a scratch DWORD write. This typically happens on a spill (out of registers) or when one accesses a private array with a dynamic index:

   ptr[i] = ...; // where i is a variable (not a constant)

Can you query the value of CL_KERNEL_PRIVATE_MEM_SIZE with clGetKernelWorkGroupInfo()?

Can you maybe show us the CL code? Or a small reproducer for that code?


- Tim


Thanks, I just took a look and private memory is reported to be 0 whether building a binary or compiling from kernel source at runtime:

kernel info:
    Maximum work-group size: 256
    Compiler work-group size: (0, 0, 0)
    Local memory size: 32704
    Preferred multiple of work-group size: 8
    Minimum amount of private memory: 0

I'll keep digging and simplifying to see if I can squash this bug.

I'll send a reproducer if I don't see an improvement.



Sounds great. Let me know how it works out.

Can you show us more of the kernel (OpenCL or assembly)? Specifically, I am interested in the structure type for your local memory and how you access it (more than just that line). There are some other cases where we can "spill", but we can almost always tweak the GPU program to fix that.