Hardware Thread / Work-Group / Work-Item relation on HD Graphics

Mohamed_Amine_BERGAC — Fri, 27 Sep 2013 13:45:29 GMT

Hi,

I have a doubt about how work-items map to hardware threads. My understanding was that each work-item is mapped to one hardware thread, but when I read the Optimization Guide I found this note:

Work-group size of 16 work-items is enough if you do not ask for SLM. Then each work-group maps to each hardware thread.

In this case a work-group is mapped to a hardware thread, so I could assume that all computations in the kernel are scalar. My question is: if I use vector operations, is this mapping still correct? If yes, how is it done (I guess the compiler scalarizes all vector operations)?

I'm using OpenCL on Intel HD Graphics, not on the CPU.

Thanks in advance,

Mohamed

Maxim_S_Intel — Fri, 27 Sep 2013 15:52:52 GMT

Hi,

The computations within a GPU thread are not scalar, but SIMDified (typically to a width of 16).

You can find some details on threads/SIMD in this recent presentation: http://software.intel.com/sites/default/files/Faster-Better-Pixels-on-the-Go-and-in-the-Cloud-with-OpenCL-on-Intel-Architecture.pdf

Mohamed_Amine_BERGAC — Mon, 30 Sep 2013 07:42:33 GMT

Hi Maxim

Maxim Shevtsov (Intel) wrote:

the computations within a GPU thread are not scalar, but SIMDified (typically to the width of 16).

What is a GPU thread? Is it a channel or a hardware thread?

Thanks,

Mohamed

Maxim_S_Intel — Mon, 30 Sep 2013 10:27:01 GMT

Hi,

GPU threads are threads that run on the Execution Units (EUs) of Intel HD Graphics. Multiple threads can be scheduled on an EU (for example, up to 8 threads in the previous generation of Intel GPUs) to prevent the EU from sitting idle (say, due to the latency of a memory request). GPU threads are lightweight and hardware-scheduled.

Mohamed_Amine_BERGAC — Mon, 30 Sep 2013 10:38:46 GMT

Hi,

Ah, OK. The remaining question is what happens if I define a computation like this in my kernel: float8 a, b, c, res; a = 11; b = 12; c = 2; res = mad(a, b, c);

If I set Local_size=16 and Global_Size=64, will this MAD operation be executed as 16 x SIMD8 operations (because it is a hard-coded vector operation)? I don't know whether hard-coding vector operations is well suited to HD Graphics or not.

Thanks,

Mohamed

Maxim_S_Intel — Mon, 30 Sep 2013 12:03:00 GMT

The compiler always tries to vectorize to the widest SIMD it can, typically 16, so 16 work-items are packed into the lanes of a 16-wide SIMD thread and execute in lock-step.

Thus, if your kernel operates on float8, the code will be executed as 8 x SIMD16: each of the 8 vector components becomes one 16-wide SIMD operation across the 16 work-items. From the kernel's perspective, the original vector expressions are executed in a transposed fashion (by means of an Array-of-Structures to Structure-of-Arrays transformation, which is more SIMD-friendly). Notice that a local size of 16 is the bare minimum that gives the compiler room for this trick.

Mohamed_Amine_BERGAC — Mon, 30 Sep 2013 12:48:55 GMT

Thank you so much for this explanation; now I have a clearer idea of how my kernel is executed.

Mohamed