OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.

Hardware Thread / Work-group / Work-Item Relation on HD Graphics

Mohamed_Amine_BERGAC

Hi,

I have a doubt about the mapping of work-items to hardware threads. My understanding is that each work-item is mapped to one hardware thread, but when I read the Optimization Guide I found this note:

Work-group size of 16 work-items is enough if you do not ask for SLM. Then each work-group maps to each hardware thread.

In this case a work-group is mapped to a hardware thread, so I can assume that all computations in the kernel are scalar. My question is: if I use vector operations, is this mapping still correct? If so, how is it done (I guess the compiler scalarizes all vector operations)?

I'm using OpenCL for Intel HD Graphics, not the CPU.

Thanks in advance,

Mohamed 

Maxim_S_Intel
Employee

Hi,

The computations within a GPU thread are not scalar but SIMDified (typically to a width of 16).

You can find some details on threads/SIMD in this recent presentation: http://software.intel.com/sites/default/files/Faster-Better-Pixels-on-the-Go-and-in-the-Cloud-with-OpenCL-on-Intel-Architecture.pdf

Mohamed_Amine_BERGAC

Hi Maxim

Maxim Shevtsov (Intel) wrote:

the computations within a GPU thread are not scalar, but SIMDified (typically to the width of 16).

What is a GPU thread? Is it a channel or a hardware thread?

Thanks,

Mohamed

Maxim_S_Intel
Employee

Hi,

GPU threads are threads that run on the Execution Units (EUs) of Intel HD Graphics. Multiple threads can be scheduled on an EU (for example, up to 8 threads in the previous generation of Intel GPUs) to prevent the EU from sitting idle (say, due to the latency of a memory request). GPU threads are lightweight and hardware-scheduled.

Mohamed_Amine_BERGAC

Hi,

Ah, OK. The remaining question: suppose I define a computation in my kernel like this: float8 a, b, c, res; a = 11; b = 12; c = 2; res = mad(a, b, c);

and set local_size = 16 and global_size = 64. Will this mad operation be executed as 16 x SIMD8 operations (because it is a hard-coded vector operation)? I don't know whether hard-coding vector operations is well suited to HD Graphics or not.

Thanks,

Mohamed

Maxim_S_Intel
Employee

The compiler always tries to vectorize to the widest SIMD it can, typically 16, so 16 work-items are packed into 16-wide SIMD lock-steps.

Thus, if your kernel operates on float8, the code will be executed as 8 x SIMD16. From the kernel's perspective, the original vector expressions are executed in transposed fashion (via an Array-of-Structures-to-Structure-of-Arrays transformation, which is more SIMD-friendly). Notice that a local size of 16 is the bare minimum that gives the compiler room for this trick.

Mohamed_Amine_BERGAC

Thank you so much for this explanation; I now have a much clearer idea of how my kernel is executed.

Mohamed
