OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.

Hardware Thread / Work-group / Work-Item Relation on HD Graphics

Mohamed_Amine_BERGAC

Hi,

I have a doubt about the mapping of work-items to hardware threads. My understanding is that each work-item is mapped to one hardware thread, but when I read the Optimization Guide I found this note:

Work-group size of 16 work-items is enough if you do not ask for SLM. Then each work-group maps to each hardware thread.

In this case a work-group is mapped to a hardware thread, so I can assume that all computations in the kernel are scalar. My question is: if I use vector operations, is this mapping still correct? If so, how is it done (I guess the compiler scalarizes all vector operations)?

I'm using OpenCL for Intel HD Graphics, not the CPU.

Thanks in advance,

Mohamed 

Maxim_S_Intel
Employee

Hi,

The computations within a GPU thread are not scalar but SIMDified (typically to a width of 16).

You can find some details on threads/SIMD in this recent presentation: http://software.intel.com/sites/default/files/Faster-Better-Pixels-on-the-Go-and-in-the-Cloud-with-OpenCL-on-Intel-Architecture.pdf

Mohamed_Amine_BERGAC

Hi Maxim

Maxim Shevtsov (Intel) wrote:

the computations within a GPU thread are not scalar, but SIMDified (typically to the width of 16).

What is a GPU thread? Is it a channel or a hardware thread?

Thanks,

Mohamed

Maxim_S_Intel
Employee

Hi,

GPU threads are threads that run on the Execution Units (EUs) of Intel HD Graphics. Multiple threads can be scheduled on an EU (for example, up to 8 threads in the previous generation of Intel GPUs) to prevent the EU from sitting idle (say, due to the latency of a memory request). GPU threads are lightweight and hardware-scheduled.

Mohamed_Amine_BERGAC

Hi,

Ah, OK. The remaining question: suppose I define a computation in my kernel like this: float8 a, b, c, res; a = 11; b = 12; c = 2; res = mad(a, b, c);

and set local_size = 16 and global_size = 64. Will this mad operation be executed as 16 x SIMD8 operations (because it is a hard-coded vector operation)? I don't know whether hard-coding vector operations is well suited to HD Graphics or not.

Thanks,

Mohamed

Maxim_S_Intel
Employee

The compiler always tries to vectorize to the widest SIMD it can, typically 16, so 16 work-items are packed into 16-wide SIMD lock-steps.

Thus, if your kernel operates on float8, the code will be executed as 8 x SIMD16. From the kernel's perspective, the original vector expressions are executed in transposed fashion (via an Array-of-Structures-to-Structure-of-Arrays transformation, which is more SIMD-friendly). Notice that a local size of 16 is the bare minimum that gives the compiler room for this trick.

Mohamed_Amine_BERGAC

Thank you so much for this explanation; I now have a much clearer idea of how my kernel is executed.

Mohamed
