Hi,
I have a question about how work-items map to hardware threads. My understanding is that each work-item is mapped to one hardware thread, but when I read the Optimization Guide I found this note:
Work-group size of 16 work-items is enough if you do not ask for SLM. Then each work-group maps to each hardware thread.
In this case a whole work-group is mapped to a single hardware thread. I can see how that works if all computations in the kernel are scalar, so my question is: if I use vector operations, does this mapping still hold? If yes, how is it done (I guess the compiler scalarizes all vector operations)?
I'm using OpenCL on Intel HD Graphics, not the CPU.
Thanks in advance,
Mohamed
Hi,
the computations within a GPU thread are not scalar, but SIMDified (typically to a width of 16).
You can find some details on threads/SIMD in the recent presentation: http://software.intel.com/sites/default/files/Faster-Better-Pixels-on-the-Go-and-in-the-Cloud-with-OpenCL-on-Intel-Architecture.pdf
Hi Maxim,
Maxim Shevtsov (Intel) wrote:
the computations within a GPU thread are not scalar, but SIMDified (typically to the width of 16).
What is a GPU thread? Is it a channel or a hardware thread?
Thanks,
Mohamed
Hi,
GPU threads are threads that run on the Execution Units (EUs) of Intel HD Graphics. Multiple threads can be scheduled on an EU (for example, up to 8 threads in the previous generation of Intel GPUs) to prevent the EU from sitting idle (say, due to the latency of a memory request). GPU threads are lightweight and hardware-scheduled.
Hi,
Ah, OK. The remaining question: if I define a computation in my kernel like this: float8 a, b, c, res; a = 11; b = 12; c = 2; res = mad(a, b, c);
and I set local_size = 16 and global_size = 64, will this mad operation be executed as 16 x SIMD8 operations (because it is a hard-coded vector operation)? I don't know whether hard-coding vector operations is well suited for HD Graphics or not.
Thanks,
Mohamed
The compiler always tries to vectorize to the widest SIMD it can, typically 16, so 16 work-items will be packed into 16-wide SIMD lock-step.
Thus, if your kernel operates on float8, the code will be executed as 8 x SIMD16. From the kernel's perspective, the original vector expressions are executed in a transposed fashion (via an Array-of-Structures-to-Structure-of-Arrays transformation, which is more SIMD friendly). Notice that a local size of 16 is the bare minimum that gives the compiler room for this trick.
Thank you so much for this explanation; I now have a clearer idea of how my kernel is executed.
Mohamed