Is this correct?
The order in which work items within a work group get distributed is:
- SIMD unit
- spread across EUs
- spread across threads
Hope that makes sense.
For more information, please attend our webinar about Writing Efficient Code for OpenCL Applications on 3rd Generation Intel Core Processors.
Thanks,
Raghu
Raghu Muthyalampalli (Intel) wrote:
A workgroup is executed within a half-slice (a collection of EUs). Multiple workgroups can be executed on the same half-slice. So your assumption may or may not be correct.
The order in which work items within a work group get distributed is:
- SIMD unit
- spread across EUs
- spread across threads
Hope that makes sense.
For more information, please attend our webinar about Writing Efficient Code for OpenCL Applications on 3rd Generation Intel Core Processors.
Thanks,
Raghu
Hi, there is still some confusion in your explanation. What do you mean by spread across "threads"? I assume the thread you mean here is an Intel hardware thread (like a warp or wavefront for NVIDIA and AMD, respectively), in which 16 work items are packed. To make this concrete, I propose the following example; please clarify it for me:
Suppose global size = 64, local size = 16, on a GT2 architecture with one slice, where each EU has two 4-wide SIMD units (pipelines). Since the second pipeline has some limitations according to https://software.intel.com/sites/default/files/Faster-Better-Pixels-on-the-Go-and-in-the-Cloud-with-OpenCL-on-Intel-Architecture.pdf, assume only the first SIMD unit does the work.
Would these 64 work items then be distributed as follows? (A minimal launch sketch for this configuration follows the table.)
EU0_PIPELINE0: (W0, W1, W2, W3), (W16, W17, W18, W19), (W32, W33, W34, W35), (W48, W49, W50, W51)
EU1_PIPELINE0: (W4, W5, W6, W7), (W20, W21, W22, W23), (W36, W37, W38, W39), (W52, W53, W54, W55)
EU2_PIPELINE0: (W8, W9, W10, W11), (W24, W25, W26, W27), (W40, W41, W42, W43), (W56, W57, W58, W59)
EU3_PIPELINE0: (W12, W13, W14, W15), (W28, W29, W30, W31), (W44, W45, W46, W47), (W60, W61, W62, W63)
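For reference, a minimal host-side sketch of this launch configuration (error handling trimmed; the `queue` and `kernel` handles and the `launch_example` wrapper name are only assumptions for illustration):

```c
#include <CL/cl.h>
#include <stdio.h>

/* Enqueue the example dispatch: global size 64, local size 16. */
cl_int launch_example(cl_command_queue queue, cl_kernel kernel)
{
    const size_t global_size = 64;  /* 64 work items in total       */
    const size_t local_size  = 16;  /* 16 work items per work group */

    /* 64 / 16 = 4 work groups; how those work groups map onto EUs and
       pipelines is decided by the hardware scheduler, which is exactly
       the question in this thread. */
    cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                        1,             /* work_dim           */
                                        NULL,          /* global work offset */
                                        &global_size,  /* global work size   */
                                        &local_size,   /* local work size    */
                                        0, NULL, NULL);
    if (err != CL_SUCCESS)
        fprintf(stderr, "clEnqueueNDRangeKernel failed: %d\n", err);
    return err;
}
```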
Hi Biao,
There were some diagrams and animations in the Webinar presentation that didn't translate very well to the PDF. If you haven't seen it already, you may want to watch the Webinar recording. This is discussed in detail starting around 9 minutes into Part 1.
To answer your specific question, you need to know how your kernel was compiled. You can figure this out by querying CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE. Let's say that your kernel was compiled SIMD16.
In your example you have global size = 64 and local size = 16. Since your kernel was compiled SIMD16 and your local size is 16, you require one EU thread per work group. Since your global size is 64 and your local size is 16, you require four work groups. At one EU thread per work group, you will require a total of four EU threads to complete your work. Since GT2 has more than four EUs, each of your four EU threads will launch onto a different EU, and the remaining EUs will be idle or available to run subsequent work.
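A minimal sketch of that query, applied to the example numbers above (the `kernel` and `device` handles and the helper name are assumptions for illustration; the rounding mirrors the one-EU-thread-per-work-group arithmetic described here):

```c
#include <CL/cl.h>
#include <stdio.h>

/* Query how the kernel was compiled (SIMD8/16/32 shows up as the
   preferred work-group size multiple) and estimate the EU threads
   needed for the example dispatch (global 64, local 16). */
void print_eu_thread_estimate(cl_kernel kernel, cl_device_id device)
{
    size_t simd_width = 0;
    cl_int err = clGetKernelWorkGroupInfo(
        kernel, device, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
        sizeof(simd_width), &simd_width, NULL);
    if (err != CL_SUCCESS || simd_width == 0) {
        fprintf(stderr, "clGetKernelWorkGroupInfo failed: %d\n", err);
        return;
    }

    const size_t global_size = 64;  /* from the example above */
    const size_t local_size  = 16;

    size_t work_groups       = global_size / local_size;                  /* 4 */
    size_t threads_per_group = (local_size + simd_width - 1) / simd_width;
    size_t total_eu_threads  = work_groups * threads_per_group;

    printf("Compiled SIMD%zu: %zu work groups, %zu EU thread(s) per group, "
           "%zu EU thread(s) total\n",
           simd_width, work_groups, threads_per_group, total_eu_threads);
}
```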
Hope this helps!
-- Ben
Hi, Ben:
After watching your presentation in the webinar, it is much clearer now. Really nice job.
I have also profiled my kernel using the VTune profiler, which is a very nice tool. However, I would recommend that Intel put more effort into the profiler. One key question it does not indicate is whether the kernel is compute bound or memory bound. This is the same question raised to Julia at the end of her presentation "Analyzing OpenCL applications with Intel® VTune™ Amplifier XE", but it was not answered clearly.
