OpenCL* for CPU

Mapping of OpenCL work groups to EUs

rahul_garg
Beginner
I wanted to clarify how work groups are mapped to HD 4000 hardware. My understanding is that a workgroup maps to a single EU, and each EU can run multiple workgroups in parallel.

Is this correct?
Raghupathi_M_Intel
A work group is executed within a half-slice (a collection of EUs), and multiple work groups can be executed on the same half-slice. So your assumption may or may not be correct.

Work items within a work group are distributed in this order:
- first packed into the SIMD unit
- then spread across EUs
- then spread across threads

Hope that makes sense.

For more information, please attend our webinar on Writing Efficient Code for OpenCL Applications on 3rd Generation Intel Core Processors.
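The ordering above can be made concrete with a small sketch. This is illustrative only: the SIMD width and EU count below are assumptions chosen for the example, not queried hardware values, and the real distribution depends on how the compiler vectorizes the kernel (see Ben's reply later in this thread about SIMD16 compilation).

```python
# Illustrative model of the ordering Raghu describes: work items fill SIMD
# lanes first, then spread across EUs, then across hardware threads.
# SIMD_WIDTH and NUM_EUS are assumptions for the example, not real values.
SIMD_WIDTH = 4
NUM_EUS = 4

def place(work_item):
    """Map a work-item index to (simd_lane, eu, hw_thread) under this model."""
    lane = work_item % SIMD_WIDTH                 # 1. packed into the SIMD unit
    eu = (work_item // SIMD_WIDTH) % NUM_EUS      # 2. spread across EUs
    thread = work_item // (SIMD_WIDTH * NUM_EUS)  # 3. spread across threads
    return lane, eu, thread

# Work items 0-3 share one SIMD issue on EU 0; work item 4 starts on EU 1;
# work item 16 wraps back to EU 0 on the next hardware thread.
print(place(0), place(3), place(4), place(16))
```

Under these assumed widths, the model reproduces the per-EU grouping proposed in Biao's table below (W0-W3 on EU0, W4-W7 on EU1, and W16-W19 back on EU0 in the next thread).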

Thanks,
Raghu
rahul_garg
Beginner
Interesting. It is not 100% clear to me right now, but I will look at the docs again, attend the webinar, and come back to this question later if it is still not clear :)
Biao_W_
Beginner


Hi, there is still some confusion in your explanation. What do you mean by across "threads"? I assume the threads you mean here are Intel hardware threads (analogous to a warp or wavefront for NVIDIA and AMD, respectively), into which 16 work items are packed. For clarity, I propose the following example; please correct me:

Suppose global size = 64, local size = 16, a GT2 architecture with one slice, and each EU has two 4-wide SIMD units (pipelines). Since the second pipeline has some limitations according to https://software.intel.com/sites/default/files/Faster-Better-Pixels-on-the-Go-and-in-the-Cloud-with-OpenCL-on-Intel-Architecture.pdf, assume only the first SIMD unit works.

Will these 64 work items then be distributed as follows?

EU0_PIPELINE0: (W0, W1, W2, W3), (W16, W17, W18, W19), (W32, W33, W34, W35), (W48, W49, W50, W51)

EU1_PIPELINE0: (W4, W5, W6, W7), (W20, W21, W22, W23), (W36, W37, W38, W39), (W52, W53, W54, W55)

EU2_PIPELINE0: (W8, W9, W10, W11), (W24, W25, W26, W27), (W40, W41, W42, W43), (W56, W57, W58, W59)

EU3_PIPELINE0: (W12, W13, W14, W15), (W28, W29, W30, W31), (W44, W45, W46, W47), (W60, W61, W62, W63)

Ben_A_Intel
Employee

Hi Biao,

There were some diagrams and animations in the Webinar presentation that didn't translate very well to the PDF. If you haven't seen it already, you may want to watch the Webinar recording. This is discussed in detail starting around 9 minutes into Part 1.

To answer your specific question, you need to know how your kernel was compiled. You can figure this out by querying CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE. Let's say that your kernel was compiled SIMD16.

In your example you have global size = 64 and local size = 16.  Since your kernel was compiled SIMD16 and your local size is 16, you require one EU thread per work group.  Since your global size is 64 and your local size is 16, you require four work groups.  At one EU thread per work group, you will require a total of four EU threads to complete your work.  Since GT2 has more than four EUs, each of your four EU threads will launch onto a different EU, and the remaining EUs will be idle or available to run subsequent work.
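The accounting in the paragraph above can be sketched as follows. This is only the arithmetic: in real host code the SIMD width would come from querying CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE with clGetKernelWorkGroupInfo; 16 is assumed here to match the example.

```python
import math

def eu_threads_needed(global_size, local_size, simd_width):
    """Count the EU threads a 1D NDRange needs, per the accounting above.

    simd_width stands in for CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
    which real host code would query via clGetKernelWorkGroupInfo.
    """
    work_groups = global_size // local_size                 # 64 / 16 = 4 work groups
    threads_per_group = math.ceil(local_size / simd_width)  # SIMD16 -> 1 thread per group
    return work_groups * threads_per_group

print(eu_threads_needed(64, 16, 16))  # 4 EU threads, one per work group
```

With fewer than the EU count's worth of threads (GT2 has more than four EUs), each thread can land on its own EU, which is why the remaining EUs stay idle in this example.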

Hope this helps!

  -- Ben

Biao_W_
Beginner


Hi Ben,

After seeing your presentation in the webinar, it is much clearer now. Really nice job.

I have now also profiled my kernel using the VTune profiler, a very nice tool. However, I would recommend that Intel put more effort into the profiler. One key question it does not indicate is whether the kernel is compute bound or memory bound. This is the same question raised to Julia at the end of her presentation "Analyzing OpenCL applications with Intel® VTune™ Amplifier XE", but it was not answered clearly.
