How many work items can an EU run?

wu__john3851
Beginner

I read in another topic that:

Each Execution Unit (EU) in our integrated graphics has seven hardware threads, and each hardware thread is capable of running 8, 16, or 32 work items, depending on whether the compiler chose to build your kernel SIMD8, SIMD16, or SIMD32.

Does that mean get_global_size() will return a different value depending on how the compiler compiled the kernel (SIMD8, SIMD16, or SIMD32)?

 

Ben_A_Intel
Employee
Hi John,

Both the global size and the local size are determined by the application, so the values returned by get_global_size() and get_local_size() won't change if your kernel is compiled SIMD8, SIMD16, or SIMD32. The only directly observable change is the value returned by get_sub_group_size() or get_max_sub_group_size(). Indirectly, the performance of your kernel may change as well, with a different SIMD size (or equivalently, a different subgroup size).

Here's a presentation I gave with pictures describing how OpenCL workloads execute on our GPUs. It's a bit old, but the concepts are still accurate, even for our newer devices.
https://software.intel.com/sites/default/files/Faster-Better-Pixels-on-the-Go-and-in-the-Cloud-with-OpenCL-on-Intel-Architecture.pdf
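To make the point about which values change more concrete, here is a small Python model (not OpenCL code, and not anything from the driver) of what a 1-D kernel would observe for one work group. The function name `kernel_queries` and the dictionary layout are made up for illustration. It assumes a 1-D NDRange, a global size that is a multiple of the local size, a local size at least as large as the SIMD width, and that subgroups are filled in order so only the last one can be partial.

```python
def kernel_queries(global_size, local_size, simd_width):
    """Model the values a 1-D kernel would observe in one work group.

    Hypothetical helper for illustration only; the keys mirror the
    OpenCL built-ins they stand in for.
    """
    if global_size % local_size != 0:
        raise ValueError("this sketch assumes global size is a multiple of local size")
    if local_size < simd_width:
        raise ValueError("this sketch assumes local size >= SIMD width")

    # Integer ceiling division: subgroups needed to cover the work group.
    num_sub_groups = (local_size + simd_width - 1) // simd_width
    # Assumption: every subgroup is full except possibly the last one.
    last_size = local_size - (num_sub_groups - 1) * simd_width

    return {
        "get_global_size": global_size,        # set by the application
        "get_local_size": local_size,          # set by the application
        "get_max_sub_group_size": simd_width,  # follows the compiled SIMD width
        "get_num_sub_groups": num_sub_groups,
        "sub_group_sizes": [simd_width] * (num_sub_groups - 1) + [last_size],
    }


if __name__ == "__main__":
    for simd in (8, 16, 32):
        print(simd, kernel_queries(960, 40, simd))
```

In this model the global and local sizes come out the same for all three SIMD widths, and only the subgroup-related values differ.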
wu__john3851
Beginner

Hi Ben,

As I understand it, the total global size is fixed, so the total number of work items is fixed, and the work items are mapped to EU threads. The SIMD width determines how many work items can be mapped to one EU thread. So if all the EU threads together cannot run every work item in parallel at once, some work items need to wait until an EU thread finishes earlier work items.

Is my understanding right?

Ben_A_Intel
Employee

wu, john3851 wrote:

The SIMD width determines how many work items can be mapped to one EU thread. So if all the EU threads together cannot run every work item in parallel at once, some work items need to wait until an EU thread finishes earlier work items.

Is my understanding right?

I think so, but here's a slightly re-phrased description, which may or may not help:

- If a work group (determined by the local work size) cannot execute entirely within one EU thread due to the compiled SIMD size (AKA the subgroup size), then the work group will be broken across multiple EU threads, which may or may not run on the same physical EU. Note that work items within a work group are guaranteed to execute concurrently by the OpenCL execution model.

- If the NDRange of work groups (determined by the global work size) exceeds the total number of EU threads available in the system, then some work groups will need to wait for previously launched work groups to complete before they can begin executing.  This is OK, because the OpenCL execution model doesn't guarantee that any work groups execute concurrently.  In practice, of course, on many devices work groups do execute concurrently, for performance reasons.
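The two points above can be turned into rough arithmetic. The Python sketch below is a back-of-the-envelope model only: the name `estimate_schedule`, the 24-EU device in the usage example, and the idea of lockstep "waves" are all assumptions for illustration. The seven threads per EU figure comes from the quote in the original question. Real hardware launches work groups as threads free up, and resource limits such as shared local memory or registers can reduce how many work groups fit at once.

```python
def ceil_div(a, b):
    """Integer ceiling division for positive integers."""
    return (a + b - 1) // b


def estimate_schedule(global_size, local_size, simd_width,
                      num_eus, threads_per_eu=7):
    """Rough model of how an NDRange maps onto EU threads.

    Hypothetical helper: assumes a 1-D NDRange, global size divisible by
    local size, and no limits other than the EU thread count.
    """
    if global_size % local_size != 0:
        raise ValueError("this sketch assumes global size is a multiple of local size")

    # One EU thread runs one subgroup of up to simd_width work items.
    threads_per_group = ceil_div(local_size, simd_width)
    num_groups = global_size // local_size
    total_threads = num_eus * threads_per_eu

    # All work items of a work group must be resident together.
    if threads_per_group > total_threads:
        raise ValueError("work group does not fit on the device in this model")

    resident_groups = total_threads // threads_per_group
    # Simplification: pretend work groups launch in full waves.
    waves = ceil_div(num_groups, resident_groups)

    return {
        "threads_per_group": threads_per_group,
        "num_groups": num_groups,
        "total_threads": total_threads,
        "resident_groups": resident_groups,
        "waves": waves,
    }


if __name__ == "__main__":
    # Assumed device: 24 EUs with 7 threads each.
    for simd in (8, 16, 32):
        print(simd, estimate_schedule(65536, 256, simd, num_eus=24))
```

A wider SIMD size means fewer EU threads per work group, so more work groups fit on the device at once in this model. Whether that translates into better performance depends on the kernel.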
