Solved: Hi Ben,

wu__john3851 · ‎03-14-2018

I read in another topic that:

Each Execution Unit (EU) in our integrated graphics has seven hardware threads, each hardware thread is capable of running 8, 16, or 32 work items depending on whether compiler chose to build your kernel SIMD8, SIMD16 or SIMD32.

Does that mean that when I call get_global_size it will return a different value depending on how the compiler compiled the kernel (SIMD8, SIMD16, or SIMD32)?

Ben_A_Intel · ‎03-16-2018

wu, john3851 wrote:

The SIMD width determines how many work items can be mapped to an EU thread. So if the available EU threads cannot run all the work items in parallel at once, some work items need to wait until an EU thread finishes its earlier work items.

Is my understanding right?

I think so, but here's a slightly re-phrased description, which may or may not help:

- If a work group (determined by the local work size) cannot execute entirely within one EU thread due to the compiled SIMD size (AKA the subgroup size), then the work group will be broken across multiple EU threads, which may or may not run on the same physical EU. Note that work items within a work group are guaranteed to execute concurrently by the OpenCL execution model.

- If the NDRange of work groups (determined by the global work size) exceeds the total number of EU threads available in the system, then some work groups will need to wait for previously launched work groups to complete before they can begin executing. This is OK, because the OpenCL execution model doesn't guarantee that any work groups execute concurrently. In practice, of course, on many devices work groups do execute concurrently, for performance reasons.
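The two points above boil down to some simple arithmetic, which can be sketched in Python. This is only a rough model, not how the driver actually schedules work: it assumes a one-dimensional NDRange, assumes that thread count is the only limit on how many work groups are resident at once (real hardware also has shared local memory and barrier limits), and uses a made-up device with 24 EUs. The function names `eu_threads_per_work_group` and `scheduling_estimate` are hypothetical helpers, not part of any OpenCL API.

```python
import math


def eu_threads_per_work_group(local_size, simd_width):
    """EU threads needed to hold one work group at a given SIMD width."""
    return math.ceil(local_size / simd_width)


def scheduling_estimate(global_size, local_size, simd_width,
                        num_eus, threads_per_eu=7):
    """Rough estimate of how many 'waves' of work groups an NDRange needs.

    Assumption: only the EU thread count limits residency. num_eus is a
    hypothetical device size; threads_per_eu=7 comes from the quoted post.
    """
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")

    work_groups = global_size // local_size
    threads_per_group = eu_threads_per_work_group(local_size, simd_width)
    total_threads = num_eus * threads_per_eu

    # A work group must be fully resident, so round down.
    resident_groups = total_threads // threads_per_group
    if resident_groups == 0:
        raise ValueError("one work group does not fit on the device")

    return {
        "work_groups": work_groups,
        "threads_per_group": threads_per_group,
        "resident_groups": resident_groups,
        "waves": math.ceil(work_groups / resident_groups),
    }


if __name__ == "__main__":
    for simd in (8, 16, 32):
        print(simd, scheduling_estimate(65536, 256, simd, num_eus=24))
```

In this model the number of work groups stays the same for every SIMD width; only the number of EU threads each work group occupies, and therefore how many work groups fit on the device at once, changes.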


Ben_A_Intel · ‎03-14-2018

Hi John,

Both the global size and the local size are determined by the application, so the values returned by get_global_size() and get_local_size() won't change if your kernel is compiled SIMD8, SIMD16, or SIMD32. The only directly observable change is the value returned by get_sub_group_size() or get_max_sub_group_size(). Indirectly, the performance of your kernel may change as well, with a different SIMD size (or equivalently, a different subgroup size).

Here's a presentation I gave with pictures describing how OpenCL workloads execute on our GPUs. It's a bit old, but the concepts are still accurate, even for our newer devices.

https://software.intel.com/sites/default/files/Faster-Better-Pixels-on-the-Go-and-in-the-Cloud-with-OpenCL-on-Intel-Architecture.pdf
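As a small illustration of which values are fixed by the application and which follow the compiled SIMD size, here is a Python sketch that mimics what the corresponding OpenCL built-ins would report for a one-dimensional kernel. It is not real OpenCL code; `work_item_view` is a hypothetical helper, and the sizes in the example are made up.

```python
def work_item_view(global_size, local_size, simd_width):
    """Model of what a 1D kernel would observe at a given SIMD width.

    global_size and local_size are chosen by the application at enqueue
    time; simd_width stands in for the compiler's SIMD8/16/32 choice.
    """
    # Ceiling division: the last subgroup may be only partially filled.
    num_sub_groups = -(-local_size // simd_width)
    return {
        "get_global_size": global_size,        # set by the application
        "get_local_size": local_size,          # set by the application
        "get_max_sub_group_size": simd_width,  # follows the SIMD size
        "get_num_sub_groups": num_sub_groups,  # follows the SIMD size
    }


if __name__ == "__main__":
    for simd in (8, 16, 32):
        print(f"SIMD{simd}:", work_item_view(1024, 64, simd))
```

Across the three SIMD widths, only the subgroup-related entries differ; the global and local sizes are identical.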

wu__john3851 · ‎03-14-2018

Hi Ben,

As I understand it, the total global size is fixed, so the total work item count is fixed, and the work items are mapped to EU threads. The SIMD width determines how many work items can be mapped to an EU thread. So if the available EU threads cannot run all the work items in parallel at once, some work items need to wait until an EU thread finishes its earlier work items.

Is my understanding right?

wu__john3851 · ‎03-15-2018