OpenCL* for CPU

Why is SIMD width determined by work group size?

Fu_J_Intel
Employee

When the workgroup size is set to one, I noticed that the compiler also sets the SIMD width to 1. Why is that the case? I mean, is it possible to put work items of DIFFERENT workgroups into one SIMD operation, such that lane 0 is for workgroup 0, lane 1 is for workgroup 1, etc.?

Thanks,

Jeffrey

3 Replies
Robert_I_Intel
Employee

Jeffrey,

A workgroup can map to a single hardware thread, span multiple hardware threads, or even span EUs, but we don't put two workgroups on the same hardware thread. So the minimum recommended work group size is 8 (or (4, 2), (2, 4), (8, 1), or (1, 8) in the 2D case; you can figure out the 3D case yourself :) ).
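
For readers who want the 2D and 3D shapes spelled out, here is a minimal sketch in plain Python (not an Intel or OpenCL API). The helper name `local_sizes` is made up for illustration, and the total of 8 is simply the minimum recommended size quoted above.

```python
from itertools import product
from math import prod


def local_sizes(total, dims):
    """List every local work size with `dims` dimensions whose
    dimensions multiply to `total` work items (hypothetical helper)."""
    divisors = [d for d in range(1, total + 1) if total % d == 0]
    return [shape for shape in product(divisors, repeat=dims)
            if prod(shape) == total]


# 2D shapes holding exactly 8 work items
print(local_sizes(8, 2))
# 3D shapes holding exactly 8 work items
print(local_sizes(8, 3))
```

The 2D call lists the same four shapes named in the reply; the 3D call lists every ordered triple whose product is 8.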

Fu_J_Intel
Employee

Hi Robert,

- By "we don't put two workgroups on the same hardware thread", did you mean "we don't put two workgroups on the same SIMD"?

- In modulate_v2_uchar16(), if workgroup_size=32, then each hardware thread processes (32*16) work items, and they come from 16 DIFFERENT work groups. Am I right?

 

Robert_I_Intel
Employee

One hardware thread can execute 8 work items (SIMD8), 16 work items (SIMD16), or 32 work items (SIMD32). If you have a work group of size 32 and your code was compiled SIMD8, your work group will span 4 hardware threads. If your code was compiled SIMD16, your work group will span 2 hardware threads. If you built your code SIMD32, a single hardware thread is enough to fit the whole work group.

What you cannot do is build your code SIMD32, use a work group of size 16, and expect two work groups to run on the same hardware thread: that doesn't work.
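
The arithmetic above can be sketched in a few lines of plain Python. This is only a model of the rule "one work group never shares a hardware thread with another"; the function names are hypothetical, and the idle-lane count is an inference from that rule rather than something measured on hardware.

```python
def threads_per_workgroup(workgroup_size, simd_width):
    """Hardware threads needed for one work group (ceiling division)."""
    return -(-workgroup_size // simd_width)


def idle_lanes(workgroup_size, simd_width):
    """SIMD lanes left without a work item, assuming no other
    work group may use the same hardware thread."""
    threads = threads_per_workgroup(workgroup_size, simd_width)
    return threads * simd_width - workgroup_size


for simd in (8, 16, 32):
    print(f"work group 32, SIMD{simd}: "
          f"{threads_per_workgroup(32, simd)} thread(s)")

# Work group of 16 built SIMD32: still one whole thread, half of it unused
print(threads_per_workgroup(16, 32), idle_lanes(16, 32))
```

Under this model, a work group of 16 compiled SIMD32 still takes one full hardware thread, and the other 16 lanes are not handed to a second work group.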

In modulate_v2_uchar16, your code is built SIMD32, so there will be 32 work items on one hardware thread, each work item processing 16 pixels. Don't confuse pixels with work items: one work item occupies one SIMD lane but processes 16 pixels, so a hardware thread holds 32 work items, not 32*16.
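
To keep the two counts apart, here is a small illustrative calculation in plain Python. The helper name `per_thread_counts` is made up, and the figure of 16 pixels per work item is taken from the reply above.

```python
def per_thread_counts(simd_width, pixels_per_work_item):
    """Return (work items, pixels) handled by one hardware thread.

    One work item occupies one SIMD lane, so the work-item count equals
    the SIMD width; the pixel count scales with how many pixels each
    work item processes."""
    work_items = simd_width
    pixels = simd_width * pixels_per_work_item
    return work_items, pixels


work_items, pixels = per_thread_counts(simd_width=32, pixels_per_work_item=16)
print(work_items, pixels)  # 32 work items, 512 pixels
```

So the 32*16 figure from the question counts pixels per hardware thread, while the number of work items per hardware thread stays at 32.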
