When the workgroup size is set to one, I noticed the compiler also sets the SIMD width to 1. Why is that the case? I mean, is it possible to put work items of DIFFERENT workgroups into one SIMD operation, such that lane 0 is for workgroup 0, lane 1 is for workgroup 1, etc.?
A workgroup can map to a single hardware thread, span multiple threads, or even span EUs, but we don't put two workgroups on the same hardware thread, so the minimum recommended work group size is 8 (or (4, 2), (2, 4), (8, 1) or (1, 8) in the 2D case; you can figure out the 3D case yourself :) ).
- by "we don't put two workgroups on the same hardware thread", did you mean "we don't put two workgroups on the same SIMD"?
- In modulate_v2_uchar16(), if workgroup_size=32, then each hardware thread processes (32*16) work items. They are from 16 DIFFERENT work groups. Am I right?
One hardware thread can execute 8 work items (SIMD8), 16 work items (SIMD16) or 32 work items (SIMD32). If you have a work group of size 32 and your code was compiled SIMD8, your work group will span 4 hardware threads. If your code was compiled SIMD16, it will span 2 hardware threads, and if you built your code SIMD32, a single thread is enough to fit the whole work group.
What you cannot do is build your code SIMD32, use a work group of size 16, and expect two work groups to run on the same thread: that doesn't work.
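To make the arithmetic above concrete, here is a minimal sketch in Python. The function names are made up for illustration and are not part of any OpenCL API. It assumes, as described above, that a work group occupies a whole number of hardware threads and never shares a thread with another work group, so lanes left over in the last thread are counted as unused.

```python
def threads_per_workgroup(workgroup_size, simd_width):
    """Hardware threads needed for one work group (hypothetical helper)."""
    if workgroup_size < 1 or simd_width < 1:
        raise ValueError("sizes must be positive")
    # Ceiling division: a partially filled thread still counts as a full thread.
    return (workgroup_size + simd_width - 1) // simd_width


def unused_lanes(workgroup_size, simd_width):
    """SIMD lanes not occupied by work items of this work group (hypothetical helper)."""
    threads = threads_per_workgroup(workgroup_size, simd_width)
    return threads * simd_width - workgroup_size


if __name__ == "__main__":
    for simd in (8, 16, 32):
        print(f"work group 32, SIMD{simd}: {threads_per_workgroup(32, simd)} thread(s)")
    # Work group of 16 built SIMD32: one thread, and the other 16 lanes
    # are not filled by a second work group.
    print("work group 16, SIMD32:", unused_lanes(16, 32), "unused lanes")
```

Under these assumptions, a work group of 32 needs 4, 2 and 1 threads for SIMD8, SIMD16 and SIMD32 respectively, matching the numbers quoted above.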
In modulate_v2_uchar16, your code is built SIMD32, so there will be 32 work items on one hardware thread, each work item processing 16 pixels. Don't confuse pixels with work items. One work item occupies one SIMD lane but processes 16 pixels. So with a work group size of 32, a hardware thread holds 32 work items, all from the same work group, and together they cover 32*16 pixels.
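The work item versus pixel distinction can be written out as a short calculation. This is only an illustrative sketch: the helper name is hypothetical, and the figure of 16 pixels per work item is taken from the uchar16 kernel discussed above.

```python
def pixels_per_hardware_thread(simd_width, pixels_per_work_item):
    """Pixels covered by one hardware thread (hypothetical helper).

    Each SIMD lane holds exactly one work item, so the number of work items
    on the thread equals simd_width; the pixel count is a separate multiplier.
    """
    work_items = simd_width
    return work_items * pixels_per_work_item


if __name__ == "__main__":
    # SIMD32 kernel where every work item handles one uchar16 (16 pixels).
    print("work items per thread:", 32)
    print("pixels per thread:", pixels_per_hardware_thread(32, 16))
```

The 32*16 figure is therefore a pixel count, not a work item count: the thread still holds only 32 work items.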