I have a kernel that is launched with a local size of 16 and a global size of 256, and each workgroup allocates 32 KB of shared local memory.
I ran the kernel on an Ivy Bridge HD 4000 and the GPU idle state accounted for 75%, which is what I expected. Each half-slice has 64 KB of shared memory, so only two workgroups (64 KB / 32 KB) can be launched per half-slice. Each workgroup is scheduled on one EU, so at most two EUs are active per half-slice, which gives an idle fraction of 1 - (2 active EUs)/(8 EUs per half-slice) = 0.75.
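To make the arithmetic above easy to check, here is a minimal sketch of the occupancy model I am assuming. The function name and parameters are mine and purely illustrative; the model assumes one workgroup per EU and that shared memory is the only limit on resident workgroups.

```python
def expected_idle_fraction(slm_per_half_slice_kb, slm_per_workgroup_kb, eus_per_half_slice):
    """Idle EU fraction for one half-slice, assuming one workgroup per EU
    and shared memory as the only limit on resident workgroups."""
    # Number of workgroups whose shared memory fits in the half-slice at once.
    resident_workgroups = slm_per_half_slice_kb // slm_per_workgroup_kb
    # A workgroup occupies one EU, so active EUs cannot exceed the EU count.
    active_eus = min(resident_workgroups, eus_per_half_slice)
    return 1 - active_eus / eus_per_half_slice


# Ivy Bridge HD 4000 numbers from this post: 64 KB per half-slice,
# 32 KB per workgroup, 8 EUs per half-slice.
ivb_idle = expected_idle_fraction(64, 32, 8)
print(f"{ivb_idle:.0%}")  # 75%
```

This matches the 75% idle figure I measured on the HD 4000.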
However, when I run the same code on a Haswell HD 4600 GPU, the idle state is only 20%. The HD 4600 has 20 EUs, 10 per half-slice, and each workgroup is scheduled on one EU, so 4 EUs are idle (20% of 20) and 16 are active. This indicates that all 16 hardware threads (or workgroups, 256/16) are launched and are no longer constrained by shared memory.
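The same reasoning for Haswell, worked backwards from the measured idle figure, can be sketched as follows. The variable names are illustrative, and the "shared-memory-limited" prediction assumes the HD 4600 also had 64 KB per half-slice, which is only my assumption carried over from Ivy Bridge.

```python
# Numbers from this post for the Haswell HD 4600.
total_eus = 20
half_slices = 2
measured_idle = 0.20

# Launch configuration: global size 256, local size 16.
workgroups = 256 // 16

# Work backwards from the measurement to the number of busy EUs.
idle_eus = round(total_eus * measured_idle)
active_eus = total_eus - idle_eus

# What I would expect IF the 64 KB per half-slice limit applied
# (assumption, not a confirmed Haswell figure): 2 workgroups per half-slice.
slm_limited_active = half_slices * (64 // 32)
slm_limited_idle = 1 - slm_limited_active / total_eus

print(workgroups, idle_eus, active_eus)  # 16 4 16
print(f"{slm_limited_idle:.0%}")         # 80%
```

So the measurement implies 16 active EUs, one per workgroup, whereas the shared-memory-limited model would predict only 4 active EUs and roughly 80% idle.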
So my question is: what changed in Haswell that allows it to launch workgroups without being constrained by their shared memory usage?