As far as I know this is still true. You can check with a clGetDeviceInfo CL_DEVICE_LOCAL_MEM_SIZE query. The amount of shared local memory per subslice is still 64 KB on my Gen9 (Skylake/6th Generation Core) processor. Gen9 continues to support 16 active work groups per subslice.
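For reference, a minimal host-side sketch of that query (assumes you already have a cl_device_id in hand; the function name is just illustrative):

    #include <CL/cl.h>
    #include <stdio.h>

    /* Print the device's shared local memory size. */
    void print_local_mem_size(cl_device_id device)
    {
        cl_ulong local_mem_size = 0;
        cl_int err = clGetDeviceInfo(device,
                                     CL_DEVICE_LOCAL_MEM_SIZE,
                                     sizeof(local_mem_size),
                                     &local_mem_size,
                                     NULL);
        if (err == CL_SUCCESS)
            printf("CL_DEVICE_LOCAL_MEM_SIZE = %llu bytes\n",
                   (unsigned long long)local_mem_size);
    }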
I'm observing no performance difference between launching 8-item work groups (subgroup == work group) that require only 512 bytes of SLM and launching work groups of at least 64 items that require 8 x 512 bytes (4 KB) of SLM.
No barriers are being used.
This is on Skylake + Win10/x64.
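For context, the two launch shapes being compared look roughly like this on the host side (a simplified sketch, not the actual benchmark; the function name is made up and the kernel is assumed to take a __local pointer as argument 1):

    #include <CL/cl.h>

    /* Illustrative only: launch the same kernel in the two shapes being compared. */
    void launch_both_cases(cl_command_queue queue, cl_kernel kernel)
    {
        size_t const global = 1 << 20;

        /* Case A: 8-item work groups (subgroup == work group), 512 B of SLM per group. */
        size_t local_a = 8;
        clSetKernelArg(kernel, 1, 512, NULL);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local_a, 0, NULL, NULL);

        /* Case B: 64-item work groups, 8 x 512 B = 4 KB of SLM per group. */
        size_t local_b = 64;
        clSetKernelArg(kernel, 1, 8 * 512, NULL);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local_b, 0, NULL, NULL);
    }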
It might be a poor test since I'm still spilling a lot of bytes per subgroup and spills really punish performance. Perhaps they're dominating the benchmark...
I have an update to the info I previously posted. For 6th Generation Core/Skylake/Gen9 there are 32 barrier registers, meaning 32 active work groups per subslice.
Spilling registers to global memory will make your kernel highly I/O bound -- as you mention, this could be the main bottleneck with optimizations elsewhere not making much of a difference.
Have you tried Code Builder kernel analysis? Experiments with this tool could give you some insights, such as what parts of your code are efficiently compiling to Gen instructions.
If your current kernel is spilling, would it make sense to build up a simpler kernel and test it? That is, start with your memory transfers, then add something simple that touches SLM, so you can double-check that the memory movement is behaving as expected without all of the extra I/O the spilling causes.
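Something along these lines, perhaps (a rough sketch of what I mean; the name and access pattern are made up, the point is just that it moves memory and touches SLM without enough live state to spill):

    __kernel void slm_touch_probe(__global const uint * restrict in,
                                  __global uint       * restrict out,
                                  __local  uint       * restrict slm)
    {
        uint const lid = get_local_id(0);
        uint const gid = get_global_id(0);

        slm[lid] = in[gid];          /* global -> SLM */
        out[gid] = slm[lid] + 1u;    /* SLM -> global; each item touches only its own slot, so no barrier */
    }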
Unfortunately, Code Builder dies on my kernel so I'm flying blind. :(
I can construct a non-spilling kernel with missing functionality and will try to see if the 4KB minimum allocation reveals itself.
The minimum SLM allocation granularity was changed to 1KB on Gen9. Search for "Shared Local Memory Size" in INTERFACE_DESCRIPTOR_DATA here:
That said, I agree that the performance you're seeing is likely due to register spilling and filling, so that's the first problem to solve.
Thanks for confirming that the min is now 1KB.
Being able to launch an SLM-using but barrier-free subgroup on each hardware thread (56 per subslice) is a win.
Hopefully I'm observing all 56 threads per subslice in action and my work groups aren't being capped at 32 even though I'm not using barriers.
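For reference, one way to force subgroup == work group so that each work group maps to a single hardware thread is the required-size attributes. A sketch, assuming the cl_intel_required_subgroup_size extension is available (kernel name and body are illustrative):

    __attribute__((intel_reqd_sub_group_size(8)))
    __attribute__((reqd_work_group_size(8, 1, 1)))
    __kernel void one_subgroup_per_thread(__global uint * restrict out,
                                          __local  uint * restrict slm)
    {
        uint const lid = get_local_id(0);   /* also the subgroup-local id here */
        slm[lid] = lid;
        out[get_global_id(0)] = slm[lid];   /* no barrier: each item uses its own SLM slot */
    }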
A few more observations:
- It would be cool if the IOC assembly dump identified which command was being invoked next to the SEND instruction.
- get_sub_group_id() appears to use the MATH.IQOT (integer quotient) instruction. Why not a SHR? (See the sketch after the code below.)
- get_sub_group_id() could potentially be treated as "uniform per subgroup"; although that might be a little more difficult to express, it would only require get_max_sub_group_size() registers to represent. It's also interesting that I see significantly different performance from my subgroup-centric kernel depending on which of the following ways I compute an initial index:
    #if 1 // FASTER -- just one subgroup per workgroup
      uint const idx = get_group_id(0);
    #else // SLOWER -- assumes more than one subgroup per workgroup
      uint const idx = get_group_id(0) * SUBGROUPS_PER_WORKGROUP + get_sub_group_id();
    #endif
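Regarding the MATH.IQOT observation above: for a power-of-two subgroup size the same value should be expressible with a shift, something like this (illustrative only, 1D work groups assumed; the macro and helper name are made up):

    #define LOG2_SUB_GROUP_SIZE 3   /* assuming an 8-wide subgroup */

    /* Equivalent of get_sub_group_id() for a 1D work group with a
       power-of-two subgroup size: a right shift rather than an integer divide. */
    uint sub_group_id_via_shift(void)
    {
        return ((uint)get_local_id(0)) >> LOG2_SUB_GROUP_SIZE;
    }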