OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.

Intel GEN SLM allocation granularity still 4KB per workgroup?

allanmac1
Beginner

Given a kernel that uses no barriers, does this recommendation still hold for GEN8 and beyond?

https://software.intel.com/en-us/node/540442

NOTE

A bare minimum SLM allocation size is 4k per workgroup, so even if your kernel requires less bytes per work-group, the actual allocation still will be 4k. To accommodate many potential execution scenarios try to minimize local memory usage to fit the optimal value of 4K per workgroup. Also notice that the granularity of SLM allocation is 1K.

 

6 Replies
Jeffrey_M_Intel1
Employee

As far as I know this is still true. You can check with a clGetDeviceInfo query for CL_DEVICE_LOCAL_MEM_SIZE. The amount of shared local memory per subslice is still 64 KB on my Gen9 (Skylake / 6th Generation Core) processor. Gen9 continues to support 16 active work groups per subslice.
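
For reference, a minimal host-side sketch of that query (it just grabs the first GPU device of the first platform, and error handling is abbreviated):

#include <stdio.h>
#include <CL/cl.h>

/* Minimal sketch: report CL_DEVICE_LOCAL_MEM_SIZE for the first GPU
   device of the first platform.  Error handling is abbreviated.      */
int main(void)
{
    cl_platform_id platform;
    cl_device_id   device;
    cl_ulong       local_mem_size = 0;

    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS ||
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL) != CL_SUCCESS)
        return 1;

    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_mem_size), &local_mem_size, NULL);

    printf("CL_DEVICE_LOCAL_MEM_SIZE = %llu bytes\n",
           (unsigned long long)local_mem_size);
    return 0;
}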

allanmac1
Beginner

BTW,

I'm observing no performance difference between launching 8-item work groups (subgroup == workgroup) that require only 512 bytes of SLM and launching work groups of at least 64 items that require 512x8 bytes (4 KB) of SLM.

No barriers are being used.

This is on Skylake + Win10/x64.

It might be a poor test since I'm still spilling a lot of bytes per subgroup and spills really punish performance.  Perhaps they're dominating the benchmark...
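
For reference, the comparison I'm running is shaped roughly like this (a sketch with made-up names; 'queue' and 'kernel' are assumed to be set up elsewhere, and the kernel uses 64 bytes of SLM per work-item):

#include <CL/cl.h>

/* Sketch of the two launch shapes being compared: the same 1-D kernel
   enqueued with a local size of 8 or 64.  At 64 bytes of SLM per
   work-item that is 512 B vs. 4 KB of SLM per work group.            */
static cl_int launch_1d(cl_command_queue queue, cl_kernel kernel,
                        size_t global_size, size_t local_size)
{
    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                  &global_size, &local_size,
                                  0, NULL, NULL);
}

/* launch_1d(queue, kernel, 1u << 20,  8);    512 B of SLM per work group */
/* launch_1d(queue, kernel, 1u << 20, 64);    4 KB  of SLM per work group */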

 

Jeffrey_M_Intel1
Employee

I have an update to the info I previously posted.  For 6th Generation Core/Skylake/Gen9 there are 32 barrier registers, meaning 32 active work groups per subslice.

Spilling registers to global memory will make your kernel highly I/O bound -- as you mention, this could be the main bottleneck with optimizations elsewhere not making much of a difference.

Have you tried Code Builder kernel analysis? Experiments with this tool could give you some insight, such as which parts of your code compile efficiently to Gen instructions.

If your current kernel is spilling, would it make sense to build up a simpler kernel and test it? That is, start with your memory transfers, then add something simple that touches SLM, so you can double-check that the memory movement behaves as expected without all of the extra I/O caused by the spilling.
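
For example, something along these lines might do (just a sketch; WG_SIZE would be supplied with a -D build option): each work-item stages its own value through SLM, so there are no barriers and essentially no register pressure to cause spills:

// Sketch: stage data through SLM and write it back out.  Each work-item
// touches only its own SLM slot, so no barrier is needed.
__kernel void slm_copy(__global const uint * restrict in,
                       __global       uint * restrict out)
{
    __local uint stage[WG_SIZE];       // WG_SIZE defined with -D at build time

    const uint gid = get_global_id(0);
    const uint lid = get_local_id(0);

    stage[lid] = in[gid];              // global -> SLM
    out[gid]   = stage[lid];           // SLM -> global
}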

allanmac1
Beginner

Unfortunately, Code Builder dies on my kernel so I'm flying blind. :(

I can construct a non-spilling kernel with reduced functionality and will try to see whether the 4 KB minimum allocation reveals itself.

Ben_A_Intel
Employee

The minimum SLM allocation granularity was changed to 1KB on Gen9.  Search for "Shared Local Memory Size" in INTERFACE_DESCRIPTOR_DATA here:

https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-skl-vol02d-commandreference-structures.pdf

That said, I agree that the performance you're seeing is likely due to register spilling and filling, so that's the first problem to solve.

allanmac1
Beginner

Thanks for confirming that the min is now 1KB.  

Being able to launch an SLM-using (but barrier-free) subgroup on each of the 56 hardware threads is a win.

Hopefully I'm observing 56 threads per subslice in action and they're not being capped at 32 even though I'm not using barriers.
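
Back-of-the-envelope, using the 64 KB-per-subslice figure from above, SLM shouldn't be the limiter for a 512-byte request (the helper below is purely illustrative, not an API):

/* Illustrative only: how many work groups fit in a subslice's SLM for a
   given request.  For a 512-byte request with 64 KB of SLM per subslice:
     4 KB minimum (pre-Gen9): 65536 / 4096 = 16 work groups fit SLM-wise
     1 KB minimum (Gen9):     65536 / 1024 = 64 work groups fit SLM-wise
   Actual residency is still capped by the subslice's thread/barrier limits. */
static unsigned max_wgs_by_slm(unsigned slm_per_subslice, unsigned slm_per_wg,
                               unsigned granularity, unsigned minimum)
{
    unsigned rounded = (slm_per_wg + granularity - 1) / granularity * granularity;
    if (rounded < minimum)
        rounded = minimum;
    return slm_per_subslice / rounded;
}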

A few more observations:

  1. It would be cool if the IOC assembly dump identified which command was being invoked next to the SEND instruction.
  2. get_sub_group_id() appears to use the MATH.IQOT instruction.  Why not a SHR? 
  3. get_sub_group_id() could potentially be considered "uniform per subgroup"; although that might be a little more difficult to express, it would only require get_max_sub_group_size() registers to represent. It's also interesting that I'm seeing significantly different performance from my subgroup-centric kernel when I compute an initial index in one of the following ways:
#if 1 // FASTER -- just one subgroup per workgroup
  uint const idx = get_group_id(0);
#else // SLOWER -- assumes more than one subgroup per workgroup
  uint const idx = get_group_id(0) * SUBGROUPS_PER_WORKGROUP + get_sub_group_id();
#endif

 
