OpenCL* for CPU

Is there any GEN-friendly idiom for communicating subgroup uniformity?

allanmac1
Beginner

It seems to me that GEN might benefit more from detecting "subgroup uniform" values than other architectures because of its unique register file architecture and instruction set.

Are there any GEN idioms you've discovered that nudge or help the compiler determine that variables used by a subgroup are actually scalars (subgroup uniform)?

For example, would an idiom like this:

kernel void foo(...)
{
  uint const sg_id = get_sub_group_id();

  if (sg_id == get_sub_group_id())
  {
    // rest of kernel
  }
}

or an idiom like this (perhaps better):

kernel void foo(...)
{
  if (sub_group_all(true))
  {
    // rest of kernel
  }
}

... help the compiler determine that the subgroups are running "in isolation" and therefore any future function involving get_sub_group_id() (or similar) would be uniform?

I suspect this hasn't been implemented but it might be a useful idiom for both performance and reducing register pressure.

2 Replies
Jeffrey_M_Intel1
Employee

So far I have not been able to find anything that exactly fits.  However, we will keep this in mind for future documentation and features.

For now, would it help at all to set up "subgroup uniform" values using SLM or possibly images to take advantage of hardware shared within subgroups?  

allanmac1
Beginner

Thanks for taking a look...

My workaround is simply to launch subgroup-wide workgroups (in this case, 8-item workgroups).

That works really well on Skylake, but it might not be a long-term solution. Also, because of the local memory granularity rules, I'm unable to exploit all 64KB of local memory per subslice.

I would rather launch two workgroups with 28 SIMD8 subgroups each, give each subgroup access to ~1170 bytes of local memory (64KB / 56 subgroups), and let each subgroup run independently.
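
For reference, here is a minimal sketch of the per-subgroup budget arithmetic, assuming the local memory is split evenly across every resident subgroup. The helper name `slm_bytes_per_subgroup` is hypothetical and only illustrates the division. It does not model the allocation granularity rules mentioned above.

```c
/* Hypothetical helper (not part of any OpenCL API): bytes of local memory
   each subgroup would get if total_slm_bytes were divided evenly across
   all subgroups of all resident workgroups. */
static unsigned slm_bytes_per_subgroup(unsigned total_slm_bytes,
                                       unsigned workgroups,
                                       unsigned subgroups_per_workgroup)
{
    unsigned subgroups = workgroups * subgroups_per_workgroup;

    if (subgroups == 0u)
        return 0u; /* nothing resident, nothing to divide */

    return total_slm_bytes / subgroups; /* integer division rounds down */
}
```

With the figures from this post, `slm_bytes_per_subgroup(64u * 1024u, 2u, 28u)` evaluates to 1170, since 65536 / 56 rounds down to 1170.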

Bouncing data through SLM to help indicate uniformity is an option, but I still think the code generation couldn't possibly be as good as when the compiler actually knows that a sequence is subgroup-isolated.

You could always provide us a GEN assembler! :)

 
