OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.
Announcements
This forum covers OpenCL* for CPU only. OpenCL* for GPU questions can be asked in the GPU Compute Software forum. Intel® FPGA SDK for OpenCL™ questions can be asked in the FPGA Intel® High Level Design forum.

Is there any GEN-friendly idiom for communicating subgroup uniformity?

allanmac1
Beginner
511 Views

It seems to me that GEN might benefit more from detecting "subgroup uniform" values than other architectures because of its unique register file architecture and instruction set.

Are there any GEN idioms that you've discovered that nudge/help the compiler so it's able to determine that variables used by a subgroup are actually scalars (subgroup uniform)?

For example, would an idiom like this:

kernel void foo(...)
{
  uint const sg_id = get_sub_group_id();

  // Always true: intended only as a uniformity hint to the compiler
  if (sg_id == get_sub_group_id())
  {
    // rest of kernel
  }
}

or an idiom like this (perhaps better):

kernel void foo(...)
{
  // Always true: every work item in the subgroup passes the same predicate
  if (sub_group_all(true))
  {
    // rest of kernel
  }
}

... help the compiler determine that the subgroups are running "in isolation" and therefore any future function involving get_sub_group_id() (or similar) would be uniform?

I suspect this hasn't been implemented but it might be a useful idiom for both performance and reducing register pressure.
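To make "subgroup uniform" concrete: a value is uniform when every work item (lane) in the subgroup holds the same value, so in principle it needs one scalar slot rather than one slot per lane. The following is a minimal host-side sketch in plain C, not OpenCL C, that emulates a SIMD8 subgroup as an array of lanes. The helper names are hypothetical, and broadcast_from_lane() only mimics the effect of the sub_group_broadcast() builtin; whether the GEN compiler exploits such a broadcast as a uniformity hint is not confirmed anywhere in this thread.

```c
#include <stdbool.h>
#include <stddef.h>

#define SUBGROUP_SIZE 8 /* assumption: SIMD8 subgroups, as discussed in this thread */

/* Hypothetical helper: true when every lane holds the same value,
   i.e. the value is subgroup uniform. */
static bool is_subgroup_uniform(const unsigned lanes[], size_t n)
{
    for (size_t i = 1; i < n; ++i)
        if (lanes[i] != lanes[0])
            return false;
    return true;
}

/* Hypothetical helper: copies the value held by lane `src` into every lane,
   emulating the effect of sub_group_broadcast(value, src). */
static void broadcast_from_lane(unsigned lanes[], size_t n, size_t src)
{
    const unsigned v = lanes[src];
    for (size_t i = 0; i < n; ++i)
        lanes[i] = v;
}
```

After a broadcast the value is uniform by construction, which is the property the idioms above are trying to communicate to the compiler.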

0 Kudos
2 Replies
Jeffrey_M_Intel1
Employee

So far I have not been able to find anything that exactly fits.  However, we will keep this in mind for future documentation and features.

For now, would it help at all to set up "subgroup uniform" values using SLM or possibly images to take advantage of hardware shared within subgroups?  

allanmac1
Beginner

Thanks for taking a look...

My workaround is to simply launch subgroup-wide workgroups (in this case 8 item workgroups).
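The reason this makes uniformity trivial: when the workgroup size equals the subgroup size, each workgroup holds exactly one subgroup, so the subgroup ID is 0 for every work item and the workgroup ID identifies the subgroup. A host-side sketch in plain C (not OpenCL C) of that index arithmetic, assuming a 1D launch and a linear mapping of local IDs to subgroups; the mapping is implementation-defined, and both function names are hypothetical:

```c
#include <stddef.h>

/* Hypothetical emulation of get_sub_group_id() for a 1D launch,
   assuming consecutive local IDs are packed into the same subgroup. */
static size_t emulated_sub_group_id(size_t global_id, size_t local_size,
                                    size_t sub_group_size)
{
    const size_t local_id = global_id % local_size;
    return local_id / sub_group_size;
}

/* Hypothetical emulation of get_group_id(0) for a 1D launch. */
static size_t emulated_group_id(size_t global_id, size_t local_size)
{
    return global_id / local_size;
}
```

With a local size of 8 and SIMD8 subgroups, emulated_sub_group_id() returns 0 for every work item, while a 32-item workgroup would spread its work items over subgroup IDs 0 to 3.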

That works really well on Skylake, but it might not be a long-term solution, and because of the local memory granularity rules I'm unable to exploit all 64KB of local memory per subslice.

I would rather launch two workgroups with 28 SIMD8 subgroups each, give each subgroup access to ~1170 bytes of local memory (64KB split across 56 subgroups), and let each subgroup run independently.
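The per-subgroup budget follows from dividing the subslice's local memory by the number of resident subgroups. A minimal sketch in plain C, assuming 64KB means 65536 bytes and an even split with no allocation-granularity padding; the function name is hypothetical:

```c
#include <stddef.h>

/* Hypothetical helper: bytes of local memory available to each subgroup
   when `slm_bytes` is split evenly across all subgroups of all workgroups.
   Example: 65536 bytes, 2 workgroups, 28 subgroups each -> 1170 bytes. */
static size_t slm_bytes_per_subgroup(size_t slm_bytes, size_t workgroups,
                                     size_t subgroups_per_workgroup)
{
    return slm_bytes / (workgroups * subgroups_per_workgroup);
}
```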

Bouncing data through SLM to help indicate uniformity is an option, but I still think the generated code couldn't possibly be as good as when the compiler actually knows that a sequence is subgroup isolated.

You could always provide us a GEN assembler! :)

 
