OpenCL* for CPU

sub_group_broadcast() broken on GEN9 (


I have a kernel with a "required subgroup size" of 8.

My test is launching a grid of 24 global work items and 8 local work items (only for testing purposes).

After much debugging, the sub_group_broadcast() function was determined to be the culprit.

Replacing it with work_group_broadcast() resulted in a working kernel.

Is this a known bug?  

All of the other sub_group_XXX() functions appear to be working.


Platform: Win10 x64, HD 530,



2 Replies

Thanks for this report.  I have not seen this on the bug list.  Is there anything you can send us as a reproducer?


I tried a bunch of workarounds this morning including building a repro case.

The repro case (attached at bottom) works in isolation.

I'm broadcasting a 64-bit ulong across the subgroup, so I resorted to printf(). It revealed that only the low dword of the 64-bit ulong was being broadcast; the high dword was 0.

The quick workaround? The value I was broadcasting is a union type that exposes both a ulong and separate lo and hi uints, so explicitly splitting the broadcast into lo and hi broadcasts worked around the problem.

// sg_lid = [0,7]
// keys is a sub group wide register with a different key in each lane/item
// key is broadcast and then processed by the subgroup
#if   0
          key.b64    = sub_group_broadcast(keys.b64,sg_lid);    // FAIL
#elif 1
          key.lo.b32 = sub_group_broadcast(keys.lo.b32,sg_lid); // WORKS
          key.hi.b32 = sub_group_broadcast(keys.hi.b32,sg_lid);
#else
          key.b64    = work_group_broadcast(keys.b64,sg_lid);   // WORKS BUT BAD
#endif
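
For readers who want to see the arithmetic behind the lo/hi workaround, here is a minimal host-side model in plain C. It is only a sketch and does not run on the device. The names `model_broadcast_u32`, `model_broadcast_u64_split`, and `model_broadcast_u64_truncated` are hypothetical and are not part of OpenCL, and the subgroup is modelled as an 8-element array, one entry per lane.

```c
#include <assert.h>
#include <stdint.h>

#define SG_SIZE 8 /* subgroup size used in this thread */

/* Hypothetical stand-in for a 32-bit sub_group_broadcast():
 * every lane receives the value held by lane src_lane. */
static uint32_t model_broadcast_u32(const uint32_t lanes[SG_SIZE], unsigned src_lane)
{
    assert(src_lane < SG_SIZE);
    return lanes[src_lane];
}

/* The workaround: broadcast the low and high dwords separately,
 * then recombine them into one 64-bit value. */
static uint64_t model_broadcast_u64_split(const uint64_t keys[SG_SIZE], unsigned src_lane)
{
    uint32_t lo[SG_SIZE];
    uint32_t hi[SG_SIZE];

    for (unsigned i = 0; i < SG_SIZE; i++) {
        lo[i] = (uint32_t)(keys[i] & 0xFFFFFFFFu);
        hi[i] = (uint32_t)(keys[i] >> 32);
    }

    uint32_t b_lo = model_broadcast_u32(lo, src_lane);
    uint32_t b_hi = model_broadcast_u32(hi, src_lane);

    return ((uint64_t)b_hi << 32) | b_lo;
}

/* Models the symptom reported by printf(): the low dword arrives,
 * the high dword reads as 0. */
static uint64_t model_broadcast_u64_truncated(const uint64_t keys[SG_SIZE], unsigned src_lane)
{
    assert(src_lane < SG_SIZE);
    return keys[src_lane] & 0xFFFFFFFFull;
}
```

Comparing the two 64-bit models for a key whose high dword is non-zero shows the difference: the split version returns the original key, the truncated version returns only its low 32 bits.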

So... the compiler is failing somewhere.

I can't send my codebase at this time so my report isn't very helpful.

The working repro case for broadcasting ulongs is below:

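// Indexing assumes a subgroup size of 8, matching the "required subgroup size" of the failing kernel.
__kernel __attribute__((intel_reqd_sub_group_size(8)))
void bug_sub_group_broadcast(__global ulong const * restrict const vin, __global ulong * restrict const vout)
{
	uint const base = (uint)get_group_id(0) * get_enqueued_num_sub_groups() + get_sub_group_id();

	// Each lane loads one ulong.
	ulong t_s = vin[base * 8 + get_sub_group_local_id()];

	// Broadcast each lane's value in turn; every lane stores what it received.
	for (int ii=0; ii<8; ii++)
		vout[base * 8 * 8 + ii * 8 + get_sub_group_local_id()] = sub_group_broadcast(t_s, ii);
}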
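
The repro output can be checked on the host with something like the following C sketch. It assumes the layout written by the kernel above (for subgroup `sg`, row `ii` of `vout` holds eight copies of `vin[sg * 8 + ii]`) and that the buffers have already been read back. The function name `check_broadcast_output` and its parameters are hypothetical. It also counts mismatches that match the "low dword only" pattern seen with printf().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SG_SIZE 8 /* subgroup size assumed by the repro kernel */

/* Returns the number of vout elements that differ from the expected
 * broadcast value. *low_dword_only receives how many of those mismatches
 * equal the expected value with its high 32 bits cleared. */
static size_t check_broadcast_output(const uint64_t *vin,
                                     const uint64_t *vout,
                                     size_t num_subgroups,
                                     size_t *low_dword_only)
{
    assert(vin && vout && low_dword_only);

    size_t bad = 0;
    *low_dword_only = 0;

    for (size_t sg = 0; sg < num_subgroups; sg++) {
        for (size_t ii = 0; ii < SG_SIZE; ii++) {
            /* Value that lane ii broadcast to the whole subgroup. */
            uint64_t expected = vin[sg * SG_SIZE + ii];

            for (size_t lane = 0; lane < SG_SIZE; lane++) {
                uint64_t got = vout[sg * SG_SIZE * SG_SIZE + ii * SG_SIZE + lane];
                if (got == expected)
                    continue;
                bad++;
                if (got == (expected & 0xFFFFFFFFull))
                    (*low_dword_only)++;
            }
        }
    }
    return bad;
}
```

With the launch described earlier (24 global work items, 8 local), `num_subgroups` would be 3, `vin` holds 24 values and `vout` holds 192. Input values should have non-zero high dwords, otherwise a truncated broadcast is indistinguishable from a correct one.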

