OpenCL* for CPU

Broadwell IGP needs more sub_group functions


OpenCL 2.0 has no support for a "ballot"-style sub-group function. A ballot returns a bitmask containing the conditional flag for each "lane" in the sub-group. As long as the sub-group (SIMD) size is 32 or less, this fits in a cl_uint.
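To make the definition concrete, here is a minimal host-side sketch in plain C that emulates what a ballot would return for one sub-group. It is not device code: `emulated_ballot` is a hypothetical name, not an OpenCL built-in, and the predicate array stands in for the per-lane conditional flags.

```c
#include <stdint.h>

/* Host-side emulation of a sub-group ballot (hypothetical helper, not an
   OpenCL built-in). Bit i of the result is set when lane i's predicate is
   nonzero. Assumes lane_count <= 32 so the mask fits in 32 bits. */
uint32_t emulated_ballot(const int *pred, unsigned lane_count)
{
    uint32_t mask = 0;
    for (unsigned lane = 0; lane < lane_count; ++lane) {
        if (pred[lane]) {
            mask |= (uint32_t)1 << lane;
        }
    }
    return mask;
}
```

On real hardware every lane would receive the same mask in a single operation; the loop here only models the result.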

Presumably sub-group any() and all() are implemented on Broadwell IGP by returning an ARF flag register?

It would be great if Broadwell IGP unofficially implemented sub_group_any() by returning the actual flag bitmask so that developers could apply popcount() and other operations to the mask.

For those not aware, a classic use case for a ballot mask is packing data in a sub-group into a local memory array without having to use a full exclusive add scan.  It's very efficient.
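The packing use case can be sketched on the host in plain C. This is an emulation under stated assumptions, not device code: `compact_with_ballot` and `popcount32` are hypothetical names, the `out` array stands in for the local memory array, and the sub-group is assumed to have at most 32 lanes. Each kept lane finds its output slot by counting the set ballot bits belonging to lower-numbered lanes.

```c
#include <stdint.h>

/* Portable population count: clears the lowest set bit each iteration. */
unsigned popcount32(uint32_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;
        ++n;
    }
    return n;
}

/* Emulates one sub-group packing its kept values into `out`.
   Returns the number of values written. Assumes lane_count <= 32. */
unsigned compact_with_ballot(const int *values, const int *keep,
                             unsigned lane_count, int *out)
{
    /* Step 1: the ballot, one bit per lane whose predicate is true. */
    uint32_t ballot = 0;
    for (unsigned lane = 0; lane < lane_count; ++lane) {
        if (keep[lane]) {
            ballot |= (uint32_t)1 << lane;
        }
    }

    /* Step 2: each kept lane writes to the slot given by the number of
       kept lanes below it. */
    for (unsigned lane = 0; lane < lane_count; ++lane) {
        if (keep[lane]) {
            uint32_t lanes_below = ((uint32_t)1 << lane) - 1u;
            out[popcount32(ballot & lanes_below)] = values[lane];
        }
    }
    return popcount32(ballot);
}
```

The second loop would run as independent per-lane work on a device, which is why no scan across lanes is needed once the mask is available.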

You can implement a ballot() with an inclusive scan, but that's going to be ~8x as many ops for SIMD16.




Internally, we do have such functionality. I am trying to figure out from our driver architects when we can get this functionality into a production driver.


Thanks Robert!

-Allan M.


One way of exposing portable ballot() functionality might be to use my suggestion here:

The alternative solution at the bottom can be implemented with a simple compiler optimization and integrated immediately into Intel's OpenCL compiler.

Perhaps you're already doing this?


A native ballot() operation is a useful primitive to exploit for warp/wave/simd work compaction.

A subgroup ballot() operation is not exposed in SPIR-V or OpenCL (right?), and the existence of architectures with sub_group widths over 32 lanes precludes this from being represented with a uint32_t.

If the OpGroupIAdd opcode were relaxed to support differing return and argument types (specifically, an integer return type and a boolean argument), then SPIR-V could optionally express the following efficiently:

popcount( ballot() & lanes_less_than() )
popcount( ballot() & lanes_less_than_or_equal() )
popcount( ballot() )

This would then allow OpenCL to expose the following potentially optimal sub_group functions:

int sub_group_scan_exclusive_add(bool pred)
int sub_group_scan_inclusive_add(bool pred)
int sub_group_reduce_add(bool pred)

Alternatively, simply recognizing cases where the integer subgroup scan argument is guaranteed to be 0 or 1 would allow a native popcount( ballot() & lanes_mask_xxx() ) sequence to be emitted and the OpGroupIAdd opcode specification left as is.
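The equivalence this optimization relies on can be checked with a small host-side emulation in plain C. Everything here is a sketch under assumptions: the helper names (`popcount_u32`, `lanes_less_than`, `lanes_less_than_or_equal`, `reference_exclusive_scan`, `ballot_scans`) are hypothetical, the sub-group is assumed to have at most 32 lanes, and the loops only model per-lane results rather than how a compiler would emit them.

```c
#include <stdint.h>

/* Portable population count: clears the lowest set bit each iteration. */
unsigned popcount_u32(uint32_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;
        ++n;
    }
    return n;
}

/* Mask of all lanes strictly below `lane` (lane must be < 32). */
uint32_t lanes_less_than(unsigned lane)
{
    return ((uint32_t)1 << lane) - 1u;
}

/* Mask of all lanes up to and including `lane` (lane must be < 32). */
uint32_t lanes_less_than_or_equal(unsigned lane)
{
    return lanes_less_than(lane) | ((uint32_t)1 << lane);
}

/* Reference: a serial exclusive add scan over a 0/1 argument. */
void reference_exclusive_scan(const int *pred, unsigned lane_count, int *out)
{
    int sum = 0;
    for (unsigned lane = 0; lane < lane_count; ++lane) {
        out[lane] = sum;
        sum += pred[lane] ? 1 : 0;
    }
}

/* The same results computed from a ballot mask plus popcount.
   Fills per-lane exclusive and inclusive scans and returns the reduction. */
int ballot_scans(const int *pred, unsigned lane_count,
                 int *exclusive, int *inclusive)
{
    uint32_t ballot = 0;
    for (unsigned lane = 0; lane < lane_count; ++lane) {
        if (pred[lane]) {
            ballot |= (uint32_t)1 << lane;
        }
    }
    for (unsigned lane = 0; lane < lane_count; ++lane) {
        exclusive[lane] = (int)popcount_u32(ballot & lanes_less_than(lane));
        inclusive[lane] = (int)popcount_u32(ballot & lanes_less_than_or_equal(lane));
    }
    return (int)popcount_u32(ballot);
}
```

When the scan argument is known to be 0 or 1, the ballot-based results match the serial reference lane for lane, which is the property a compiler would need to prove before substituting one sequence for the other.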



Hi Allan,

A couple of questions: 1) Are you or your company a Khronos member? 2) Does your company have an NDA with Intel in place?

Our OpenCL driver architect just pointed out:

Of note, there’s also a related GLSL extension that the Vulkan folks are looking at adding:

We have a lot of activity on this subject but nothing to announce publicly yet.

I could do with ballot too. I am a Khronos member, but it's for an open-source project, so I don't think that will be particularly useful. Note that I'm fine with the solution being vendor-specific, e.g. inline assembler. For example, ballot is available on NVIDIA using inline assembler, even though NVIDIA itself only supports OpenCL 1.2.