OpenCL 2.0 has no support for a "ballot" style sub-group function. A ballot returns a bitmask containing the conditional flag for each "lane" in the sub-group. As long as the sub-group (SIMD) size is 32 or less, this fits in a cl_uint.
Presumably sub-group any() and all() are implemented on Broadwell IGP by returning an ARF flag register?
It would be great if Broadwell IGP unofficially implemented sub_group_any() by returning the actual flag bitmask so that developers could apply popcount() and other operations to the mask.
For those not aware, a classic use case for a ballot mask is packing data in a sub-group into a local memory array without having to use a full exclusive add scan. It's very efficient.
You can implement a ballot() with an inclusive scan but that's going to be ~8x as many ops for SIMD16.
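To make the packing use case concrete, here is a minimal host-side sketch in plain C that simulates the lanes of one sub-group in a loop. It is not driver or kernel code: the names `sim_ballot`, `sim_popcount` and `sim_compact` are made up for illustration, and on real hardware the ballot would be a single instruction rather than a loop. Each lane with a true predicate writes its value at the index given by the number of set mask bits below its own lane.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side simulation only; all function names here are hypothetical. */

/* Build the ballot mask: bit i is set when lane i's predicate is non-zero. */
static uint32_t sim_ballot(const int *pred, unsigned lanes)
{
    assert(lanes <= 32); /* the mask only fits in 32 bits for SIMD32 or narrower */
    uint32_t mask = 0;
    for (unsigned lane = 0; lane < lanes; ++lane)
        if (pred[lane])
            mask |= (uint32_t)1 << lane;
    return mask;
}

/* Portable population count (clears the lowest set bit each iteration). */
static unsigned sim_popcount(uint32_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;
        ++n;
    }
    return n;
}

/* Pack the values of lanes whose predicate is set into out[], keeping lane
   order. Returns the number of values written. */
static unsigned sim_compact(const int *values, const int *pred,
                            unsigned lanes, int *out)
{
    uint32_t mask = sim_ballot(pred, lanes);
    for (unsigned lane = 0; lane < lanes; ++lane) {
        if (pred[lane]) {
            uint32_t lower = ((uint32_t)1 << lane) - 1u; /* lanes below this one */
            out[sim_popcount(mask & lower)] = values[lane];
        }
    }
    return sim_popcount(mask);
}
```

In a kernel, the loop body would run once per work-item and `out` would be the local memory array; the only cross-lane operation needed is the ballot itself.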
Allan,
Internally, we do have such functionality. I am trying to find out from our driver architects when we can get it into a production driver.
Thanks Robert!
-Allan M.
One way of exposing portable ballot() functionality might be to use my suggestion here:
https://github.com/KhronosGroup/SPIRV-Headers/issues/9
The alternative solution at the bottom can be implemented with a simple compiler optimization and integrated immediately into Intel's OpenCL compiler.
Perhaps you're already doing this?
——————————————————————————
A native ballot() operation is a useful primitive to exploit for warp/wave/simd work compaction.
A subgroup ballot() operation is not exposed in SPIR-V or OpenCL (right?) and the existence of architectures with sub_group widths over 32 lanes precludes this from being represented with a uint32_t.
If the OpGroupIAdd opcode was relaxed to support differing return and argument types — specifically, an integer return type and boolean argument — then SPIR-V would be able to optionally efficiently express:
popcount( ballot() & lanes_less_than() )
popcount( ballot() & lanes_less_than_or_equal() )
popcount( ballot() )
This would then allow OpenCL to expose the following potentially optimal sub_group functions:
int sub_group_scan_exclusive_add(bool pred)
int sub_group_scan_inclusive_add(bool pred)
int sub_group_reduce_add(bool pred)
Alternatively, simply recognizing cases where the integer subgroup scan argument is guaranteed to be 0 or 1 would allow a native popcount( ballot() & lanes_mask_xxx() ) sequence to be emitted and the OpGroupIAdd opcode specification left as is.
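As a rough illustration of the three expressions above, the following plain C sketch computes the predicate scans and reduction from an already-formed 32-bit ballot mask. It runs on the host and assumes a sub-group of at most 32 lanes; `popcount32`, `pred_scan_exclusive_add`, `pred_scan_inclusive_add` and `pred_reduce_add` are illustrative names, not OpenCL or SPIR-V built-ins, and the `lanes_less_than*` helpers simply mirror the names used in the issue text.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side sketch; every function name here is hypothetical. */

/* Portable population count (clears the lowest set bit each iteration). */
static unsigned popcount32(uint32_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;
        ++n;
    }
    return n;
}

/* Mask with a bit set for every lane index strictly below `lane`. */
static uint32_t lanes_less_than(unsigned lane)
{
    assert(lane < 32);
    return ((uint32_t)1 << lane) - 1u;
}

/* Mask with a bit set for every lane index up to and including `lane`. */
static uint32_t lanes_less_than_or_equal(unsigned lane)
{
    assert(lane < 32);
    return lanes_less_than(lane) | ((uint32_t)1 << lane);
}

/* Exclusive scan of the predicate: true lanes strictly below `lane`. */
static int pred_scan_exclusive_add(uint32_t ballot, unsigned lane)
{
    return (int)popcount32(ballot & lanes_less_than(lane));
}

/* Inclusive scan of the predicate: true lanes up to and including `lane`. */
static int pred_scan_inclusive_add(uint32_t ballot, unsigned lane)
{
    return (int)popcount32(ballot & lanes_less_than_or_equal(lane));
}

/* Reduction of the predicate: total number of true lanes. */
static int pred_reduce_add(uint32_t ballot)
{
    return (int)popcount32(ballot);
}
```

Each result costs one AND and one popcount per lane once the mask exists, which is why recognizing a 0-or-1 scan argument could let a compiler skip the general integer scan.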
Hi Allan,
Couple of questions: 1) are you or your company a Khronos member? 2) does your company have an NDA with Intel in place?
Our OpenCL driver architect just pointed out:
Of note, there’s also a related GLSL extension that the Vulkan folks are looking at adding:
https://www.opengl.org/registry/specs/ARB/shader_ballot.txt
I could do with ballot too. I am a Khronos member, but it's for an open source project, so I don't think that will be particularly useful. Note that I'm fine with the solution being vendor-specific, e.g. inline assembler. For example, ballot is available on NVIDIA using inline assembler, even though NVIDIA itself only supports OpenCL 1.2: https://github.com/hughperkins/neonCl-underconstruction/blob/52d46b105dd9780ef7120831e143bba466c0d165/neoncl/backends/kernels/cl/convolution_cl.py#L615-L630
