I have a kernel with a "required subgroup size" of 8.
My test is launching a grid of 24 global work items and 8 local work items (only for testing purposes).
After much debugging, I determined that the sub_group_broadcast() function was the culprit.
Replacing it with work_group_broadcast() resulted in a working kernel.
Is this a known bug?
All of the other sub_group_XXX() functions appear to be working.
-Allan
Platform: Win10 x64, HD 530, 21.20.16.4552.
Thanks for this report. I have not seen this on the bug list. Is there anything you can send us as a reproducer?
I tried a bunch of workarounds this morning, including building a repro case.
The repro case (attached at bottom) works in isolation, so it does not show the failure.
I'm broadcasting a 64-bit ulong across the subgroup, so I resorted to printf(). It revealed that only the low dword of the 64-bit ulong was being broadcast; the high dword was 0.
The quick workaround? The ulong I was broadcasting is part of a union type that exposes both the full ulong and its lo and hi uint halves, so explicitly splitting the broadcast into separate lo and hi broadcasts worked around the problem.
// sg_lid = [0,7]
// keys is a sub group wide register with a different key in each lane/item
// key is broadcast and then processed by the subgroup
#if 0
  key.b64 = sub_group_broadcast(keys.b64,sg_lid); // FAIL
#elif 1
  key.lo.b32 = sub_group_broadcast(keys.lo.b32,sg_lid); // WORKS
  key.hi.b32 = sub_group_broadcast(keys.hi.b32,sg_lid);
#else
  key.b64 = work_group_broadcast(keys.b64,sg_lid); // WORKS BUT BAD
#endif
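For anyone who wants to reason about the symptom off-device, here is a minimal host-side sketch in plain C (not OpenCL). It emulates an 8-lane subgroup as an array. All names (`emu_broadcast_u32`, `emu_broadcast_u64_split`, `emu_broadcast_u64_truncated`) are hypothetical and not part of any OpenCL API. The "truncated" function only models the observed behaviour (high dword comes back as 0); it says nothing about what the compiler actually does internally.

```c
#include <assert.h>
#include <stdint.h>

#define SG_SIZE 8 /* matches the required subgroup size of 8 */

/* Emulated 32-bit broadcast: every lane receives the value held by src_lane. */
uint32_t emu_broadcast_u32(const uint32_t lanes[SG_SIZE], unsigned src_lane)
{
    assert(src_lane < SG_SIZE);
    return lanes[src_lane];
}

/* The workaround: broadcast the lo and hi halves separately, then recombine. */
uint64_t emu_broadcast_u64_split(const uint64_t lanes[SG_SIZE], unsigned src_lane)
{
    uint32_t lo[SG_SIZE], hi[SG_SIZE];
    for (unsigned i = 0; i < SG_SIZE; i++) {
        lo[i] = (uint32_t)lanes[i];         /* low dword  */
        hi[i] = (uint32_t)(lanes[i] >> 32); /* high dword */
    }
    uint32_t blo = emu_broadcast_u32(lo, src_lane);
    uint32_t bhi = emu_broadcast_u32(hi, src_lane);
    return ((uint64_t)bhi << 32) | blo;
}

/* Model of the reported symptom: only the low dword survives the broadcast. */
uint64_t emu_broadcast_u64_truncated(const uint64_t lanes[SG_SIZE], unsigned src_lane)
{
    assert(src_lane < SG_SIZE);
    return (uint64_t)(uint32_t)lanes[src_lane];
}
```

Comparing the split result against the truncated one for a key whose high dword is non-zero gives the same difference that the printf() output showed.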
So... the compiler is failing somewhere.
I can't send my codebase at this time so my report isn't very helpful.
The working repro case for broadcasting ulongs is below:
__kernel
__attribute__((intel_reqd_sub_group_size(8)))
void
bug_sub_group_broadcast(__global ulong const * restrict const vin,
                        __global ulong       * restrict const vout)
{
  uint const base = (uint)get_group_id(0) * get_enqueued_num_sub_groups() + get_sub_group_id();
  ulong t_s = vin[base * 8 + get_sub_group_local_id()];
  for (int ii=0; ii<8; ii++)
  {
    vout[base * 8 * 8 + ii * 8 + get_sub_group_local_id()] = sub_group_broadcast(t_s, ii);
  }
}
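To check the output of the repro kernel on the host, something like the sketch below could be used. It is plain C, and the helper names `repro_expected` and `repro_first_mismatch` are made up for illustration. It assumes the subgroup index `base` runs from 0 to `num_subgroups - 1` (3 subgroups for the 24-item test launch), and that `vout` holds 64 values per subgroup, as the kernel's indexing implies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SG_SIZE 8 /* matches the required subgroup size of 8 */

/* Fill `expected` (num_subgroups * 64 values) with what the kernel should
 * write: in row ii of each subgroup's block, every lane holds the value that
 * lane ii loaded from vin. */
void repro_expected(const uint64_t *vin, uint64_t *expected, size_t num_subgroups)
{
    assert(vin != NULL && expected != NULL);
    for (size_t base = 0; base < num_subgroups; base++) {
        for (size_t ii = 0; ii < SG_SIZE; ii++) {
            for (size_t lane = 0; lane < SG_SIZE; lane++) {
                expected[base * SG_SIZE * SG_SIZE + ii * SG_SIZE + lane] =
                    vin[base * SG_SIZE + ii];
            }
        }
    }
}

/* Return the index of the first differing element, or -1 if all match. */
long repro_first_mismatch(const uint64_t *got, const uint64_t *expected, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (got[i] != expected[i]) {
            return (long)i;
        }
    }
    return -1;
}
```

Filling `vin` with values whose high dwords are non-zero matters here: an all-small-integers input would hide a dropped high dword.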