Hi,
I have some code that was developed for CUDA, and it relies heavily on 32-wide warps.
Is there a way to force the Intel GPU compiler to compile for SIMD32?
Thanks!
Hi Vladimir,
There is, but, unfortunately, it is not exposed in publicly available drivers. SIMD32 is selected automatically for small kernels only, e.g. kernels smaller than ~150 assembly instructions.
I would recommend setting the work group size to 32 - this should give you the correct behavior.
Thanks!
My work-group size has to be 128 or 256 due to heavy shared memory usage, and the kernels are pretty large, with more than 5k instructions. Also, OpenCL for AMD cards uses close to 64 registers (NVIDIA's compiler correctly switches to spilling a few and optimizes them down to 32 registers per thread for maximum occupancy).
Will setting the required work-group size to 256 or 128 force SIMD32 in this environment?
Hi Vladimir,
Setting the work-group size to 128 or 256 would ensure correct behavior. With large kernels and heavy shared memory use, you are most likely to get SIMD8 compilation. Forcing SIMD32, even if it were publicly available, wouldn't help. In fact, in my experience it would probably harm performance. SIMD8 or SIMD16 compilation should not affect correctness, though.
Let me know if you experience correctness issues after translating to OpenCL.
Hm,
the code specifically relies on warp_size being at least 32, so that it does not have to call barriers all the time. It aggregates across the 32 threads in a warp and has only a mem_fence between the relevant reads and writes.
I'd assume that if it compiles with SIMD16 or SIMD8, the warp_size will be 16 or 8, and threads may diverge and cause correctness issues. Is this the case? Do I correctly understand the relationship between warp_size, SIMDXX, and the way work is scheduled?
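The concern above can be illustrated with a small host-side model in plain C. This is only a sketch under simplifying assumptions, not a description of how Intel hardware actually schedules work: a 64-element array of ones is reduced by 64 work-items, the work-items are grouped into hardware threads of `width` lanes that execute in lockstep, and each hardware thread runs the whole step sequence to completion before the next one starts (one schedule that is legal when no barrier separates the steps). The names `lockstep_step` and `reduce_without_barriers` are hypothetical.

```c
#include <assert.h>

#define N 64  /* elements in the shared array, one work-item per element */

/* Hypothetical model: one reduction step executed in lockstep by the
 * `width` work-items of one hardware thread, starting at item `base`.
 * All active lanes read before any lane writes, as SIMD execution would. */
static void lockstep_step(int *foo, int base, int width, int stride)
{
    int tmp[N];
    for (int i = base; i < base + width; ++i)
        if (i < stride)
            tmp[i] = foo[i] + foo[i + stride];
    for (int i = base; i < base + width; ++i)
        if (i < stride)
            foo[i] = tmp[i];
}

/* Barrier-free tree reduction of N ones. Each hardware thread runs every
 * step before the next hardware thread starts, so a thread can read
 * partial sums that another thread has not produced yet.
 * Returns foo[0]; the correct sum is N. */
int reduce_without_barriers(int width)
{
    assert(width > 0 && N % width == 0);

    int foo[N];
    for (int i = 0; i < N; ++i)
        foo[i] = 1;

    for (int base = 0; base < N; base += width)
        for (int stride = N / 2; stride >= 1; stride /= 2)
            lockstep_step(foo, base, width, stride);

    return foo[0];
}
```

In this model, `reduce_without_barriers(32)` returns the correct sum of 64 because every work-item that writes belongs to the same lockstep group, while widths of 16 or 8 return a wrong sum under this particular schedule. A different schedule may happen to produce the right answer, which is what makes such bugs hard to spot.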
Yes, if you have code like the following:
if (i < 32) foo[i] += foo[i + 32];
if (i < 16) foo[i] += foo[i + 16];
if (i < 8) foo[i] += foo[i + 8];
if (i < 4) foo[i] += foo[i + 4];
if (i < 2) foo[i] += foo[i + 2];
if (i < 1) foo[i] += foo[i + 1];
with no barriers, you will need to assume the worst case of SIMD8 and put a couple of barriers in :(. So yes, you may have correctness issues if you don't do that. If you can use OpenCL 2.0 (only on Broadwell chips, though), you can use the work-group scan functions for cases like the one above.
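The effect of adding barriers can be sketched with a host-side model in plain C. It is a simplified illustration, not Intel's actual scheduling: 64 work-items reduce a 64-element array of ones, the work-items are grouped into lockstep hardware threads of `width` lanes, and every hardware thread finishes a step before any of them starts the next one, which is the ordering that a `barrier(CLK_LOCAL_MEM_FENCE)` between steps guarantees in a real kernel. The function name `reduce_with_barriers` is hypothetical.

```c
#include <assert.h>

#define N 64  /* elements in the shared array, one work-item per element */

/* Tree reduction of N ones where each step completes across the whole
 * work-group before the next step begins.
 * Returns foo[0]; the correct sum is N. */
int reduce_with_barriers(int width)
{
    assert(width > 0 && N % width == 0);

    int foo[N];
    for (int i = 0; i < N; ++i)
        foo[i] = 1;

    for (int stride = N / 2; stride >= 1; stride /= 2) {
        /* Every hardware thread executes this one step in lockstep:
         * all active lanes read first, then all active lanes write. */
        for (int base = 0; base < N; base += width) {
            int tmp[N];
            for (int i = base; i < base + width; ++i)
                if (i < stride)
                    tmp[i] = foo[i] + foo[i + stride];
            for (int i = base; i < base + width; ++i)
                if (i < stride)
                    foo[i] = tmp[i];
        }
        /* Leaving the inner loop models the barrier: no work-item moves
         * on to the next stride until all have finished this one. */
    }

    return foo[0];
}
```

In this model the result is 64 for widths of 8, 16, and 32 alike, so correctness no longer depends on the SIMD width the compiler picks. In an actual OpenCL kernel the barrier has to sit outside the `if (i < stride)` condition, because every work-item in the work-group must reach it.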
