There is, but, unfortunately, it is not exposed in publicly available drivers. SIMD32 is selected automatically for small kernels only, e.g. kernels smaller than ~150 assembly instructions.
I would recommend setting the work group size to 32 - this should give you the correct behavior.
My workgroup size has to be 128 or 256 due to heavy shared memory usage, and the kernels are pretty large, with more than 5k instructions. Also, OpenCL for AMD cards uses close to 64 registers (NVIDIA's compiler correctly switches to spilling a few and optimizes them down to 32 registers per thread for maximum occupancy).
Will setting required workgroup size to 256 or 128 force SIMD32 in this environment?
Setting the work group size to 128 or 256 would ensure correct behavior. With large kernels and heavy shared memory use, you are most likely to get SIMD8 compilation. Forcing SIMD32, even if it were publicly available, wouldn't help; in my experience it would probably harm performance. SIMD8 or SIMD16 compilation should not affect correctness, though.
Let me know if you experience correctness issues after translating to OpenCL.
The code specifically relies on warp_size being at least 32, so it does not have to call barriers all the time. It does aggregation between the 32 threads in a warp and has only a mem_fence between the appropriate reads/writes.
I'd assume that if it compiles with SIMD16 or SIMD8, the warp_size will be 16 or 8, and threads may diverge and cause correctness issues. Is this the case? Do I understand the relationship between warp_size, SIMDXX and the way work is scheduled correctly?
Yes, if you have code like the following:
    if (i < 32) foo[i] += foo[i + 32];
    if (i < 16) foo[i] += foo[i + 16];
    if (i < 8) foo[i] += foo[i + 8];
    if (i < 4) foo[i] += foo[i + 4];
    if (i < 2) foo[i] += foo[i + 2];
    if (i < 1) foo[i] += foo[i + 1];
with no barriers, you will need to assume the worst case of SIMD8 and put a couple of barriers in :(. So yes, you may have correctness issues if you don't do that. If you can use OpenCL 2.0 (only on Broadwell chips, though), you can use the work-group functions (reduce and scan) for cases like the one above.
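To make the failure mode concrete, here is a rough host-side sketch in plain C (not OpenCL) that models the reduction above. Everything in it is an assumption for illustration: the names reduce_sim and run_step are made up, the buffer is filled with ones, and the schedule is a simplified model in which the work-items of one SIMD group run in lockstep while, without a barrier, a whole group may run all of its steps before the next group starts. Real hardware scheduling is more complicated, so treat this as one legal interleaving rather than what the driver actually does.

```c
#include <assert.h>

#define N 64       /* elements in the simulated local buffer */
#define ACTIVE 32  /* work-items taking part in the reduction */

/* One reduction step for the `width` work-items starting at `first`.
 * They run in lockstep: every read happens before any write. */
static void run_step(int *foo, int first, int width, int stride)
{
    int tmp[ACTIVE];
    for (int k = 0; k < width; ++k) {
        int i = first + k;
        if (i < stride) tmp[k] = foo[i] + foo[i + stride];
    }
    for (int k = 0; k < width; ++k) {
        int i = first + k;
        if (i < stride) foo[i] = tmp[k];
    }
}

/* Hypothetical model: returns foo[0] after the reduction when the
 * ACTIVE work-items are split into SIMD groups of `width` items.
 * use_barrier == 0: each group runs every step before the next group
 *                   starts (allowed when no barrier separates steps).
 * use_barrier == 1: all groups finish a step before any group starts
 *                   the next one, as a barrier would enforce. */
int reduce_sim(int width, int use_barrier)
{
    assert(width > 0 && width <= ACTIVE && ACTIVE % width == 0);

    int foo[N];
    for (int i = 0; i < N; ++i) foo[i] = 1;  /* correct sum is 64 */

    if (use_barrier) {
        for (int stride = 32; stride >= 1; stride /= 2)
            for (int first = 0; first < ACTIVE; first += width)
                run_step(foo, first, width, stride);
    } else {
        for (int first = 0; first < ACTIVE; first += width)
            for (int stride = 32; stride >= 1; stride /= 2)
                run_step(foo, first, width, stride);
    }
    return foo[0];
}
```

Under this model, reduce_sim(32, 0) returns the correct 64, while reduce_sim(16, 0) returns 48 and reduce_sim(8, 0) returns 32, because the first group reads partial sums that the other groups have not written yet. With use_barrier set to 1 the result is 64 for every width, which is why the barriers are needed once you can no longer rely on a 32-wide group.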