OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.
Announcements
This forum covers OpenCL* for CPU only. OpenCL* for GPU questions can be asked in the GPU Compute Software forum. Intel® FPGA SDK for OpenCL™ questions can be asked in the FPGA Intel® High Level Design forum.

How to force SIMD32 compilation?

Vladimir_T_
Beginner
508 Views

Hi,

I have some code that was developed for CUDA and it relies heavily on 32-wide warps.

Is there a way to force the Intel GPU compiler to compile for SIMD32?

Thanks!

5 Replies
Robert_I_Intel
Employee

Hi Vladimir,

There is, but unfortunately it is not exposed in publicly available drivers. SIMD32 is selected automatically only for small kernels, e.g. kernels shorter than ~150 assembly instructions.

I would recommend setting the work group size to 32 - this should give you the correct behavior.

Vladimir_T_
Beginner

Thanks!

My workgroup size has to be 128 or 256 due to heavy shared memory usage, and the kernels are pretty large, with more than 5k instructions. Also, OpenCL for AMD cards uses close to 64 registers (NVIDIA's compiler correctly switches to spilling a few and optimizes them down to 32 registers per thread for maximum occupancy).

Will setting the required workgroup size to 256 or 128 force SIMD32 in this environment?

Robert_I_Intel
Employee

Hi Vladimir,

Setting the work group size to 128 or 256 would ensure correct behavior. With large kernels and heavy shared memory use, you are most likely to get SIMD8 compilation - forcing SIMD32, even if it were publicly available, wouldn't help. In fact, in my experience it would probably harm performance. SIMD8 or SIMD16 compilation should not affect the correctness of the behavior, though.

Let me know if you experience correctness issues after translating to OpenCL.

Vladimir_T_
Beginner

Hm,

the code specifically relies on warp_size being at least 32, so that it does not have to call barriers all the time. It aggregates across the 32 threads of a warp and uses only mem_fence between the relevant reads and writes.

I'd assume that if it compiles with SIMD16 or SIMD8, the warp_size will be 16 or 8, and threads may diverge and cause correctness issues. Is this the case? Do I understand the relationship between warp_size, SIMDXX, and the way work is scheduled correctly?
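
To make the question concrete, here is a minimal host-side sketch in plain C (not OpenCL) of how many hardware threads a CUDA-style group of 32 consecutive work-items would be split across at each SIMD width. It assumes work-items are packed into SIMD groups in consecutive local-ID order, which is an assumption for illustration rather than something the driver guarantees; `simd_group_of` and `hw_threads_per_warp` are hypothetical helper names.

```c
/* ASSUMPTION: work-items are packed into SIMD groups in consecutive
 * local-ID order. Both helpers are hypothetical, for illustration only. */

/* Index of the SIMD group (hardware thread) that runs a work-item. */
static int simd_group_of(int local_id, int simd_width)
{
    return local_id / simd_width;
}

/* Number of distinct hardware threads that cover work-items 0..31,
 * i.e. one CUDA-style 32-wide warp. */
static int hw_threads_per_warp(int simd_width)
{
    const int warp = 32;
    return simd_group_of(warp - 1, simd_width)
         - simd_group_of(0, simd_width) + 1;
}
```

Under that assumption, a 32-wide "warp" stays inside one hardware thread only at SIMD32; at SIMD16 it spans two and at SIMD8 it spans four, and those hardware threads are not guaranteed to advance in lockstep with each other.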

Robert_I_Intel
Employee

Yes, if you have code like the following:

    if (i < 32)
        foo[i] += foo[i + 32];
    if (i < 16)
        foo[i] += foo[i + 16];
    if (i < 8)
        foo[i] += foo[i + 8];
    if (i < 4)
        foo[i] += foo[i + 4];
    if (i < 2)
        foo[i] += foo[i + 2];
    if (i < 1)
        foo[i] += foo[i + 1];

with no barriers, you will need to assume the worst case of SIMD8 and put a couple of barriers in :(. So yes, you may have correctness issues if you don't do that. If you can use OpenCL 2.0 (only on Broadwell chips, though), you can use the work group scan functions for cases like the one above.
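
Why the missing barriers matter can be sketched with a small host-side model in plain C (not OpenCL). It runs the 64-element tree reduction above either with a barrier after every step or with no barriers under one possible worst-case schedule, where each SIMD group finishes all of its steps before the next group starts. The function name `reduce_sim` and the schedule are assumptions for illustration only; real hardware scheduling is not specified here.

```c
#define N 64  /* array length and simulated work-group size */

/* Hypothetical host-side model, not OpenCL code.
 * simd_width : number of work-items that execute in lockstep.
 * use_barrier: nonzero means every work-item finishes step s before
 *              any work-item starts the next step (a barrier per step).
 * Without barriers we model ONE possible worst-case schedule: each SIMD
 * group runs all reduction steps before the next group starts. */
static int reduce_sim(int simd_width, int use_barrier)
{
    int foo[N];
    for (int i = 0; i < N; ++i)
        foo[i] = i + 1;                 /* correct sum: 64 * 65 / 2 = 2080 */

    if (use_barrier) {
        /* Step-major order: a whole step completes before the next one. */
        for (int s = N / 2; s >= 1; s /= 2)
            for (int i = 0; i < s; ++i)
                foo[i] += foo[i + s];
    } else {
        /* Group-major order: one SIMD group runs ahead of the others. */
        for (int g = 0; g < N / simd_width; ++g)
            for (int s = N / 2; s >= 1; s /= 2)
                for (int lane = 0; lane < simd_width; ++lane) {
                    int i = g * simd_width + lane;
                    if (i < s)
                        foo[i] += foo[i + s];
                }
    }
    return foo[0];
}
```

In this model, `reduce_sim(32, 0)` and `reduce_sim(8, 1)` both return the correct sum of 2080, while `reduce_sim(8, 0)` does not: work-items 0..7 read `foo[16..23]` and `foo[8..15]` before the SIMD groups that own those entries have updated them.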
