auto-vectorization question

QFang1 · ‎03-17-2016

Could someone comment on the question I posted at the link below, especially Question #1?

https://forums.khronos.org/showthread.php/13009-Vectorization-on-various-opencl-implementations

thanks

Robert_I_Intel · ‎03-18-2016

Hi,

The Intel OpenCL compiler for Intel(R) Processor Graphics will compile your kernel as SIMD32 (if it is a really small kernel that uses up to 128 bytes of private memory per work item), SIMD16 (a medium kernel that uses between 129 and 256 bytes of private memory per work item), or SIMD8 (a large kernel that uses between 257 and 512 bytes of private memory per work item), meaning that 32, 16, or 8 work items, respectively, will be packed onto a hardware thread. This is the autovectorization that we do. The thing is, in a lot of cases, reading and writing data is much more efficient when you deal with float4s (uint4s, int4s or uchar16s, etc.) - our architecture's sweet spot. In addition, many times you can benefit from more computational density per work item.
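
To illustrate the float4 point, here is a minimal sketch of a kernel in which each work item reads and writes one float4. The kernel name scale4 and its arguments are hypothetical and are not taken from the code under discussion.

```c
// Hypothetical kernel: each work item handles one float4 (16 bytes),
// so the global size is the element count divided by 4.
__kernel void scale4(__global const float4 *in,
                     __global float4 *out,
                     const float k)
{
    const size_t i = get_global_id(0);  // index in float4 units
    out[i] = in[i] * k;                 // one float4 load, one float4 store
}
```

The compiler still packs 8, 16, or 32 such work items onto a hardware thread; the float4 type only changes how much data each work item moves per access.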

I cannot answer for NVIDIA and AMD, since I am not very familiar with their architectures. Maybe someone else can chime in.

QFang1 · ‎03-18-2016

Robert Ioffe (Intel) wrote:
The thing is, in a lot of cases, reading and writing data is much more efficient when you deal with float4s (uint4s, int4s or uchar16s, etc.) - our architecture's sweet spot. In addition, many times you can benefit from more computational density per work item

Hi Robert, thanks for the comment. In fact, the lines that I had to vectorize manually do operate on float4 data structures, but somehow the CPU OCL driver did not recognize that they could be SIMD-vectorized. See:

https://github.com/fangq/mcxcl/commit/4bcfebdf37fb36fba56fd9bb46c12771e21a64b1#diff-3e7bff849d973dfbbbf2ff6591ee8862L216

I even added the fourth component (htime[0].w) to help the Intel OCL compiler recognize and auto-vectorize the function, but it did not seem to help. Running the function with the lines below still gives me the slow speed. I am not sure why the CPU driver fails to vectorize the lines below.

      htime[0].x=fabs(floor(p0[0].x)+(v[0].x>0.f)-p0[0].x);
      htime[0].y=fabs(floor(p0[0].y)+(v[0].y>0.f)-p0[0].y);
      htime[0].z=fabs(floor(p0[0].z)+(v[0].z>0.f)-p0[0].z);
      htime[0].w=fabs(floor(p0[0].w)+(v[0].w>0.f)-p0[0].w);

      htime[0].x=fabs(native_divide(htime[0].x,v[0].x));
      htime[0].y=fabs(native_divide(htime[0].y,v[0].y));
      htime[0].z=fabs(native_divide(htime[0].z,v[0].z));
      htime[0].w=fabs(native_divide(htime[0].w,v[0].w));
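
For comparison, a whole-vector form of the eight lines above might look like the sketch below. It is only an illustration and not necessarily what the linked commit does; the helper name hit_time is hypothetical. One thing to watch for: a scalar comparison such as (v[0].x>0.f) evaluates to 1, but the same comparison on a float4 evaluates to -1 in each true component, so the mask is converted to 1.0f/0.0f explicitly with select().

```c
// Sketch only: float4 equivalent of the per-component lines above.
// hit_time is a hypothetical helper name, not part of the original code.
inline float4 hit_time(const float4 p0, const float4 v)
{
    const float4 zero = (float4)(0.f);
    const float4 one  = (float4)(1.f);

    // v > zero yields an int4 with -1 (all bits set) where true and 0 where false;
    // select() returns 'one' where the mask's sign bit is set, 'zero' elsewhere.
    const float4 ahead = select(zero, one, v > zero);

    float4 h = fabs(floor(p0) + ahead - p0);
    return fabs(native_divide(h, v));
}
```

With this helper, the two groups of assignments reduce to htime[0]=hit_time(p0[0],v[0]);.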