OpenCL* for CPU

auto-vectorization question

QFang1
Novice

Can someone comment on the question I posted at the link below, especially Question #1?

https://forums.khronos.org/showthread.php/13009-Vectorization-on-various-opencl-implementations

Thanks.

Robert_I_Intel
Employee

Hi,

The Intel OpenCL compiler for Intel(R) Processor Graphics compiles your kernel as SIMD32 if it is really small and uses up to 128 bytes of private memory per work item, as SIMD16 if it is a medium kernel using between 129 and 256 bytes of private memory per work item, or as SIMD8 if it is a large kernel using between 257 and 512 bytes of private memory per work item. In other words, 32, 16, or 8 work items are packed onto one hardware thread. This is the auto-vectorization that we do. That said, in a lot of cases reading and writing data is much more efficient when you work with float4s (or uint4s, int4s, uchar16s, etc.), which are our architecture's sweet spot. In addition, you can often benefit from more computational density per work item.
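
As a rough illustration of the thresholds quoted above, here is a minimal sketch in plain C. The function name `simd_width_for_private_bytes` is hypothetical, and the sketch models only the private-memory ranges from this reply; the reply also ties the choice to kernel size, and the actual compiler may weigh other factors.

```c
/* Hypothetical helper mirroring the private-memory thresholds quoted above.
   Returns the SIMD width (work items packed per hardware thread), or 0 when
   the footprint falls outside the ranges given in the reply. */
int simd_width_for_private_bytes(unsigned private_bytes)
{
    if (private_bytes <= 128) return 32; /* really small kernel: SIMD32 */
    if (private_bytes <= 256) return 16; /* medium kernel: SIMD16 */
    if (private_bytes <= 512) return 8;  /* large kernel: SIMD8 */
    return 0;                            /* not covered by the reply */
}
```

For example, under these assumptions a kernel using 200 bytes of private memory per work item would map to SIMD16.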

I cannot answer for NVIDIA and AMD, since I am not very familiar with their architectures. Maybe someone else can chime in.

QFang1
Novice

Robert Ioffe (Intel) wrote:
The thing is, in a lot of cases, reading and writing data is much more efficient when you deal with float4s (uint4s, int4s or uchar16s, etc.) - our architecture's sweet spot. In addition, many times you can benefit from more computational density per work item

Hi Robert, thanks for the comment. In fact, the lines that I had to vectorize manually do operate on float4 data structures, but somehow the CPU OpenCL compiler did not recognize that they could be vectorized. See:

https://github.com/fangq/mcxcl/commit/4bcfebdf37fb36fba56fd9bb46c12771e21a64b1#diff-3e7bff849d973dfb...

I even added the fourth component (htime[0].w) to help the Intel OpenCL compiler recognize and auto-vectorize the function, but it did not seem to help. Running the function with the lines below still gives me the slow speed. I am not sure why the CPU driver fails to vectorize these lines.

      htime[0].x=fabs(floor(p0[0].x)+(v[0].x>0.f)-p0[0].x);
      htime[0].y=fabs(floor(p0[0].y)+(v[0].y>0.f)-p0[0].y);
      htime[0].z=fabs(floor(p0[0].z)+(v[0].z>0.f)-p0[0].z);
      htime[0].w=fabs(floor(p0[0].w)+(v[0].w>0.f)-p0[0].w);

      htime[0].x=fabs(native_divide(htime[0].x,v[0].x));
      htime[0].y=fabs(native_divide(htime[0].y,v[0].y));
      htime[0].z=fabs(native_divide(htime[0].z,v[0].z));
      htime[0].w=fabs(native_divide(htime[0].w,v[0].w));

 
