Can someone comment on my question posted at the link below, especially Question #1?
https://forums.khronos.org/showthread.php/13009-Vectorization-on-various-opencl-implementations
Thanks.
Hi,
The Intel OpenCL compiler for Intel(R) Processor Graphics will compile your kernel SIMD32 (if it is really small and uses up to 128 bytes of private memory per work item), SIMD16 (a medium kernel, with 129 to 256 bytes of private memory per work item), or SIMD8 (a large kernel, with 257 to 512 bytes of private memory per work item), meaning that 32, 16, or 8 work items will be packed onto a single hardware thread. This is the autovectorization that we do. The thing is, in a lot of cases reading and writing data is much more efficient when you deal with float4s (uint4s, int4s, uchar16s, etc.) - our architecture's sweet spot. In addition, you can often benefit from more computational density per work item.
I cannot answer for NVIDIA and AMD, since I am not very familiar with their architectures. Maybe someone else can chime in.
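To illustrate the float4 "sweet spot" described above, here is a minimal hypothetical kernel sketch (not from the thread; the kernel names and parameters are made up for illustration):

```c
// Hypothetical OpenCL C kernels contrasting scalar and float4 access.

// Scalar version: each work item loads and stores 4 bytes.
__kernel void scale_scalar(__global const float *in,
                           __global float *out,
                           float k)
{
    size_t i = get_global_id(0);
    out[i] = k * in[i];
}

// float4 version: each work item moves 16 bytes per load/store and
// does four multiplies, increasing memory efficiency and
// computational density per work item.
__kernel void scale_vec4(__global const float4 *in,
                         __global float4 *out,
                         float k)
{
    size_t i = get_global_id(0);
    out[i] = k * in[i];
}
```

The float4 version would be launched with a global size one quarter of the scalar version's (assuming the buffer length is a multiple of four).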
Robert Ioffe (Intel) wrote:
The thing is, in a lot of cases, reading and writing data is much more efficient when you deal with float4s (uint4s, int4s or uchar16s, etc.) - our architecture's sweet spot. In addition, many times you can benefit from more computational density per work item
Hi Robert, thanks for the comment. In fact, the lines that I had to manually vectorize do operate on float4 data structures, but somehow the CPU OpenCL driver did not recognize that they could be SIMDed; see below.
I even added a fourth component (htime[0].w) to help the Intel OpenCL compiler recognize and auto-vectorize the function, but that did not seem to help. Running the function with the lines below still gives me the slow speed. I am not sure why the CPU driver fails to vectorize them.
htime[0].x = fabs(floor(p0[0].x) + (v[0].x > 0.f) - p0[0].x);
htime[0].y = fabs(floor(p0[0].y) + (v[0].y > 0.f) - p0[0].y);
htime[0].z = fabs(floor(p0[0].z) + (v[0].z > 0.f) - p0[0].z);
htime[0].w = fabs(floor(p0[0].w) + (v[0].w > 0.f) - p0[0].w);
htime[0].x = fabs(native_divide(htime[0].x, v[0].x));
htime[0].y = fabs(native_divide(htime[0].y, v[0].y));
htime[0].z = fabs(native_divide(htime[0].z, v[0].z));
htime[0].w = fabs(native_divide(htime[0].w, v[0].w));
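For what it's worth, those eight scalar statements can be expressed as whole-vector operations, which may help the compiler regardless of autovectorization. One caveat (assuming p0, v, and htime are float4, as in the post): in OpenCL C a relational operator applied to a vector type returns -1 per component when true, so the scalar `+ (v[0].x > 0.f)` (which adds 1) becomes a *subtraction* of the int4 mask:

```c
// Sketch of a manually vectorized equivalent (assumes float4 operands).
int4   up   = v[0] > 0.0f;                       // -1 where v > 0, else 0
float4 step = floor(p0[0]) - convert_float4(up); // floor(p0) + 1 where v > 0
htime[0] = fabs(step - p0[0]);
htime[0] = fabs(native_divide(htime[0], v[0]));
```

Alternatively, `select()` could be used to pick between `floor(p0[0])` and `floor(p0[0]) + 1.0f` per component; the mask-subtraction form above just avoids the extra addition.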
