I am porting the reduction kernel from the AMD APP SDK.
It requires setting the work-group size. When I call
clEnqueueNDRangeKernel with a local_work_size other than 8, it crashes inside TBB in the Intel OpenCL runtime for Intel CPUs. The clEnqueueNDRangeKernel call itself succeeds and launches the kernel.
When I query the device for its maximum work-group size it returns 8192 (I would expect 8 in this case), and the kernel work-group size is 2048. It crashes with both settings.
It works only when local_work_size equals the number of cores (8).
I have an Intel Haswell 4770K.
I use global_size = 4096.
The Intel HD 4600 GPU works fine for all the different sizes, as the spec says it should.
The project is located here:
the file that launches the kernel is located here:
The CPU OpenCL implementation is very unforgiving if you access global or local data out of bounds, and I suspect that this is exactly what is going on. For example, your kernel has the following line:
sdata[tid] += sdata[tid + s];
Here tid is defined as unsigned int tid = get_local_id(0);
and the initial value of s is unsigned int s = localSize >> 1;
So, if your local size is 8, then for the last work-item in a work-group tid is 7 and s is 4, so tid + s is 11, but sdata only has 8 elements.
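To make the arithmetic above concrete, here is a small host-side C sketch (not your kernel, just an emulation of its index math) that reports the highest sdata index the reduction loop would touch. The function name max_sdata_index and the `guarded` flag are my own; the guarded variant assumes the statement is wrapped in `if (tid < s)`, which is one common way to keep the access in bounds and may or may not match your code.

```c
#include <assert.h>
#include <stddef.h>

/* Highest index of sdata read by `sdata[tid] += sdata[tid + s]` over one
 * work-group. `guarded` emulates wrapping the statement in `if (tid < s)`;
 * that guard is an assumed fix, not something taken from the posted kernel. */
size_t max_sdata_index(size_t local_size, int guarded)
{
    /* A halving tree reduction assumes a power-of-two local size. */
    assert(local_size > 0 && (local_size & (local_size - 1)) == 0);

    size_t max_index = 0;
    for (size_t s = local_size >> 1; s > 0; s >>= 1) {
        for (size_t tid = 0; tid < local_size; ++tid) {
            if (guarded && tid >= s)
                continue;               /* this work-item skips the read */
            if (tid + s > max_index)
                max_index = tid + s;
        }
    }
    return max_index;
}
```

For a local size of 8, the unguarded version reaches index 11, while the guarded version never goes past 7, which fits in an 8-element sdata.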
Same problem with the following line:
sdata[tid] = input[stride] + input[stride + 1];
I believe your global memory accesses go out of bounds too. Please size your global and local buffers so that no work-item reads or writes past the end of them. Otherwise, things will crash on the CPU.
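As a rough way to check the sizing before you launch, something like the following host-side C helper could be used. It is only a sketch: I am assuming stride = 2 * get_global_id(0), so that each work-item reads two adjacent input elements, and one sdata slot per work-item. Neither is confirmed by what you posted, and the function names are made up for illustration.

```c
#include <stddef.h>

/* Assumed indexing: stride = 2 * get_global_id(0), so each work-item reads
 * input[stride] and input[stride + 1]. Adjust if the kernel differs. */
size_t required_input_elems(size_t global_size)
{
    return 2 * global_size;
}

/* Assumed layout: one sdata slot per work-item in the work-group. */
size_t required_local_bytes(size_t local_size, size_t elem_size)
{
    return local_size * elem_size;
}

/* Returns 1 if the buffers are large enough for this launch, 0 otherwise. */
int launch_fits(size_t global_size, size_t local_size,
                size_t input_elems, size_t local_bytes, size_t elem_size)
{
    /* OpenCL 1.x requires the global size to be divisible by the local size. */
    if (local_size == 0 || global_size % local_size != 0)
        return 0;
    return input_elems >= required_input_elems(global_size)
        && local_bytes >= required_local_bytes(local_size, elem_size);
}
```

Under these assumptions, global_size = 4096 needs an input buffer of at least 8192 elements, and the local buffer size you pass to clSetKernelArg has to grow with local_work_size rather than stay fixed at 8 elements.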