I have only recently started programming OpenCL on my laptop, which has an IvyBridge GT2 GPU (16 EUs), but the results don't look promising for my use case. To narrow things down, I started with a very basic kernel that traverses a buffer holding 2D image data:
__kernel void image_scaling(__global const char* in, __global char* out, int inputStride)
{
    // One work-item per pixel; inputStride is the row pitch in bytes.
    unsigned int idx = get_global_id(1) * inputStride + get_global_id(0);
    char input_value = in[idx];
    out[idx] = input_value + 50;
}
However, even this simple kernel takes almost 1.8 ms to execute on a 1 MB buffer (execution time alone, according to Code Builder's analyze function), which corresponds to a throughput of roughly 550 MB/s. I know I could use a 2D image object, but that wouldn't work for my full example.
Any ideas why the kernel executes so slowly? What are my options for profiling? I've worked with OpenCL Code Builder, but its analyze function doesn't give me any insight into where the bottlenecks are. Also, is there any way to see the actual code generated for the GPU, along with an estimate of how many cycles a kernel would require?
Thank you in advance, Clemens
The char data type is broken on IvyBridge: you will need to switch to uchar4 or uchar16 to achieve decent performance.
Please see my videos and samples on optimizing simple kernels here: https://software.intel.com/en-us/articles/optimizing-simple-opencl-kernels
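As a rough illustration of the uchar4 suggestion, here is a minimal sketch of the same kernel processing four pixels per work-item. It is not taken from the linked samples. The kernel name image_scaling_uchar4 is made up for this example, and it assumes the image width is a multiple of 4 and that the global size in dimension 0 is width / 4 instead of width.

```c
// Sketch only: same operation as image_scaling, four pixels per work-item.
// Assumptions: width is a multiple of 4, and global size in dimension 0
// is width / 4. The name image_scaling_uchar4 is hypothetical.
__kernel void image_scaling_uchar4(__global const uchar* in,
                                   __global uchar* out,
                                   int inputStride)
{
    // Byte offset of the start of this work-item's row.
    size_t row = (size_t)get_global_id(1) * (size_t)inputStride;
    // Index of the 4-pixel group within the row.
    size_t x4 = get_global_id(0);

    // vload4/vstore4 take an offset in units of 4 elements relative to the
    // pointer and only require uchar alignment, so any row stride works.
    uchar4 v = vload4(x4, in + row);
    v += (uchar4)(50);   // per-component add, wraps modulo 256
    vstore4(v, x4, out + row);
}
```

The same idea extends to uchar16 with vload16/vstore16 and a global size of width / 16, provided the width is a multiple of 16. Whether 4 or 16 is faster on a given device is something you would have to measure.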
For profiling, VTune is your best bet; you can obtain a free evaluation version here: https://software.intel.com/en-us/intel-vtune-amplifier-xe/
There is no way to see the actual assembly yet if you are outside of Intel.