OpenCL* for CPU

Low throughput - how to diagnose?



I've recently started programming OpenCL on my Ivy Bridge GT2 (16 EU) laptop, but the results don't look promising for my use case. To narrow things down, I started with a very basic kernel that traverses a buffer holding 2D image data:

__kernel void image_scaling(__global const char* in, __global char* out, int inputStride)
{
    unsigned int idx = get_global_id(1) * inputStride + get_global_id(0);

    char input_value = in[idx];

    out[idx] = input_value + 50;
}

However, even this simple kernel takes almost 1.8 ms of pure execution time for a 1 MB buffer, according to Code Builder's analyze function, which corresponds to roughly 550 MB/s of throughput. I know I could use a 2D image object, but that wouldn't work for my full example.
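The throughput figure above can be checked with a few lines of C. This is only a sketch of the arithmetic: it assumes 1 MB means 10^6 bytes, and the function names are made up for illustration. It also reports the memory traffic, since the kernel reads every byte once and writes it once, so the bus sees about twice the payload.

```c
#include <stdio.h>

/* Payload throughput in MB/s (assuming 1 MB = 1e6 bytes). */
double throughput_mb_per_s(double bytes, double seconds)
{
    return bytes / seconds / 1e6;
}

/* Memory traffic in MB/s: each byte is read once and written once. */
double traffic_mb_per_s(double bytes, double seconds)
{
    return 2.0 * throughput_mb_per_s(bytes, seconds);
}

/* Print both figures for one measurement. */
void print_report(double bytes, double seconds)
{
    printf("payload: %.1f MB/s, read+write traffic: %.1f MB/s\n",
           throughput_mb_per_s(bytes, seconds),
           traffic_mb_per_s(bytes, seconds));
}
```

Calling print_report(1e6, 1.8e-3) from main reproduces the numbers in the post: about 556 MB/s of payload, or about 1.1 GB/s of combined read and write traffic.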


Any ideas why the kernel executes so slowly? What are my options for profiling? I've worked with OpenCL Code Builder, but its analyze function doesn't give me any insight into where the bottlenecks are. Also, is there any way to see the actual code generated for the GPU, and an estimate of how many cycles a kernel would require?


Thank you in advance, Clemens

1 Reply

Hi Clemens,

The char data type is slow on Ivy Bridge: you will need to switch to uchar4 or uchar16 to achieve decent performance.
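A minimal sketch of what the uchar16 variant might look like, in OpenCL C. It is not measured code. The kernel name is hypothetical, and it assumes the image width is a multiple of 16, that inputStride is given in bytes, and that the host launches the kernel with a global size of (width / 16, height). Unsigned addition wraps modulo 256, which gives the same bit pattern as the original char arithmetic.

```c
// Hypothetical vectorized variant: each work-item processes 16 bytes of a row.
// Assumes width % 16 == 0 and global size = (width / 16, height).
__kernel void image_scaling_vec16(__global const uchar* in,
                                  __global uchar* out,
                                  int inputStride)
{
    // Byte offset of this work-item's 16-byte chunk.
    size_t offset = get_global_id(1) * (size_t)inputStride
                  + get_global_id(0) * 16;

    // vload16/vstore16 only need element (1-byte) alignment,
    // so the stride does not have to be a multiple of 16.
    uchar16 v = vload16(0, in + offset);

    // Add 50 to all 16 lanes at once.
    vstore16(v + (uchar16)(50), 0, out + offset);
}
```

With this layout each work-item issues one 16-byte load and one 16-byte store instead of single-byte accesses. Whether uchar4 or uchar16 works better for a given image size is something to measure rather than assume.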

Please see my videos and samples on optimizing simple kernels here:

VTune is your best bet for profiling; you can obtain a free evaluation copy:

There is currently no way to see the actual assembly if you are outside of Intel.
