OpenCL* for CPU

OpenCL performance vs OpenMP

Have any studies compared OpenCL performance to OpenMP? Specifically, I am interested in the overhead of launching threads with OpenCL, e.g., decomposing the domain into individual work-items, with each thread doing a small job, versus using heavier-weight threads in OpenMP, where the domain is decomposed into subdomains whose number equals the number of cores.
It seems that the OpenCL programming model is targeted more at massively parallel chips, such as GPUs, than at CPUs, which have fewer but more powerful cores.
Can OpenCL be an effective replacement for OpenMP?

While many materials about OpenCL focus on the data-parallel programming model, it's important to keep in mind that OpenCL also supports a task-parallel programming model, which is better suited to compute devices with fewer, relatively strong compute units.

That being said, we believe the Intel OpenCL SDK for the CPU can offer good performance on the CPU for data parallel workloads as well. Probably the best way is for you to try it out yourself to see which of the Intel solutions fits your needs the best.

To answer your specific question, the thread launch time for OpenMP and Intel's OpenCL should be quite similar, but you probably meant to ask about execution overhead rather than the actual thread launch. OpenCL semantics require error checking and other overheads that aren't present in OpenMP. However, these do not scale with the size of the workload; they are constant per call to clEnqueueNDRangeKernel.
You can measure exactly how large the penalty is by comparing a wall-clock measurement with the execution time reported via the clGetEventProfilingInfo API for the event associated with the kernel enqueue (the command queue must be created with CL_QUEUE_PROFILING_ENABLE). Just make sure the wall-clock measurement covers the entire execution, since clEnqueueNDRangeKernel is asynchronous.

For more information, you can also check the optimization guide.
Doron Singer