OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU
This forum covers OpenCL* for CPU only. OpenCL* for GPU questions can be asked in the GPU Compute Software forum. Intel® FPGA SDK for OpenCL™ questions can be asked in the FPGA Intel® High Level Design forum.

Does an OpenCL context go into an "idle state" if it has nothing to do?


Hi all, I have a project that uses OpenCL for computation. The behavior below seems quite strange to me; any help is appreciated!

I can't post my code in detail here, but the pseudo-code is:

// STEP 1: Uploading input from CPU to GPU (using clEnqueueWriteBuffer)

// STEP 2: Running several kernels for computation

// STEP 3: Run some CPU code (probably 100 ms or more)

// STEP 4: Uploading another input from CPU to GPU (using clEnqueueWriteBuffer)

The input size (in bytes) in STEP 1 is the same as in STEP 4. The transfer took ~0.5 ms in STEP 1 but ~10 ms in STEP 4. I also synchronized (clFinish) before and after each step. Any ideas why this could happen? I suspect that the Intel driver puts my OpenCL context/queue into an "idle state" and needs a little time to "wake" things up.

P.S.: STEP 1 and STEP 4 take the same time on NVIDIA and AMD devices.


Hi HuyL,

Thanks for the post. This is a very useful discussion topic.

Proprietary code that developers don't have permission to share isn't suitable for the forum, but generic reproducers are. Based on your pseudocode, a small reproducer may well show similar behavior. Could you prepare one and attach it?

Separately, take advantage of mapped pointers (clEnqueueMapBuffer) on Intel® Graphics Technology, since kernels running on the graphics device can be given access to the same address space as the host. This has been observed to provide a significant speedup over clEnqueueWriteBuffer(...) calls. Link:

Also, forcing synchronous behavior comes with a performance penalty. Consider refactoring the code to be asynchronous where it can be.

The two points above are among the most common performance considerations in OpenCL programming. You may also want to try the GPU hotspots mode of Intel® VTune™ Amplifier, which should give you good feedback on API calls over the life of the program.

The open-source clIntercept Layer project on GitHub could also help you: