OpenCL* for CPU

Constant lag between the operations submit and start

Yossi_C_

We're using OpenCL to process Direct3D11 4K textures on an Intel GPU (HD Graphics family) on a Windows 10 machine. The result of this processing is then read back to the CPU. Although the OpenCL kernel itself runs fast enough for our needs (~4 ms), we see an overhead of about 12-20 ms per frame, and sometimes more.

 
The basic flow is this:

* Create D3D11 device
* Create an OpenCL context, with the D3D11 device passed in the context properties (CL_CONTEXT_D3D11_DEVICE_KHR, alongside CL_CONTEXT_PLATFORM)
* Map D3D11 texture to an OpenCL image (clCreateFromD3D11Texture2DKHR)
* Create OpenCL output buffer with flags CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR
 
Then, for each frame (usually 30 fps):
1. Fill the D3D11 texture (another component does that in DirectX)
2. Unmap the output buffer, so it will be "owned" by the OpenCL device (enqueueUnmapMemObject)
3. Acquire the OpenCL image (clEnqueueAcquireD3D11ObjectsKHR)
4. Execute the kernel (enqueueNDRangeKernel)
5. Release the OpenCL image (clEnqueueReleaseD3D11ObjectsKHR)
6. Map the output buffer (blocking enqueueMapBuffer)
 
Except for item 6, all operations are non-blocking. We do run clFlush() after each command to make sure that it is submitted as soon as possible.

What we see (by measuring with CL_QUEUE_PROFILING_ENABLE) is a constant lag between each operation's submit and start times (CL_PROFILING_COMMAND_SUBMIT, CL_PROFILING_COMMAND_START). The lag is most prominent for clEnqueueAcquireD3D11ObjectsKHR, but it is not negligible for the other operations either.
We also noticed that the lag is larger when dealing with larger textures. We suspect that this indicates some behind-the-scenes texture copies, but we couldn't pinpoint them.
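For reference, the submit-to-start lag described above can be derived from the four profiling timestamps of a command roughly as follows. This is only a minimal sketch: the struct and function names (ProfilingTimestamps, PhaseDurationsMs, phase_durations_ms) are hypothetical, and it assumes the timestamps have already been read with clGetEventProfilingInfo, which reports them in nanoseconds.

```cpp
#include <cstdint>

// Hypothetical container for the four timestamps of one command, in ns.
// In a real application each value would be read with
// clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_QUEUED / _SUBMIT /
// _START / _END, ...); here they are plain integers so the arithmetic
// can be shown on its own.
struct ProfilingTimestamps {
    std::uint64_t queued;
    std::uint64_t submit;
    std::uint64_t start;
    std::uint64_t end;
};

// Durations of each phase in milliseconds (hypothetical helper type).
struct PhaseDurationsMs {
    double queue_to_submit;  // submit - queued
    double submit_to_start;  // start - submit (the lag discussed here)
    double start_to_end;     // end - start (actual execution)
    double total;            // end - queued
};

// Convert raw nanosecond timestamps into per-phase durations in ms.
PhaseDurationsMs phase_durations_ms(const ProfilingTimestamps& t) {
    constexpr double ns_per_ms = 1e6;
    return {
        (t.submit - t.queued) / ns_per_ms,
        (t.start - t.submit) / ns_per_ms,
        (t.end - t.start) / ns_per_ms,
        (t.end - t.queued) / ns_per_ms,
    };
}
```

With timestamps where start is 12,000,000 ns after submit, submit_to_start comes out as 12 ms, which is the kind of value we see for the acquire operation.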
 
As an example, here are typical results for processing 4K textures.
For each operation, we log the total time it took, followed (after the slash) by the durations submit-queued, start-submit, and end-start.
Notice the start-submit value for acquire (the second value after the slash), which shows how long the acquire operation took to start after it was already submitted.
 
unmap   4 /   0   4   0 acquire  12 /   0  12   0 kernel   5 /   0   3   2 release   4 /   0   4   0 map   0 /   0   0   0 total  24 t-k  19
unmap   1 /   0   1   0 acquire  10 /   0  10   0 kernel   5 /   0   3   2 release   4 /   0   4   0 map   0 /   0   0   0 total  22 t-k  17
unmap   1 /   0   1   0 acquire  11 /   0  11   0 kernel   5 /   0   3   2 release   4 /   0   4   0 map   0 /   0   0   0 total  22 t-k  17
unmap   1 /   0   1   0 acquire  10 /   0  10   0 kernel   5 /   0   3   2 release   4 /   0   4   0 map   0 /   0   0   0 total  22 t-k  17
unmap   1 /   0   1   0 acquire  10 /   0  10   0 kernel   5 /   0   3   2 release   4 /   0   4   0 map   0 /   0   0   0 total  22 t-k  17
unmap   0 /   0   0   0 acquire   9 /   0   9   0 kernel   5 /   0   3   2 release   4 /   0   4   0 map   0 /   0   0   0 total  20 t-k  15
unmap   0 /   0   0   0 acquire   9 /   0   9   0 kernel   5 /   0   3   2 release   4 /   0   4   0 map   0 /   0   0   0 total  20 t-k  15
unmap   0 /   0   0   0 acquire   9 /   0   9   0 kernel   5 /   0   3   2 release   4 /   0   4   0 map   0 /   0   0   0 total  20 t-k  15
unmap   0 /   0   0   0 acquire   9 /   0   9   0 kernel   5 /   0   3   2 release   4 /   0   4   0 map   0 /   0   0   0 total  20 t-k  15
unmap   0 /   0   0   0 acquire   9 /   0   9   0 kernel   5 /   0   3   2 release   4 /   0   4   0 map   0 /   0   0   0 total  20 t-k  15
unmap   0 /   0   0   0 acquire   9 /   0   9   0 kernel   5 /   0   3   2 release   4 /   0   4   0 map   0 /   0   0   0 total  19 t-k  14
unmap   0 /   0   0   0 acquire   9 /   0   9   0 kernel   5 /   0   3   2 release   4 /   0   4   0 map   0 /   0   0   0 total  20 t-k  15
unmap   4 /   0   4   0 acquire  13 /   0  13   0 kernel   5 /   0   3   2 release   4 /   0   4   0 map   0 /   0   0   0 total  24 t-k  19
unmap   5 /   0   5   0 acquire   9 /   0   9   0 kernel   5 /   0   3   2 release   4 /   0   4   0 map   0 /   0   0   0 total  21 t-k  16
unmap   0 /   0   0   0 acquire   9 /   0   9   0 kernel   5 /   0   3   2 release   4 /   0   4   0 map   0 /   0   0   0 total  20 t-k  15
unmap   0 /   0   0   0 acquire   9 /   0   9   0 kernel   5 /   0   3   2 release   4 /   0   4   0 map   0 /   0   0   0 total  20 t-k  15
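In case it helps anyone analyze the log lines above, here is a rough sketch of how one such line could be parsed back into per-operation values. The names (OpTiming, FrameLog, parse_frame_line) are hypothetical and not part of our application, and the sketch assumes the exact token layout shown above: an operation name, its total, a slash, then the three durations, with "total" and "t-k" at the end of the line.

```cpp
#include <sstream>
#include <string>
#include <vector>

// One operation's entry in a log line (hypothetical type).
struct OpTiming {
    std::string name;
    int total = 0;
    int submit_minus_queued = 0;
    int start_minus_submit = 0;
    int end_minus_start = 0;
};

// All values parsed from a single per-frame log line (hypothetical type).
struct FrameLog {
    std::vector<OpTiming> ops;
    int total = 0;               // value after "total"
    int total_minus_kernel = 0;  // value after "t-k"
};

// Parse a line such as
//   "unmap 4 / 0 4 0 acquire 12 / 0 12 0 ... total 24 t-k 19".
// Returns false if the line does not match the assumed layout.
bool parse_frame_line(const std::string& line, FrameLog& out) {
    std::istringstream in(line);
    std::string token;
    out = FrameLog{};
    while (in >> token) {
        if (token == "total") {
            if (!(in >> out.total)) return false;
        } else if (token == "t-k") {
            if (!(in >> out.total_minus_kernel)) return false;
        } else {
            OpTiming op;
            op.name = token;
            std::string slash;
            if (!(in >> op.total >> slash >> op.submit_minus_queued
                     >> op.start_minus_submit >> op.end_minus_start)) {
                return false;
            }
            if (slash != "/") return false;
            out.ops.push_back(op);
        }
    }
    return !out.ops.empty();
}
```

Collecting start_minus_submit for the "acquire" entry over many frames would make it easy to see how steady the lag is.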
 
We'll be grateful for any help or hints that you could provide.
Jeffrey_M_Intel1
Employee

Sorry for the delayed reply.  Is your application anything like the example here?

https://software.intel.com/en-us/articles/sharing-surfaces-between-opencl-and-directx-11-on-intel-processor-graphics

One reason for asking is that a small standalone reproducer will help us analyze the issue. This reproducer should be minimal and easy to understand; it usually should not be your entire application.
