Constant lag between the operations submit and start

Yossi_C_ · ‎08-15-2016

We're using OpenCL to process Direct3D11 4K textures on an Intel GPU (HD Graphics family) on a Windows 10 machine. The result of this processing is then read back to the CPU. Although the OpenCL kernel itself runs fast enough for our needs (~4 ms), we experience an overhead of about 12-20 ms per frame, and sometimes more.

The basic flow is this:

* Create D3D11 device

* Create an OpenCL context, with the D3D11 device passed in the context properties (CL_CONTEXT_D3D11_DEVICE_KHR, alongside CL_CONTEXT_PLATFORM)

* Map D3D11 texture to an OpenCL image (clCreateFromD3D11Texture2DKHR)

* Create OpenCL output buffer with flags CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR

Then, for each frame (usually at 30 fps):
1. Fill the D3D11 texture (another component does that in DirectX)

2. Unmap the output buffer, so it will be "owned" by the OpenCL device (enqueueUnmapMemObject)

3. Acquire the OpenCL image (clEnqueueAcquireD3D11ObjectsKHR)

4. Execute the kernel (enqueueNDRangeKernel)

5. Release the OpenCL image (clEnqueueReleaseD3D11ObjectsKHR)

6. Map the output buffer (blocking enqueueMapBuffer)

Except for item 6, all operations are non-blocking. We do call clFlush() after each command to make sure that it is submitted as soon as possible.

What we see (by measuring with CL_QUEUE_PROFILING_ENABLE) is that there's a constant lag between each operation's submit and start times (CL_PROFILING_COMMAND_SUBMIT, CL_PROFILING_COMMAND_START). The lag is most prominent for clEnqueueAcquireD3D11ObjectsKHR, but it is not negligible for the other operations either.
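As a rough illustration of how such a lag can be derived, the sketch below splits one command's four profiling timestamps into queued-to-submit, submit-to-start, and start-to-end durations. It assumes the timestamps were already read with clGetEventProfilingInfo, which reports them in nanoseconds. The CommandTimestamps and CommandBreakdown structs and the breakdown() helper are hypothetical names used only for this example; they are not part of OpenCL or of the application described here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the four device timestamps of one command,
// in nanoseconds, as returned by clGetEventProfilingInfo for
// CL_PROFILING_COMMAND_QUEUED, _SUBMIT, _START and _END.
struct CommandTimestamps {
    std::uint64_t queued_ns;
    std::uint64_t submit_ns;
    std::uint64_t start_ns;
    std::uint64_t end_ns;
};

// Hypothetical result type: the three intervals plus their sum, in ms.
struct CommandBreakdown {
    double queued_to_submit_ms;
    double submit_to_start_ms;  // the lag discussed in this post
    double start_to_end_ms;
    double total_ms;
};

inline double ns_to_ms(std::uint64_t ns) {
    return static_cast<double>(ns) / 1e6;
}

inline CommandBreakdown breakdown(const CommandTimestamps& t) {
    // The timestamps of a completed command are expected to be ordered.
    assert(t.queued_ns <= t.submit_ns);
    assert(t.submit_ns <= t.start_ns);
    assert(t.start_ns <= t.end_ns);
    return {
        ns_to_ms(t.submit_ns - t.queued_ns),
        ns_to_ms(t.start_ns - t.submit_ns),
        ns_to_ms(t.end_ns - t.start_ns),
        ns_to_ms(t.end_ns - t.queued_ns),
    };
}
```

With this helper, a command that was submitted immediately but started 12 ms later would show up as 0 / 12 / 0 with a total of 12 ms.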

We also noticed that the lag is larger when dealing with larger textures. We suspect that this indicates some behind-the-scenes texture copies, but we couldn't pinpoint them.

As an example, here are typical results for processing 4K textures, in ms.
For each operation, we log the total time it took, followed by the three intervals: queued to submit, submit to start, and start to end. The last two values are the frame total and "t-k", the total minus the kernel time.
Notice the submit-to-start value of the acquire operation, which shows how long it took the acquire to start after it was already submitted.

unmap 4 / 0 4 0 acquire 12 / 0 12 0 kernel 5 / 0 3 2 release 4 / 0 4 0 map 0 / 0 0 0 total 24 t-k 19

unmap 1 / 0 1 0 acquire 10 / 0 10 0 kernel 5 / 0 3 2 release 4 / 0 4 0 map 0 / 0 0 0 total 22 t-k 17

unmap 1 / 0 1 0 acquire 11 / 0 11 0 kernel 5 / 0 3 2 release 4 / 0 4 0 map 0 / 0 0 0 total 22 t-k 17

unmap 1 / 0 1 0 acquire 10 / 0 10 0 kernel 5 / 0 3 2 release 4 / 0 4 0 map 0 / 0 0 0 total 22 t-k 17

unmap 0 / 0 0 0 acquire 9 / 0 9 0 kernel 5 / 0 3 2 release 4 / 0 4 0 map 0 / 0 0 0 total 20 t-k 15

unmap 0 / 0 0 0 acquire 9 / 0 9 0 kernel 5 / 0 3 2 release 4 / 0 4 0 map 0 / 0 0 0 total 19 t-k 14

unmap 0 / 0 0 0 acquire 9 / 0 9 0 kernel 5 / 0 3 2 release 4 / 0 4 0 map 0 / 0 0 0 total 20 t-k 15

unmap 4 / 0 4 0 acquire 13 / 0 13 0 kernel 5 / 0 3 2 release 4 / 0 4 0 map 0 / 0 0 0 total 24 t-k 19

unmap 5 / 0 5 0 acquire 9 / 0 9 0 kernel 5 / 0 3 2 release 4 / 0 4 0 map 0 / 0 0 0 total 21 t-k 16

unmap 0 / 0 0 0 acquire 9 / 0 9 0 kernel 5 / 0 3 2 release 4 / 0 4 0 map 0 / 0 0 0 total 20 t-k 15
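To put the ten logged frames above in perspective, here is a small sketch that averages the acquire command's submit-to-start lag and the per-frame "t-k" overhead. The values are copied from the log and assumed to be in milliseconds; the constant names and the mean_ms() helper are hypothetical and exist only for this example.

```cpp
#include <array>
#include <cstddef>
#include <numeric>

// Submit-to-start lag of the acquire command for the ten logged frames.
constexpr std::array<int, 10> kAcquireLagMs = {12, 10, 11, 10, 9, 9, 9, 13, 9, 9};

// Per-frame "t-k" value (frame total minus kernel time) for the same frames.
constexpr std::array<int, 10> kOverheadMs = {19, 17, 17, 17, 15, 14, 15, 19, 16, 15};

// Arithmetic mean of a fixed-size array of millisecond values.
template <std::size_t N>
double mean_ms(const std::array<int, N>& values) {
    const int sum = std::accumulate(values.begin(), values.end(), 0);
    return static_cast<double>(sum) / static_cast<double>(N);
}

// Fraction of the average non-kernel overhead spent waiting for the
// acquire command to start.
inline double acquire_share_of_overhead() {
    return mean_ms(kAcquireLagMs) / mean_ms(kOverheadMs);
}
```

For these frames the acquire lag averages about 10.1 ms against an average non-kernel overhead of about 16.4 ms, so waiting for the acquire to start accounts for roughly 60% of the overhead in this sample.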

We'll be grateful for any help or hints that you could provide.

Jeffrey_M_Intel1 · ‎09-09-2016

Sorry for the delayed reply. Is your application anything like the example here?

https://software.intel.com/en-us/articles/sharing-surfaces-between-opencl-and-directx-11-on-intel-processor-graphics

One reason for asking is that a small standalone reproducer would help us analyze the issue. The reproducer should be minimal and easy to understand, and usually should not be your entire application.