We're using OpenCL to process Direct3D11 4K textures on an Intel GPU (HD Graphics family) on a Windows 10 machine. The result of this processing is then read back to the CPU. Although the OpenCL kernel itself runs fast enough for our needs (~4 ms), we see an overhead of about 12-20 ms per frame, and sometimes more.
* Create D3D11 device
1. Fill the D3D11 texture (another component does that in DirectX)
What we see (by measuring with CL_QUEUE_PROFILING_ENABLE) is that there's a constant lag between the operations' submit and start (CL_PROFILING_COMMAND_SUBMIT,
For each operation, we log the total time it took, followed by values for the duration of submit-queued, start-submit, end-start.
Notice the bold column, which shows how long the acquire operation took to start after it had already been submitted.
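For reference, the per-operation values described above can be derived from the four profiling timestamps roughly as follows. This is only a minimal sketch, not our actual logging code: the struct and function names are hypothetical, the timestamps are assumed to be the nanosecond values returned by clGetEventProfilingInfo for CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_SUBMIT, CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END, and "total" is assumed to mean end minus queued.

```c
#include <stdint.h>

/* Four profiling timestamps for one command, in nanoseconds.
   Assumption: filled elsewhere via clGetEventProfilingInfo. */
typedef struct {
    uint64_t queued;
    uint64_t submit;
    uint64_t start;
    uint64_t end;
} cmd_times_ns;

/* Durations in milliseconds, in the order we log them. */
typedef struct {
    double total_ms;            /* end - queued (assumed definition) */
    double queued_to_submit_ms; /* submit - queued */
    double submit_to_start_ms;  /* start - submit: the lag in question */
    double start_to_end_ms;     /* end - start: actual execution time */
} cmd_breakdown_ms;

/* Convert the difference of two nanosecond timestamps to milliseconds. */
static double ns_diff_to_ms(uint64_t later, uint64_t earlier)
{
    return (double)(later - earlier) / 1.0e6;
}

/* Hypothetical helper: split one command's lifetime into its phases. */
cmd_breakdown_ms breakdown(cmd_times_ns t)
{
    cmd_breakdown_ms b;
    b.total_ms            = ns_diff_to_ms(t.end, t.queued);
    b.queued_to_submit_ms = ns_diff_to_ms(t.submit, t.queued);
    b.submit_to_start_ms  = ns_diff_to_ms(t.start, t.submit);
    b.start_to_end_ms     = ns_diff_to_ms(t.end, t.start);
    return b;
}
```

With this split, a command that executes quickly but starts late shows up as a large submit_to_start_ms next to a small start_to_end_ms.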
Sorry for the delayed reply. Is your application anything like the example here?
One reason for asking is that a small standalone reproducer would help us analyze the issue. The reproducer should be minimal and easy to understand; it usually should not be your entire application.