We're using OpenCL to process Direct3D11 4K textures on an Intel GPU (HD Graphics family) on a Windows 10 machine. The result of this processing is then read back to the CPU. Although the OpenCL kernel itself runs fast enough for our needs (~4 ms), we experience an overhead of about 12-20 ms per frame, and sometimes more.
The basic flow is this:
* Create D3D11 device
* Create an OpenCL device, with the D3D11 device as CL_CONTEXT_PLATFORM
* Map D3D11 texture to an OpenCL image (clCreateFromD3D11Texture2DKHR)
* Create OpenCL output buffer with flags CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR
Then, for each frame (usually at 30 fps):
1. Fill the D3D11 texture (another component does that in DirectX)
2. Unmap the output buffer, so it will be "owned" by the OpenCL device (enqueueUnmapMemObject)
3. Acquire the OpenCL image (clEnqueueAcquireD3D11ObjectsKHR)
4. Execute the kernel (enqueueNDRangeKernel)
5. Release the OpenCL image (clEnqueueReleaseD3D11ObjectsKHR)
6. Map the output buffer (blocking enqueueMapBuffer)
Except for the blocking map in item 6, all operations are non-blocking. We do run clFlush() after each command to make sure that it is submitted as soon as possible.
What we see (by measuring with CL_QUEUE_PROFILING_ENABLE) is that there's a constant lag between each operation's submit and start times (CL_PROFILING_COMMAND_SUBMIT, CL_PROFILING_COMMAND_START). The lag is most prominent for clEnqueueAcquireD3D11ObjectsKHR, but it is not negligible for the other operations either.
We also noticed that the lag is larger when dealing with larger textures. We suspect that this indicates some behind-the-scenes texture copies, but we couldn't pinpoint them.
As an example, here are typical results for processing 4K textures.
For each operation, we log the total time it took (in ms), followed after the slash by the durations of submit-queued, start-submit, and end-start.
Notice the start-submit value for acquire, which shows how long the acquire operation took to start after it was already submitted.
unmap 4 / 0 4 0 acquire 12 / 0 12 0 kernel 5 / 0 3 2 release 4 / 0 4 0 map 0 / 0 0 0 total 24 t-k 19
unmap 1 / 0 1 0 acquire 10 / 0 10 0 kernel 5 / 0 3 2 release 4 / 0 4 0 map 0 / 0 0 0 total 22 t-k 17
unmap 1 / 0 1 0 acquire 11 / 0 11 0 kernel 5 / 0 3 2 release 4 / 0 4 0 map 0 / 0 0 0 total 22 t-k 17
unmap 1 / 0 1 0 acquire 10 / 0 10 0 kernel 5 / 0 3 2 release 4 / 0 4 0 map 0 / 0 0 0 total 22 t-k 17
unmap 1 / 0 1 0 acquire 10 / 0 10 0 kernel 5 / 0 3 2 release 4 / 0 4 0 map 0 / 0 0 0 total 22 t-k 17
unmap 0 / 0 0 0 acquire 9 / 0 9 0 kernel 5 / 0 3 2 release 4 / 0 4 0 map 0 / 0 0 0 total 20 t-k 15
unmap 0 / 0 0 0 acquire 9 / 0 9 0 kernel 5 / 0 3 2 release 4 / 0 4 0 map 0 / 0 0 0 total 20 t-k 15
unmap 0 / 0 0 0 acquire 9 / 0 9 0 kernel 5 / 0 3 2 release 4 / 0 4 0 map 0 / 0 0 0 total 20 t-k 15
unmap 0 / 0 0 0 acquire 9 / 0 9 0 kernel 5 / 0 3 2 release 4 / 0 4 0 map 0 / 0 0 0 total 20 t-k 15
unmap 0 / 0 0 0 acquire 9 / 0 9 0 kernel 5 / 0 3 2 release 4 / 0 4 0 map 0 / 0 0 0 total 20 t-k 15
unmap 0 / 0 0 0 acquire 9 / 0 9 0 kernel 5 / 0 3 2 release 4 / 0 4 0 map 0 / 0 0 0 total 19 t-k 14
unmap 0 / 0 0 0 acquire 9 / 0 9 0 kernel 5 / 0 3 2 release 4 / 0 4 0 map 0 / 0 0 0 total 20 t-k 15
unmap 4 / 0 4 0 acquire 13 / 0 13 0 kernel 5 / 0 3 2 release 4 / 0 4 0 map 0 / 0 0 0 total 24 t-k 19
unmap 5 / 0 5 0 acquire 9 / 0 9 0 kernel 5 / 0 3 2 release 4 / 0 4 0 map 0 / 0 0 0 total 21 t-k 16
unmap 0 / 0 0 0 acquire 9 / 0 9 0 kernel 5 / 0 3 2 release 4 / 0 4 0 map 0 / 0 0 0 total 20 t-k 15
unmap 0 / 0 0 0 acquire 9 / 0 9 0 kernel 5 / 0 3 2 release 4 / 0 4 0 map 0 / 0 0 0 total 20 t-k 15
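For readers who want to reproduce this kind of breakdown, here is a minimal sketch of how the three per-operation intervals can be derived from an event's four profiling timestamps. It is not our actual logging code: the `EventTimes` and `EventBreakdown` structs and the `breakdown` helper are hypothetical names, and the timestamps are assumed to be the nanosecond values that clGetEventProfilingInfo returns for CL_PROFILING_COMMAND_QUEUED, _SUBMIT, _START and _END.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical container for the four profiling timestamps of one event,
// in nanoseconds (the unit OpenCL profiling counters use).
struct EventTimes {
    std::uint64_t queued;
    std::uint64_t submit;
    std::uint64_t start;
    std::uint64_t end;
};

// Hypothetical result type: all durations in milliseconds.
struct EventBreakdown {
    double queued_to_submit;  // host-side time before the command is submitted
    double submit_to_start;   // the "lag" discussed above
    double start_to_end;      // actual execution time on the device
    double total;             // queued to end
};

// Difference of two timestamps in milliseconds. The assert guards against
// unsigned wrap-around if the timestamps are passed in the wrong order.
inline double ns_to_ms(std::uint64_t later, std::uint64_t earlier) {
    assert(later >= earlier);
    return static_cast<double>(later - earlier) / 1.0e6;
}

inline EventBreakdown breakdown(const EventTimes& t) {
    return EventBreakdown{
        ns_to_ms(t.submit, t.queued),
        ns_to_ms(t.start, t.submit),
        ns_to_ms(t.end, t.start),
        ns_to_ms(t.end, t.queued),
    };
}

// Prints one entry in the same "name total / a b c" layout as the log above.
inline void print_breakdown(const char* name, const EventBreakdown& b) {
    std::printf("%s %.0f / %.0f %.0f %.0f\n", name, b.total,
                b.queued_to_submit, b.submit_to_start, b.start_to_end);
}
```

In a real application, each `EventTimes` would be filled with four clGetEventProfilingInfo calls on the event returned by the corresponding enqueue call, once that event has completed. The command queue must have been created with CL_QUEUE_PROFILING_ENABLE.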
We'll be grateful for any help or hints that you could provide.
1 Reply
Sorry for the delayed reply. Is your application anything like the example here?
One reason for asking is that a small standalone reproducer would help us analyze the issue. The reproducer should be minimal and easy to understand, and it usually does not need to be your entire application.