We're using OpenCL to process Direct3D 11 4K textures on an Intel GPU (HD Graphics family) on a Windows 10 machine. The result of this processing is then read back to the CPU. Although the OpenCL kernel itself runs fast enough for our needs (~4 ms), we see an additional overhead of about 12-20 ms per frame, and sometimes more.
Sorry for the delayed reply. Is your application anything like the example here?
One reason for asking is that a small standalone reproducer would help us analyze the problem. The reproducer should be minimal and easy to understand, and usually should not be your entire application.
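If it helps, here is a rough sketch of a timing harness that such a reproducer could be built around. It only shows how to time each stage of a frame separately so the 12-20 ms can be attributed to a specific step. The names `time_ms`, `StageTiming`, `profile_frames`, and `print_report` are hypothetical helpers, not part of any SDK, and the stage list in the comment is an assumption about your pipeline. Each stage callable has to block until its work is done (for example by ending with `clFinish`), otherwise the cost shows up in a later stage.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Wall-clock time of one blocking call, in milliseconds.
inline double time_ms(const std::function<void()>& stage) {
    const auto start = std::chrono::steady_clock::now();
    stage();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Accumulated time for one named stage (hypothetical helper type).
struct StageTiming {
    std::string name;
    double total_ms;
};

using Stage = std::pair<std::string, std::function<void()>>;

// Runs every stage once per frame, in order, and sums the time per stage.
// Assumed stages for this thread (each one followed by clFinish):
//   "acquire"  -> clEnqueueAcquireD3D11ObjectsKHR
//   "kernel"   -> clEnqueueNDRangeKernel
//   "readback" -> clEnqueueReadImage
//   "release"  -> clEnqueueReleaseD3D11ObjectsKHR
inline std::vector<StageTiming> profile_frames(const std::vector<Stage>& stages,
                                               int frames) {
    assert(frames > 0);
    std::vector<StageTiming> timings;
    for (const auto& s : stages) {
        timings.push_back({s.first, 0.0});
    }
    for (int f = 0; f < frames; ++f) {
        for (std::size_t i = 0; i < stages.size(); ++i) {
            timings[i].total_ms += time_ms(stages[i].second);
        }
    }
    return timings;
}

// Prints the average time per frame for each stage.
inline void print_report(const std::vector<StageTiming>& timings, int frames) {
    assert(frames > 0);
    for (const auto& t : timings) {
        std::printf("%-10s %8.3f ms/frame\n", t.name.c_str(), t.total_ms / frames);
    }
}
```

Calling `profile_frames` with one lambda per stage and passing the result to `print_report` gives a per-stage average. Posting that breakdown along with the reproducer would show whether the time goes into acquiring the texture, the readback, or the synchronization around them.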