How can I reduce start latencies with OpenCL on the GPU?

tony_w_ — Fri, 20 Jan 2017 09:55:40 GMT

I'm evaluating an Intel platform for an embedded real-time processor in our systems. Our application uses OpenCL to process incoming data on a very short cycle in real time. It is critical that the system keeps up with the input data stream. Latency between input and output is also critical, so we cannot batch up data and process it in larger quantities. For these reasons, start latency for tasks on the OpenCL command queue is as critical as kernel processing speed.

One processing cycle looks something like this. Steps 7 to 11 are all using events to trigger the next step.

1. Map input buffer
2. Queue unmap input buffer (to be triggered by a user event)
3. Queue kernels
4. Queue map output buffer
5. Copy data in
6. Trigger unmap
7. Unmap
8. Kernel 1
9. Kernel 2
10. Kernel 3
11. Map output buffer
12. Copy data out

This sequence works very well with OpenCL on a different (non-Intel) processor but seems to suffer longer start latency than expected on this processor. Examples of the latency (in microseconds) between some of these steps are shown below.

end 7 (unmap) to start 8 (kernel 1): 700 - 1400
end 8 (kernel 1) to start 9 (kernel 2): 400 - 900
end 9 (kernel 2) to start 10 (kernel 3): 400 - 700
end 10 (kernel 3) to start 11 (map): 300 - 600

These times are huge for our system which operates on a short real-time cycle.
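The post does not say how these gaps were measured. One common way is to read the `CL_PROFILING_COMMAND_START` and `CL_PROFILING_COMMAND_END` values from `clGetEventProfilingInfo`, which are reported in nanoseconds, and subtract the end of one command from the start of the next. The sketch below shows only that arithmetic in plain C; the `cmd_times` struct and the `gap_us` and `report_gaps` helpers are hypothetical names, and the timestamps are assumed to have been collected already.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One profiled command: start and end timestamps in nanoseconds,
 * the unit used by OpenCL event profiling. Hypothetical type. */
typedef struct {
    const char *name;
    uint64_t start_ns;
    uint64_t end_ns;
} cmd_times;

/* Gap in microseconds between the end of `prev` and the start of `next`.
 * Returns 0 if the commands overlap or are back to back. */
double gap_us(const cmd_times *prev, const cmd_times *next)
{
    if (next->start_ns <= prev->end_ns)
        return 0.0;
    return (double)(next->start_ns - prev->end_ns) / 1000.0;
}

/* Print the gap for each consecutive pair of commands and
 * return the largest gap seen, in microseconds. */
double report_gaps(const cmd_times *c, size_t n)
{
    double worst = 0.0;
    for (size_t i = 1; i < n; ++i) {
        double g = gap_us(&c[i - 1], &c[i]);
        printf("end %s to start %s: %.1f us\n", c[i - 1].name, c[i].name, g);
        if (g > worst)
            worst = g;
    }
    return worst;
}
```

Filling an array of `cmd_times` in submission order (unmap, kernel 1, kernel 2, kernel 3, map) and passing it to `report_gaps` would produce one line per delta in the same form as the list above.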

Does anyone have some insight into what might be causing this and how we could reduce the times? Some specifics of the system are given below in case they might help.

Thanks, Tony

Linux: Yocto from the Apollo Lake BSP release gold, build core-image-sato-sdk, installed on onboard eMMC.

Hardware: Oxbow Hill Rev B CRB with Intel Atom E3950 and 8GB DDR3 RAM

OpenCL: installed user space drivers from SRB4 https://software.intel.com/file/533571/download

Michal_M_Intel — Fri, 20 Jan 2017 16:38:25 GMT

Hello Tony,

Could you provide a reproducer for the API sequence?

The latency seems too high (especially deltas #2, #3 and #4), so a better understanding of the exact sequence of API calls and events, and of the resource setup, would help in this case.

allanmac1 — Fri, 20 Jan 2017 16:50:11 GMT

Tony, this Intel presentation might be relevant to your work:

http://www.iwocl.org/wp-content/uploads/iwocl-2016-gpu-daemon.pdf

GPU daemon – Road to Zero Cost Submission

Michal Mrozek and Zbigniew Zdanowicz (Intel)

One of the biggest problems of efficient OpenCL usage is submission latency. The time needed to pass through the driver stack is so significant that it limits the use of OpenCL on GPU in applications requiring low latency. In this presentation we present a novel approach utilizing new features of OpenCL 2.0, Fine-Grained SVM and device enqueue_kernel, that allows completely new usage models. We will present the idea of a GPU daemon that operates using different modes (polling, enqueue_kernel and monitored_fence) and offers various levels of flexibility for the end user application. Part of the presentation will show the data and code samples for each approach and will also compare each mode with the traditional submission model.