I'm evaluating an Intel platform for an embedded real-time processor in our systems. Our application uses OpenCL to process incoming data on a very short cycle in real time. It is critical that the system keeps up with the input data stream. Latency between input and output is also critical, so we cannot batch up data and process it in larger quantities. For these reasons, start latency for tasks on the OpenCL command queue is as critical as kernel processing speed.
One processing cycle looks something like this. Steps 7 to 11 each use events to trigger the next step.
1. Map input buffer
2. Queue unmap input buffer (to be triggered by a user event)
3. Queue kernels
4. Queue map output buffer
5. Copy data in
6. Trigger unmap
7. Unmap
8. Kernel 1
9. Kernel 2
10. Kernel 3
11. Map output buffer
12. Copy data out
This sequence works very well with OpenCL on a different (non-Intel) processor but seems to suffer longer start latency than expected on this processor. Examples of the latency (in microseconds) between some of these steps are shown below.
- End of 7 (unmap) to start of 8 (kernel 1): 700 - 1400
- End of 8 (kernel 1) to start of 9 (kernel 2): 400 - 900
- End of 9 (kernel 2) to start of 10 (kernel 3): 400 - 700
- End of 10 (kernel 3) to start of 11 (map): 300 - 600
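For anyone wanting to reproduce figures like these, here is a minimal sketch of the arithmetic. It assumes the gaps are derived from OpenCL event profiling timestamps (`CL_PROFILING_COMMAND_START` and `CL_PROFILING_COMMAND_END`, read with `clGetEventProfilingInfo` on a queue created with `CL_QUEUE_PROFILING_ENABLE`), which report nanoseconds. The post does not say how the numbers were actually measured, so that is an assumption. The `CommandTimes` type, the `start_gaps_us` helper and the sample timestamps are hypothetical and invented purely for illustration.

```python
from typing import List, NamedTuple, Tuple


class CommandTimes(NamedTuple):
    """Profiling timestamps for one queued command, in nanoseconds."""
    name: str
    start_ns: int  # value assumed to come from CL_PROFILING_COMMAND_START
    end_ns: int    # value assumed to come from CL_PROFILING_COMMAND_END


def start_gaps_us(commands: List[CommandTimes]) -> List[Tuple[str, float]]:
    """Return the gap in microseconds between the end of each command
    and the start of the command that follows it."""
    gaps = []
    for prev, nxt in zip(commands, commands[1:]):
        gap_us = (nxt.start_ns - prev.end_ns) / 1000.0  # ns -> us
        gaps.append((f"{prev.name} -> {nxt.name}", gap_us))
    return gaps


# Hypothetical timestamps, NOT measured data from the system in this thread.
sample = [
    CommandTimes("unmap", 0, 50_000),
    CommandTimes("kernel 1", 750_000, 900_000),
    CommandTimes("kernel 2", 1_300_000, 1_450_000),
    CommandTimes("kernel 3", 1_850_000, 2_000_000),
    CommandTimes("map", 2_300_000, 2_320_000),
]

for label, gap in start_gaps_us(sample):
    print(f"{label}: {gap:.0f} us")
```

Only the time between one command ending and the next starting is counted, so kernel execution time is excluded and the result reflects start latency alone.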
These times are huge for our system which operates on a short real-time cycle.
Does anyone have some insight into what might be causing this and how we could reduce the times? Some specifics of the system are given below in case they might help.
Thanks, Tony
Linux: Yocto from the Apollo Lake BSP release gold, build core-image-sato-sdk, installed on onboard eMMC.
Hardware: Oxbow Hill Rev B CRB with Intel Atom E3950 and 8GB DDR3 RAM
OpenCL: installed user space drivers from SRB4 https://software.intel.
Hello Tony,
Could you provide a reproducer for the API sequence?
The latency does seem too high (especially deltas #2, #3 and #4), so a better understanding of the exact sequence of API calls and events, and of the resource setup, would help in this case.
Tony, this Intel presentation might be relevant to your work:
http://www.iwocl.org/wp-content/uploads/iwocl-2016-gpu-daemon.pdf
GPU daemon – Road to Zero Cost Submission
Michal Mrozek and Zbigniew Zdanowicz (Intel)
One of the biggest problems for efficient OpenCL usage is submission latency. The time needed to pass through the driver stack is so significant that it limits the use of OpenCL on GPUs in applications requiring low latency. In this presentation we present a novel approach utilizing new features of OpenCL 2.0: Fine-Grained SVM and device enqueue_kernel, which allow completely new usage models. We will present the idea of a GPU daemon that operates using different modes (polling, enqueue_kernel and monitored_fence) and offers various levels of flexibility for the end user application. Part of the presentation will show the data & code samples for each approach and will also compare each mode with the traditional submission model.
