OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.

How can I reduce start latencies with OpenCL on the GPU?

tony_w_
Beginner

I'm evaluating an Intel platform as an embedded real-time processor for our systems. Our application uses OpenCL to process incoming data on a very short cycle in real time. It is critical that the system keeps up with the input data stream. Latency between input and output is also critical, so we cannot batch up data and process it in larger quantities. For these reasons, start latency for tasks on the OpenCL command queue is as critical as kernel processing speed.

One processing cycle looks something like this. Steps 7 to 11 each use events to trigger the next step.

  1. Map input buffer
  2. Queue unmap input buffer (to be triggered by a user event)
  3. Queue kernels
  4. Queue map output buffer
  5. Copy data in
  6. Trigger unmap
  7. Unmap
  8. Kernel 1
  9. Kernel 2
  10. Kernel 3
  11. Map output buffer
  12. Copy data out

This sequence works very well with OpenCL on a different (non-Intel) processor but seems to suffer longer start latency than expected on this processor. Examples of the latency (in microseconds) between some of these steps are shown below.

  • end 7 (unmap) to start 8 (kernel 1)    700 - 1400
  • end 8 (kernel 1) to start 9 (kernel 2)   400 - 900
  • end 9 (kernel 2) to start 10 (kernel 3)    400 - 700
  • end 10 (kernel 3) to start 11 (map)    300 - 600
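For anyone trying to reproduce figures like these, here is a minimal sketch of how such gaps could be computed. It assumes each command's start and end timestamps have already been read from OpenCL event profiling (`clGetEventProfilingInfo` with `CL_PROFILING_COMMAND_START` and `CL_PROFILING_COMMAND_END`, which report nanoseconds); the original post does not say how the numbers were measured. The `EventTimes` type, the `start_gaps_us` helper and the timestamps are hypothetical, for illustration only.

```python
from typing import List, NamedTuple, Tuple


class EventTimes(NamedTuple):
    """Profiling timestamps for one queued command (hypothetical container)."""
    name: str
    start_ns: int  # assumed to come from CL_PROFILING_COMMAND_START
    end_ns: int    # assumed to come from CL_PROFILING_COMMAND_END


def start_gaps_us(events: List[EventTimes]) -> List[Tuple[str, str, float]]:
    """Return (previous, next, gap in microseconds) for consecutive commands.

    The gap is the time from the end of one command to the start of the
    next, i.e. the start latency discussed in this thread.
    """
    gaps = []
    for prev, nxt in zip(events, events[1:]):
        gap_us = (nxt.start_ns - prev.end_ns) / 1000.0  # ns -> us
        gaps.append((prev.name, nxt.name, gap_us))
    return gaps


# Made-up timestamps, not measurements from the system described above.
chain = [
    EventTimes("unmap", 0, 50_000),
    EventTimes("kernel 1", 750_000, 900_000),
    EventTimes("kernel 2", 1_300_000, 1_450_000),
]

for prev_name, next_name, gap in start_gaps_us(chain):
    print(f"end {prev_name} -> start {next_name}: {gap:.0f} us")
```

Measuring from end-of-previous to start-of-next isolates dispatch overhead from kernel execution time, which is the quantity in question here.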

These times are huge for our system which operates on a short real-time cycle. 
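To put a number on "huge", the four reported gaps can be added up. This is only a rough sketch: it assumes the gaps occur back to back within one cycle and that the low and high ends of each range can coincide, which the figures above do not confirm.

```python
# Reported start-latency ranges in microseconds (low, high), taken from the
# list above.
gaps_us = {
    "unmap -> kernel 1": (700, 1400),
    "kernel 1 -> kernel 2": (400, 900),
    "kernel 2 -> kernel 3": (400, 700),
    "kernel 3 -> map": (300, 600),
}

# Sum the low ends and the high ends separately to bound the total time
# per cycle spent waiting for the next command to start.
best_case_us = sum(low for low, _ in gaps_us.values())
worst_case_us = sum(high for _, high in gaps_us.values())

print(f"dispatch overhead per cycle: {best_case_us} to {worst_case_us} us")
```

Under those assumptions, roughly 1.8 to 3.6 ms of every cycle goes to start latency alone, before any kernel time is counted.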

Does anyone have some insight into what might be causing this and how we could reduce the times? Some specifics of the system are given below in case they might help.

Thanks, Tony

Linux: Yocto from the Apollo Lake BSP release gold, build core-image-sato-sdk, installed on onboard eMMC.

Hardware: Oxbow Hill Rev B CRB with Intel Atom E3950 and 8GB DDR3 RAM

OpenCL: installed user space drivers from SRB4 https://software.intel.com/file/533571/download

2 Replies
Michal_M_Intel
Employee

Hello Tony,

Could you provide a reproducer for the API sequence?

The latency seems too high (especially deltas #2, #3 and #4), so a better understanding of the exact sequence of API calls and events, and of the resource setup, would help in this case.

 

 

 

allanmac1
Beginner

Tony, this Intel presentation might be relevant to your work:

http://www.iwocl.org/wp-content/uploads/iwocl-2016-gpu-daemon.pdf

 

GPU daemon – Road to Zero Cost Submission

Michal Mrozek and Zbigniew Zdanowicz (Intel)

 

One of the biggest obstacles to using OpenCL efficiently is submission latency. The time needed to pass through the driver stack is so significant that it limits the use of OpenCL on the GPU in applications requiring low latency. In this presentation we present a novel approach that uses new features of OpenCL 2.0, fine-grained SVM and device-side enqueue_kernel, to allow completely new usage models. We will present the idea of a GPU daemon that operates in different modes (polling, enqueue_kernel and monitored_fence) and offers various levels of flexibility for the end-user application. Part of the presentation will show data and code samples for each approach and compare each mode with the traditional submission model.
