OpenCL* for CPU

OpenCL stall on Apollo Lake GPU

tony_w_
Beginner
Summary
When I run my app and select the GPU OpenCL device, the feeder thread stalls inside a blocking call to clEnqueueMapBuffer(). 
 
Preamble
Build: Yocto from the Apollo Lake BSP gold release
Hardware: Oxbow Hill Rev B CRB with Intel Atom E3950 and 4GB DDR3 RAM (one SODIMM)
Build: core-image-sato-sdk
Installed on the onboard eMMC.
OpenCL: installed user space drivers from SRB4 https://software.intel.com/file/533571/download
 
I'm currently evaluating the Apollo Lake platform, specifically the E3950, as a faster alternative for running our embedded application. The application already runs on less powerful ARM-based Linux systems with Mali GPUs using OpenCL 1.2, so for this evaluation I need OpenCL 1.2 or later.
 
To verify the OpenCL installation I built and ran the Intel demo apps CapsBasic and Bitonic Sort. CapsBasic sees two devices, CPU and GPU, and Bitonic Sort runs its kernels correctly on both.
 
The issue
Simply put, the application has:
  • thread 1 (feeder): runs a loop that feeds data into OpenCL and queues kernels
  • thread 2 (consumer): waits for results and reads output data
  • an OpenCL host command queue with out-of-order execution enabled
When I run my app and select the GPU OpenCL device, the feeder thread stalls inside a blocking call to clEnqueueMapBuffer(). At this point only one thing has been queued on the command queue: a buffer unmap command for a different buffer. This unmap is waiting for an OpenCL event that will indicate data ready to be processed.
 
When I run my app and select the CPU OpenCL device, it works perfectly.
 
Does anyone have any ideas on
  1. what might be causing this?
  2. how to debug this on the Yocto platform?
I'm now working on a short reproducer that I can publish here.
 
Thanks,
 
Tony
5 Replies
tony_w_
Beginner

I have attached a reproducer for this issue: the source code and the text output it produces. Compiled with gcc gpu_issue.c -o gpu_issue -L ./opt/intel/opencl -lOpenCL

Note that there is no output after the call to map buffer 2. If I modify the code to select a CPU device then the call to map buffer 2 succeeds.

 

Michal_M_Intel
Employee

Thank you for your report. I can confirm this is a GPU driver problem.

We are looking into possible solutions, so it is difficult to provide a timeline for the fix at the moment.

In the meantime, if you could provide more information about what you want to accomplish, I may be able to suggest another solution for your use case.

tony_w_
Beginner

Thanks for the information and the offer to help us find a way to work around the issue. Below is an outline of the processing constraints we need to satisfy.

Our system processes a real-time data stream on a very short time cycle (in the millisecond range), so low-latency processing is as important as raw speed. To allow for varying processing latency we use multiple input and output buffers in a cyclic fashion. Also, with the OpenCL implementation on another platform (ARM Mali) we found we could reduce latency by queueing tasks ahead and using the out-of-order queueing feature.

Would you be able to give us more information about the issue and what we must avoid doing?
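
For readers unfamiliar with the cyclic buffer scheme mentioned in this post, here is a minimal sketch in plain C of how slot indices can rotate through a fixed pool. It is not taken from the application or the attached reproducer: the names `buffer_ring`, `ring_acquire`, `ring_release`, and the pool size of 4 are all hypothetical, and the sketch is not thread-safe, so a real feeder/consumer pair would need a lock or atomics around it.

```c
#include <stddef.h>

#define POOL_SIZE 4  /* hypothetical number of buffers in the cycle */

/* Hypothetical ring of buffer slots: the feeder claims slots at `head`,
 * the consumer releases them at `tail`. Not thread-safe as written. */
typedef struct {
    size_t head;    /* next slot the feeder will fill */
    size_t tail;    /* oldest slot still in flight */
    size_t in_use;  /* number of slots currently in flight */
} buffer_ring;

/* Claim the next slot for the feeder; returns -1 when all are in flight. */
static int ring_acquire(buffer_ring *r)
{
    if (r->in_use == POOL_SIZE)
        return -1;
    int slot = (int)r->head;
    r->head = (r->head + 1) % POOL_SIZE;
    r->in_use++;
    return slot;
}

/* Release the oldest in-flight slot; returns -1 when nothing is in flight. */
static int ring_release(buffer_ring *r)
{
    if (r->in_use == 0)
        return -1;
    int slot = (int)r->tail;
    r->tail = (r->tail + 1) % POOL_SIZE;
    r->in_use--;
    return slot;
}
```

In a setup like the one described, each slot index would presumably correspond to one input/output buffer pair, so the feeder can queue work ahead while earlier slots are still being processed.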

 

Michal_M_Intel
Employee

What is happening in the code is that clEnqueueMapBuffer is called with blocking_map set to CL_TRUE on an out-of-order queue.

There are no input events, so to the driver this means: map this buffer for me now, even if it may be in use by the GPU. Please confirm that this is expected.

If you want such access to the buffer storage and synchronization is not needed, you can create another out-of-order queue on which this MapBuffer operation will actually happen. Currently the driver improperly waits for the blocked unmap operation to complete before servicing the MapBuffer call; this wait shouldn't be present on an out-of-order queue.

If you want to actually synchronize on the previous unmap call, then the code should use events.

tony_w_
Beginner

Thanks Michal. Yes, the "map now" behaviour is intended. Now that I understand what the issue is, I have been able to reorder a couple of things to avoid the problem, so our application now runs on the GPU. Unfortunately it is not fast enough yet, but I'll open a new topic to ask for help with that.

 
