OpenCL stall on Apollo Lake GPU

tony_w_ · ‎01-17-2017

Summary

When I run my app and select the GPU OpenCL device, the feeder thread stalls inside a blocking call to clEnqueueMapBuffer().

Preamble

Build: Yocto from the Apollo Lake BSP release gold

Hardware: Oxbow Hill Rev B CRB with Intel Atom E3950 and 4GB DDR3 RAM (one SODIMM)

Build: core-image-sato-sdk

Installed on the onboard eMMC.

OpenCL: installed user space drivers from SRB4 https://software.intel.com/file/533571/download

I'm currently evaluating the Apollo Lake platform as a candidate to run our embedded application. We already have this application running on less powerful ARM based Linux systems with Mali GPU using OpenCL 1.2. We're now evaluating the E3950 as a faster alternative. To evaluate the application I need OpenCL 1.2 or later.

To verify the OpenCL installation I have built and run the Intel demo apps CapsBasic and Bitonic Sort. CapsBasic sees two devices, CPU and GPU, and Bitonic Sort runs its kernels correctly on both.

The issue

Simply put, the application has

thread 1 (feeder): has a loop that feeds data into OpenCL and queues kernels
thread 2 (consumer): waits for results and reads output data.
an OpenCL Host command queue with out-of-order execution enabled

When I run my app and select the GPU OpenCL device, the feeder thread stalls inside a blocking call to clEnqueueMapBuffer(). At this point only one thing has been queued on the command queue: a buffer unmap command for a different buffer. This unmap is waiting for an OpenCL event that will indicate data ready to be processed.

When I run my app and select the CPU OpenCL device, it works perfectly.

Does anyone have any ideas on

what might be causing this?
how to debug this on the Yocto platform?

I'm now working on a short reproducer that I can publish here.

Thanks,

Tony

tony_w_ · ‎01-18-2017

I have attached a reproducer for this issue and the text output it produces. Attached: source code, output text from the program. Compiled with: gcc gpu_issue.c -o gpu_issue -L ./opt/intel/opencl -lOpenCL

Note that there is no output after the call to map buffer 2. If I modify the code to select a CPU device then the call to map buffer 2 succeeds.

Michal_M_Intel · ‎01-19-2017

Thank you for your report, I can confirm this is a GPU driver problem.

We are looking into possible solutions for it, so it may be difficult to provide a timeline for the fix at the moment.

In the meantime, if you could provide more information about what you want to accomplish, I may be able to suggest another solution for your use case.

tony_w_ · ‎01-19-2017

Thanks for the information and the offer to help us find a way to work around the issue. Below is an outline of the processing constraints we need to satisfy.

Our system processes a real-time data stream on a very short time cycle (in the millisecond range), so low-latency processing is as important as raw speed. To allow for varying processing latency we use multiple input and output buffers in a cyclic fashion. Also, on the OpenCL implementation used on another platform (ARM Mali) we found we could reduce latency by queueing tasks ahead and using the out-of-order queueing feature.

Would you be able to give us more information about the issue and what we must avoid doing?
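For readers unfamiliar with the cyclic buffer scheme mentioned above, here is a minimal sketch in plain C of the bookkeeping involved. It is not the application's code: buffer_ring, ring_acquire, ring_release, and the pool size RING_SLOTS are hypothetical names, and the sketch assumes a single thread (a feeder and a consumer on separate threads would need a mutex or atomics around these calls).

```c
#include <assert.h>
#include <stddef.h>

#define RING_SLOTS 4  /* hypothetical pool size, chosen for illustration */

/* Hypothetical bookkeeping for a cyclic pool of buffers. In a real
 * application each slot index would select one OpenCL buffer. */
typedef struct {
    size_t head;       /* next slot the feeder will fill */
    size_t tail;       /* oldest slot the consumer has not drained yet */
    size_t in_flight;  /* slots filled but not yet drained */
} buffer_ring;

static void ring_init(buffer_ring *r)
{
    assert(r != NULL);
    r->head = 0;
    r->tail = 0;
    r->in_flight = 0;
}

/* Feeder side: claim the next free slot.
 * Returns the slot index, or -1 if every slot is still in flight. */
static int ring_acquire(buffer_ring *r)
{
    if (r->in_flight == RING_SLOTS)
        return -1;
    int slot = (int)r->head;
    r->head = (r->head + 1) % RING_SLOTS;
    r->in_flight++;
    return slot;
}

/* Consumer side: hand the oldest in-flight slot back to the pool.
 * Returns the slot index, or -1 if nothing is in flight. */
static int ring_release(buffer_ring *r)
{
    if (r->in_flight == 0)
        return -1;
    int slot = (int)r->tail;
    r->tail = (r->tail + 1) % RING_SLOTS;
    r->in_flight--;
    return slot;
}
```

When ring_acquire() returns -1, the feeder has to wait for the consumer instead of overwriting data that is still being processed. The number of slots bounds how far the feeder can queue ahead.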

Michal_M_Intel · ‎01-19-2017

What is happening in the code is that MapBuffer is called with blocking_map set to CL_TRUE on an out-of-order queue.

There are no input events, so to the driver this means "map this buffer for me now, even if it may be in use by the GPU". Please confirm that this is expected.

If you want such access to buffer storage and synchronization is not needed, you can create a second out-of-order queue and issue the MapBuffer operation on it. Currently the driver improperly waits for the blocked unmap operation to complete before servicing the MapBuffer call; this wait shouldn't be present in an out-of-order queue.

If you actually want to synchronize on the previous unmap call, then the code should use events.
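The distinction drawn above can be illustrated with a toy model in plain C. This is not the driver's logic and it makes no OpenCL calls; model_event, model_command, command_ready, and command_try_run are hypothetical names. The one assumption it encodes is that an out-of-order queue orders commands through their event wait lists (and explicit barriers or markers, which are not modeled here), not through their position in the queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WAIT 4  /* hypothetical limit on wait-list length */

/* Toy stand-in for an OpenCL event: either complete or not. */
typedef struct {
    bool complete;
} model_event;

/* Toy stand-in for an enqueued command and its wait list. */
typedef struct {
    const model_event *wait_list[MAX_WAIT];
    size_t num_waits;
    model_event done;  /* completes when the command itself has run */
} model_command;

/* A command may start once every event in its own wait list is complete.
 * Earlier commands in the queue do not matter unless they are listed. */
static bool command_ready(const model_command *c)
{
    assert(c->num_waits <= MAX_WAIT);
    for (size_t i = 0; i < c->num_waits; ++i) {
        if (!c->wait_list[i]->complete)
            return false;
    }
    return true;
}

/* Run the command if it has not run yet and is ready.
 * Returns true only when the command runs during this call. */
static bool command_try_run(model_command *c)
{
    if (c->done.complete || !command_ready(c))
        return false;
    c->done.complete = true;
    return true;
}
```

Take an event data_ready standing in for the user event. An unmap whose wait list holds data_ready cannot run yet. A map with an empty wait list can run immediately, which is the behaviour the GPU driver failed to provide. A map whose wait list holds the unmap's done event runs only after data_ready completes and the unmap has run. In real code the wait list corresponds to the num_events_in_wait_list and event_wait_list arguments of clEnqueueMapBuffer().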

tony_w_ · ‎01-19-2017

Thanks Michal. Yes, the "map now" behaviour is intended. Now that I understand what the issue is, I have been able to reorder a couple of things to avoid this problem, so I have our application running on the GPU. Unfortunately it is not fast enough, but I'll open a new topic to ask for help on that.