OpenCL* for CPU

Is there a driver watchdog time limit for Intel GPU on Linux?

QFang1
Novice

Is there a watchdog time limit when running an OpenCL program on a non-dedicated Intel GPU, i.e. one that also drives a monitor?

Specifically, I have a laptop with an Intel i7-5600U processor (HD 5500 graphics) running Ubuntu 14.04. I have installed the latest GPU/CPU drivers as well as the OpenCL SDK. When I run my OpenCL program on the CPU target, everything works well. However, on the HD GPU the code hangs on larger workloads (run times longer than about 5 seconds). I could not find anything obvious in the code that would cause this behavior, so I am wondering whether it is due to the driver.

If such a limit exists, is there a way to disable it?

thanks

Jeffrey_M_Intel1
Employee

Sorry for the delayed reply.  As far as I know there is nothing "built in" to cause a hang like you've described.

Ideally, to proceed with the investigation we will need a small/simple reproducer that you can share without intellectual property concerns.  

However, the first thing to check is if the new Media Server 2017 release helps.  

 

QFang1
Novice

Hi Jeffrey, thanks for your comment. It is good to know there is no time limit when running a kernel on Intel GPUs.

I posted the hanging problem previously in the thread below:

https://software.intel.com/en-us/forums/opencl/topic/675126

My software is open source, so you should be able to check out the source code, compile it, and reproduce the issue with the commands provided in the above link.

I tested my code on a range of GPUs and CPUs and noticed that only two devices showed this hanging problem: the Intel HD 5500 GPU (from an i7-5600U CPU) and an AMD Fiji GPU (R9 Nano). I recently fixed the hang on the Fiji by replacing clWaitForEvents() with clFinish(), as in this commit:

https://github.com/fangq/mcxcl/commit/135dc825e2905253ab0626a2b335dfee8b6e741e

However, the Intel GPU still hangs even with this change. I tried debugging with gdb, but it could not break inside the kernel, and I could not find out what was blocking the program.

If you can take a quick look and let me know how to debug this problem, that would be greatly appreciated!

In addition, I noticed a significant slowdown after upgrading to the latest SDK/GPU driver. With the previous GPU driver (a patched 4.1 Linux kernel), I got 1000 photon/ms when running 1e5-1e6 photons ("-n 1e5" on the command line). With the latest driver (a patched 4.4 kernel), the speed dropped to 80 photon/ms for a small load (-n 1e5 or -n 1e6). In both cases, the kernel hangs when running a larger number of photons (-n 5e6 or -n 1e7). The code runs smoothly on all Intel CPUs, giving 150 photon/ms (i7-5600U) to 400 photon/ms (i7-6700K). I would also appreciate any insight into this issue.

Jeffrey_M_Intel1
Employee

I can confirm that your application runs with -n 1e6 but hangs with -n 1e7 on the GPU. I will take a look at the code and get back to you in a day or two.

Usually we recommend a small reproducer in these guidelines but it looks like the OpenCL code is mostly in the relatively short mcx_host file.  Is there anywhere else to look?

 

QFang1
Novice

Hi Jeff, thank you so much for looking into this.

Can you explain what you mean by a "small reproducer"? Were you referring to the OpenCL kernel? If so, the kernel is in a file called mcx_core.cl:

https://github.com/fangq/mcxcl/blob/master/src/mcx_core.cl

The attached capsbasic example looks like a "deviceQuery" type of program that enumerates the devices. If this is what you are looking for, you can use

mcxcl -L

to do the same. The source code for this feature can be found here:

https://github.com/fangq/mcxcl/blob/master/src/mcx_host.cpp#L113-L220

 

Let me know if I understood your question correctly. Thanks again.

Jeffrey_M_Intel1
Employee

By "small reproducer", I mean a very small standalone application (usually not your full application) which shows the issue.  However, since your OpenCL code is localized to just a few lines we should be able to proceed with this application. 

QFang1
Novice

Just following up on this one-year-old thread. After a lot of googling, I found that the Intel GPU Linux driver does have a time limit (~10 seconds), controlled by the hangcheck parameter. To disable this limit so that an OpenCL kernel can run for a longer period, you need to run

echo -n 0 > /sys/module/i915/parameters/enable_hangcheck

as root, or open the file with sudo nano /sys/module/i915/parameters/enable_hangcheck and replace Y with 0.

After changing this flag, an OpenCL kernel can run for longer than 10 seconds without being killed.
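
The steps above can be wrapped in a small script. This is only a minimal sketch, assuming the sysfs path quoted in this post; the function names and the HANGCHECK_PATH variable are made up for illustration and are not part of any driver tooling.

```shell
#!/bin/sh
# Path taken from the post above. HANGCHECK_PATH is a hypothetical override
# so the functions can be tried on an ordinary file without touching the driver.
HANGCHECK_PATH="${HANGCHECK_PATH:-/sys/module/i915/parameters/enable_hangcheck}"

# Print the current setting, or "unavailable" if the file cannot be read
# (for example, when the i915 module is not loaded).
hangcheck_status() {
    if [ -r "$HANGCHECK_PATH" ]; then
        cat "$HANGCHECK_PATH"
    else
        echo "unavailable"
    fi
}

# Write 0 to the parameter file. On a real system this needs root.
hangcheck_disable() {
    if [ ! -w "$HANGCHECK_PATH" ]; then
        echo "cannot write $HANGCHECK_PATH (run as root?)" >&2
        return 1
    fi
    printf '0' > "$HANGCHECK_PATH"
}
```

Call hangcheck_status first to see the current value, then hangcheck_disable as root. A value written through sysfs does not survive a reboot, so it has to be set again after each boot.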

To see whether your GPU hung, run dmesg after the kernel hangs. Related link:

https://www.freedesktop.org/wiki/Software/Beignet/
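
For the dmesg check mentioned above, the output can be filtered so that hang reports stand out. This is a rough sketch: the function name is hypothetical, and the search pattern is an assumption about what the driver logs, since the exact message text varies between kernel versions.

```shell
#!/bin/sh
# Filter kernel log text on stdin for lines that look like GPU hang reports.
# The pattern is a guess; adjust it to match what your kernel actually prints.
filter_gpu_hang() {
    grep -i -E 'gpu hang|hangcheck'
}

# Usage (reading the kernel log may require root on some systems):
#   dmesg | filter_gpu_hang
```

If nothing is printed, either no hang was logged or the pattern does not match your kernel's wording, so it is worth reading the unfiltered dmesg output as well.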

PS: The release notes for the latest GPU driver (intel-opencl-r5.0) appear to describe this trick for the 4.7 kernel patch on page 6:

http://registrationcenter-download.intel.com/akdlm/irc_nas/11396/SRB5.0_intel-opencl-release-notes.p...
