
Error '[drm] stuck on render ring' when running 'gemm' sample code - CentOS 7.1 - kernel 3.10.0-229.el7.centos.intel.sr1.x86_64

  • Hardware: Gigabyte Brix GB-BXi7-4770R
  • CPU info: Intel(R) Core(TM) i7-4770R CPU @ 3.20GHz (Gigabyte Brix)
  • GPU: Intel Iris Pro 5200 (integrated)
  • OS: Linux CentOS 7.1
  • Kernel: 3.10.0-229.el7.centos.intel.sr1.x86_64 (patched with the patch 'kernel-3.10.0-229.patch' included in 'intel-opencl-1.2-1.0-47971.tar.gz' following the instructions in 'intel-opencl-1.2-installation-external.pdf')
  • Compiled the Intel OpenCL samples using gcc 4.8.3
  • The first sample 'CapBasic' runs without errors and generates the following output:
Number of available platforms: 1
Platform names:
    [0] Intel(R) OpenCL [Selected]
Number of devices available for each type:
    CL_DEVICE_TYPE_CPU: 0
    CL_DEVICE_TYPE_GPU: 1
    CL_DEVICE_TYPE_ACCELERATOR: 0

*** Detailed information for each device ***

CL_DEVICE_TYPE_GPU[0]
    CL_DEVICE_NAME: Intel(R) HD Graphics
    CL_DEVICE_AVAILABLE: 1
    CL_DEVICE_VENDOR: Intel(R) Corporation
    CL_DEVICE_PROFILE: FULL_PROFILE
    CL_DEVICE_VERSION: OpenCL 1.2 
    CL_DRIVER_VERSION: 1.0.47971
    CL_DEVICE_OPENCL_C_VERSION: OpenCL C 1.2 
    CL_DEVICE_MAX_COMPUTE_UNITS: 40
    CL_DEVICE_MAX_CLOCK_FREQUENCY: 1300
    CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
    CL_DEVICE_ADDRESS_BITS: 64
    CL_DEVICE_MEM_BASE_ADDR_ALIGN: 1024
    CL_DEVICE_MAX_MEM_ALLOC_SIZE: 427399577
    CL_DEVICE_GLOBAL_MEM_SIZE: 1709598311
    CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 427399577
    CL_DEVICE_GLOBAL_MEM_CACHE_SIZE: 524288
    CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE: 64
    CL_DEVICE_LOCAL_MEM_SIZE: 65536
    CL_DEVICE_PROFILING_TIMER_RESOLUTION: 80
    CL_DEVICE_IMAGE_SUPPORT: 1
    CL_DEVICE_ERROR_CORRECTION_SUPPORT: 0
    CL_DEVICE_HOST_UNIFIED_MEMORY: 1
    CL_DEVICE_EXTENSIONS: cl_intel_accelerator cl_intel_advanced_motion_estimation cl_intel_motion_estimation cl_intel_subgroups cl_intel_va_api_media_sharing cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_icd cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_spir 
    CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT: 1
    CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG: 1
    CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT: 1
    CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE: 0
    CL_DEVICE_NATIVE_VECTOR_WIDTH_INT: 1
    CL_DEVICE_NATIVE_VECTOR_WIDTH_LONG: 1
    CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT: 1
    CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE: 0

 

  • However, when I try to run the 'GEMM' sample, it hangs after printing the following few lines:
Platforms (1):
    [0] Intel(R) OpenCL [Selected]
Devices (1):
    [0] Intel(R) HD Graphics [Selected]

 

  • At the same time in '/var/log/messages', I see the following:
Jan 17 21:45:40 centos71 kernel: [drm] GPU HANG: ecode 0:0x8fd0ffff, in gemm [2556], reason: Ring hung, action: reset
Jan 17 21:45:42 centos71 kernel: [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off
Jan 17 21:45:42 centos71 kernel: ------------[ cut here ]------------
Jan 17 21:45:42 centos71 kernel: WARNING: at drivers/gpu/drm/i915/intel_pm.c:3432 gen6_enable_rps_interrupts+0xa3/0xb0 [i915]()
Jan 17 21:45:42 centos71 kernel: Modules linked in: ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables snd_hda_codec_realtek intel_powerclamp snd_hda_codec_hdmi snd_hda_codec_generic coretemp kvm_intel kvm snd_hda_intel snd_hda_controller snd_hda_codec btusb crct10dif_pclmul crc32_pclmul crc32c_intel snd_hwdep bluetooth i915 ghash_clmulni_intel snd_seq aesni_intel snd_seq_device r8169 lrw gf128mul glue_helper rfkill snd_pcm ablk_helper cryptd mii i2c_algo_bit snd_timer iTCO_wdt drm_kms_helper
Jan 17 21:45:42 centos71 kernel: snd iTCO_vendor_support sdhci_acpi drm soundcore sdhci mei_me mmc_core shpchp mei lpc_ich video mfd_core i2c_i801 i2c_hid i2c_core pcspkr nls_utf8 isofs loop xfs libcrc32c usb_storage sd_mod crc_t10dif crct10dif_common ahci libahci libata dm_mirror dm_region_hash dm_log dm_mod
Jan 17 21:45:42 centos71 kernel: CPU: 0 PID: 71 Comm: kworker/0:1 Not tainted 3.10.0-229.el7.centos.intel.sr1.x86_64 #1
Jan 17 21:45:42 centos71 kernel: Hardware name: GIGABYTE M4HM87P-00/M4HM87P-00, BIOS F5 06/23/2014
Jan 17 21:45:42 centos71 kernel: Workqueue: events intel_gen6_powersave_work [i915]
Jan 17 21:45:42 centos71 kernel: 0000000000000000 000000005b114aaa ffff880407ecbd58 ffffffff81603f36
Jan 17 21:45:42 centos71 kernel: ffff880407ecbd90 ffffffff8106e28b ffff880406430000 ffff880406437108
Jan 17 21:45:42 centos71 kernel: 0000000000040000 ffff880406435820 ffff880406430000 ffff880407ecbda0
Jan 17 21:45:42 centos71 kernel: Call Trace:
Jan 17 21:45:42 centos71 kernel: [<ffffffff81603f36>] dump_stack+0x19/0x1b
Jan 17 21:45:42 centos71 kernel: [<ffffffff8106e28b>] warn_slowpath_common+0x6b/0xb0
Jan 17 21:45:42 centos71 kernel: [<ffffffff8106e3da>] warn_slowpath_null+0x1a/0x20
Jan 17 21:45:42 centos71 kernel: [<ffffffffa03e5263>] gen6_enable_rps_interrupts+0xa3/0xb0 [i915]
Jan 17 21:45:42 centos71 kernel: [<ffffffffa03ea32e>] intel_gen6_powersave_work+0x39e/0xd80 [i915]
Jan 17 21:45:42 centos71 kernel: [<ffffffff8108f0ab>] process_one_work+0x17b/0x470
Jan 17 21:45:42 centos71 kernel: [<ffffffff8108fe8b>] worker_thread+0x11b/0x400
Jan 17 21:45:42 centos71 kernel: [<ffffffff8108fd70>] ? rescuer_thread+0x400/0x400
Jan 17 21:45:42 centos71 kernel: [<ffffffff8109726f>] kthread+0xcf/0xe0
Jan 17 21:45:42 centos71 kernel: [<ffffffff810971a0>] ? kthread_create_on_node+0x140/0x140
Jan 17 21:45:42 centos71 kernel: [<ffffffff81613cfc>] ret_from_fork+0x7c/0xb0
Jan 17 21:45:42 centos71 kernel: [<ffffffff810971a0>] ? kthread_create_on_node+0x140/0x140
Jan 17 21:45:42 centos71 kernel: ---[ end trace fb836458f742c30c ]---

  • I saved the GPU crash dump from '/sys/class/drm/card0/error' in case you are interested in it
  • I also tried compiling the stock 4.1 Linux kernel and using the patch provided for the 4.1 kernel, but the results are similar.
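
For anyone trying to reproduce this, the evidence mentioned above (the 'GPU HANG' lines and the crash dump) could be gathered with a small helper along these lines. This is only a sketch: the function name `collect_gpu_hang_report` and the output file names are made up for illustration, the default paths are the ones quoted in this post, and reading them normally requires root.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the Intel package): save the kernel's
# 'GPU HANG' log lines and the i915 error state into one directory.
# Arguments (all optional): LOG_FILE ERROR_STATE OUT_DIR
collect_gpu_hang_report() {
    local log_file="${1:-/var/log/messages}"
    local error_state="${2:-/sys/class/drm/card0/error}"
    local out_dir="${3:-.}"

    mkdir -p "$out_dir" || return 1

    # Keep only the hang reports; grep exits non-zero when nothing matches,
    # which is not an error for our purposes.
    grep -F 'GPU HANG' "$log_file" > "$out_dir/gpu_hang_lines.txt" || true

    # Read the error state with cat, since sysfs files report a size of 0.
    cat "$error_state" > "$out_dir/i915_error_state.txt" || return 1

    # Print how many hang lines were found.
    wc -l < "$out_dir/gpu_hang_lines.txt" | tr -d ' '
}
```

Calling `collect_gpu_hang_report` with no arguments from a root shell uses the default paths and writes both files to the current directory.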

Any help or idea of what's going on here is appreciated.

Thanks in advance,

Franco Venturi

3 Replies

Hi Franco,

Here is the response from our Linux developer:

GEMM on my system has always been one of those long-running applications requiring the hang check to be disabled. All of our release notes talk about this to some degree. Our latest in SRB1 is pretty much unchanged from previous versions. I do not see the call trace, but I suspect this will resolve the issue for them:

 

-   For workloads that take longer than 1.5 seconds the i915 hang check
    will reset the GPU, output a kernel message for logging, and clear
    any pending work items. When necessary, the i915 hang check can be
    disabled on demand with

        $ sudo bash -c 'echo N > /sys/module/i915/parameters/enable_hangcheck'

    Although the GPU will no longer reset when executing with hang
    checks disabled, sufficiently large workloads may stall other GPU
    tasks such as screen updates. These situations can be recovered from
    by manually resetting the GPU with

        $ sudo bash -c 'echo 1 > /sys/kernel/debug/dri/0/i915_wedged'

 

 We also describe this in our release notes.
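
If the hang check should only be off while one long-running job executes, the release-note command can be wrapped so that the previous setting is put back afterwards. The following is a minimal sketch and not something shipped with the driver: the function names `write_param` and `run_without_hangcheck` and the `HANGCHECK_PARAM` variable are hypothetical, and the default path is the one from the note above.

```shell
#!/usr/bin/env bash
# Path from the release notes; override HANGCHECK_PARAM to point elsewhere.
HANGCHECK_PARAM="${HANGCHECK_PARAM:-/sys/module/i915/parameters/enable_hangcheck}"

# Hypothetical helper: write a value to the parameter, using sudo only
# when the current user cannot write the file directly.
write_param() {
    if [ -w "$HANGCHECK_PARAM" ]; then
        printf '%s\n' "$1" > "$HANGCHECK_PARAM"
    else
        printf '%s\n' "$1" | sudo tee "$HANGCHECK_PARAM" > /dev/null
    fi
}

# Hypothetical wrapper: disable the hang check, run the given command,
# then restore whatever value was set before.
run_without_hangcheck() {
    local previous status=0
    previous="$(cat "$HANGCHECK_PARAM")" || return 1
    write_param N || return 1
    "$@" || status=$?
    write_param "$previous"
    return "$status"
}
```

A call such as `run_without_hangcheck ./gemm` (the binary path is only an example) would then run the sample with the check disabled and return its exit status. The sketch does not restore the setting if the wrapper itself is interrupted, and the screen may still stall while the workload runs, as the note above describes.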

 


Thanks for the reply Robert.

I went ahead and disabled the hangcheck, and this time 'gemm' ran to completion without warnings or errors (I did notice that the screen seemed to be frozen while 'gemm' was running, but I think that is to be expected).

I apologize for not having seen that important info in the release notes (I'll go ahead and read them tonight in case I missed other important information).

Thanks again,

Franco


Is it possible to set hangcheck on OS X? What would the command look like?
