Segmentation fault on Linux Red Hat with 1.2.0.10002 driver

Nicolas_B_1 · ‎06-09-2016

Hi,

I recently updated my OpenCL driver to the latest 16.1 runtime and I am experiencing random segmentation faults after creating a context for my CPU (Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz).
I am using OpenCL from Java with JOCL, the Java binding for OpenCL provided by JogAmp (https://jogamp.org/).

I cannot provide any reproduction steps, because it only happens with a very specific setup in a very specific environment, but I can describe the problem; maybe you are already aware of it.

When my application starts, it lists all available devices and creates a context for each one. With the previous driver version (1.2.0.9756), I never had any problem (at least not like this one), but since the update, java sometimes crashes a few milliseconds after this step. When debugging the core with gdb, it looks like memory corruption is crashing the JVM (corrupted stack):

GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/java/jdk1.8.0_92/bin/java...(no debugging symbols found)...done.

(gdb) run
Starting program: /usr/java/jdk1.8.0_92/bin/java
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff7fdf700 (LWP 4347)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7fdf700 (LWP 4347)]
0x00007fffe10002b4 in ?? ()
(gdb) where
#0  0x00007fffe10002b4 in ?? ()
#1  0x0000000000000246 in ?? ()
#2  0x00007fffe1000160 in ?? ()
#3  0x00007ffff7397250 in VM_Operation::_names () from /usr/java/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so
#4  0x00007ffff7fde980 in ?? ()
#5  0x00007ffff6ec8b9d in VM_Version::get_processor_features() () from /usr/java/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so

I am also using NVIDIA drivers for my GPUs. When I remove the NVIDIA driver, the problem still occurs, but when I remove the Intel driver, I can no longer reproduce it, so I think the problem comes from the latter.

I am on Red Hat 7.2-9.

Do you have any idea about the cause of this problem? Do you have beta drivers that I can test?

Robert_I_Intel · ‎06-13-2016

Hi Nicolas,

From the stack dump you are providing it is hard to reach the conclusion that the OpenCL driver is at fault. If the previous driver version worked for you, is moving back to an old driver an option?

You could try to run this application under valgrind or purify to see if there are any memory leaks or memory corruption that is going on. I don't have many clues to go by as to how to replicate the issue you are experiencing.

Sorry,

Bram_L_ · ‎06-15-2016

Hello,

We did find the root cause of this crash.

The driver seems to call signal-handling functions (such as signal(), sigset(), sigaction()...). In a JNI context, this can override the JVM's own signal handlers. Running the java process with -Xcheck:jni shows:

Warning: SIGSEGV handler expected:libjvm.so+0x918480  found:libOclCpuBackEnd.so+0x32b940
Signal Handlers:
SIGSEGV: [libOclCpuBackEnd.so+0x32b940], sa_mask[0]=00000000000000000000000000000000, sa_flags=SA_RESETHAND|SA_NODEFER
SIGBUS: [libOclCpuBackEnd.so+0x32b940], sa_mask[0]=00000000000000000000000000000000, sa_flags=SA_RESETHAND|SA_NODEFER
SIGFPE: [libOclCpuBackEnd.so+0x32b940], sa_mask[0]=00000000000000000000000000000000, sa_flags=SA_RESETHAND|SA_NODEFER
SIGPIPE: [libOclCpuBackEnd.so+0x32b940], sa_mask[0]=00000000000000000000000000000000, sa_flags=SA_RESETHAND|SA_NODEFER
SIGXFSZ: [libOclCpuBackEnd.so+0x32b940], sa_mask[0]=00000000000000000000000000000000, sa_flags=SA_RESETHAND|SA_NODEFER
SIGILL: [libOclCpuBackEnd.so+0x32b940], sa_mask[0]=00000000000000000000000000000000, sa_flags=SA_RESETHAND|SA_NODEFER
SIGUSR1: [libOclCpuBackEnd.so+0x32b940], sa_mask[0]=00000000000000000000000000000000, sa_flags=SA_RESETHAND|SA_NODEFER
SIGUSR2: [libOclCpuBackEnd.so+0x32b940], sa_mask[0]=00000000000000000000000000000000, sa_flags=SA_RESETHAND|SA_NODEFER
SIGHUP: [libOclCpuBackEnd.so+0x32b940], sa_mask[0]=00000000000000000000000000000000, sa_flags=SA_RESETHAND|SA_NODEFER
SIGINT: [libOclCpuBackEnd.so+0x32b940], sa_mask[0]=00000000000000000000000000000000, sa_flags=SA_RESETHAND|SA_NODEFER
SIGTERM: [libOclCpuBackEnd.so+0x32b940], sa_mask[0]=00000000000000000000000000000000, sa_flags=SA_RESETHAND|SA_NODEFER
SIGQUIT: [libOclCpuBackEnd.so+0x32b940], sa_mask[0]=00000000000000000000000000000000, sa_flags=SA_RESETHAND|SA_NODEFER

This can be fixed by preloading the JVM's signal-chaining library (libjsig.so), like this: export LD_PRELOAD=<libjvm.so dir>/libjsig.so (see this link for more info: https://docs.oracle.com/javase/8/docs/technotes/guides/vm/signal-chaining.html)

Not sure why this did not happen with the previous version we had (maybe the signal handling in the driver is recent?)

Bram

Robert_I_Intel · ‎06-21-2016

Hi Bram,

I will contact the development team to see if they have any comments on this. Thanks for investigating and reporting this issue!