Hi,
I recently updated my OpenCL driver to the latest 16.1 runtime, and I am experiencing random segmentation faults after creating a context for my CPU (Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz).
I am using OpenCL from Java with JOCL, the Java binding for OpenCL provided by JogAmp (https://jogamp.org/).
I cannot provide reproduction steps, because the crash only happens with a very specific setup in a very specific environment, but I can describe the problem; maybe you are already aware of something similar.
When my application starts, it lists all available devices and creates a context for each device. With the previous driver version (1.2.0.9756) I never had any problem (at least not one like this), but since the update, Java sometimes crashes a few milliseconds after this task. When debugging the core dump with gdb, it looks like memory corruption is crashing the JVM (corrupted stack):
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/java/jdk1.8.0_92/bin/java...(no debugging symbols found)...done.
(gdb) run
Starting program: /usr/java/jdk1.8.0_92/bin/java
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff7fdf700 (LWP 4347)]
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7fdf700 (LWP 4347)]
0x00007fffe10002b4 in ?? ()
(gdb) where
#0 0x00007fffe10002b4 in ?? ()
#1 0x0000000000000246 in ?? ()
#2 0x00007fffe1000160 in ?? ()
#3 0x00007ffff7397250 in VM_Operation::_names () from /usr/java/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so
#4 0x00007ffff7fde980 in ?? ()
#5 0x00007ffff6ec8b9d in VM_Version::get_processor_features() () from /usr/java/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so
I am also using NVIDIA drivers for my GPUs. When I remove the NVIDIA driver, the problem still occurs, but when I remove the Intel driver, I can no longer reproduce it, so I think the problem comes from the latter.
I am on Red Hat 7.2-9.
Do you have any idea what could cause this problem? Do you have beta drivers that I can test?
Hi Nicolas,
From the stack dump you provided, it is hard to conclude that the OpenCL driver is at fault. If the previous driver version worked for you, is moving back to the old driver an option?
You could try running the application under valgrind or purify to check for memory leaks or memory corruption. I don't have many clues to go on for replicating the issue you are experiencing.
Sorry,
Hello,
We did find the root cause of this crash.
The driver seems to call signal-handling functions (such as signal(), sigset(), sigaction()...). In a JNI context, this can override the JVM's own signal handlers. Running the Java process with -Xcheck:jni shows:
Warning: SIGSEGV handler expected:libjvm.so+0x918480 found:libOclCpuBackEnd.so+0x32b940
Signal Handlers:
SIGSEGV: [libOclCpuBackEnd.so+0x32b940], sa_mask[0]=00000000000000000000000000000000, sa_flags=SA_RESETHAND|SA_NODEFER
SIGBUS: [libOclCpuBackEnd.so+0x32b940], sa_mask[0]=00000000000000000000000000000000, sa_flags=SA_RESETHAND|SA_NODEFER
SIGFPE: [libOclCpuBackEnd.so+0x32b940], sa_mask[0]=00000000000000000000000000000000, sa_flags=SA_RESETHAND|SA_NODEFER
SIGPIPE: [libOclCpuBackEnd.so+0x32b940], sa_mask[0]=00000000000000000000000000000000, sa_flags=SA_RESETHAND|SA_NODEFER
SIGXFSZ: [libOclCpuBackEnd.so+0x32b940], sa_mask[0]=00000000000000000000000000000000, sa_flags=SA_RESETHAND|SA_NODEFER
SIGILL: [libOclCpuBackEnd.so+0x32b940], sa_mask[0]=00000000000000000000000000000000, sa_flags=SA_RESETHAND|SA_NODEFER
SIGUSR1: [libOclCpuBackEnd.so+0x32b940], sa_mask[0]=00000000000000000000000000000000, sa_flags=SA_RESETHAND|SA_NODEFER
SIGUSR2: [libOclCpuBackEnd.so+0x32b940], sa_mask[0]=00000000000000000000000000000000, sa_flags=SA_RESETHAND|SA_NODEFER
SIGHUP: [libOclCpuBackEnd.so+0x32b940], sa_mask[0]=00000000000000000000000000000000, sa_flags=SA_RESETHAND|SA_NODEFER
SIGINT: [libOclCpuBackEnd.so+0x32b940], sa_mask[0]=00000000000000000000000000000000, sa_flags=SA_RESETHAND|SA_NODEFER
SIGTERM: [libOclCpuBackEnd.so+0x32b940], sa_mask[0]=00000000000000000000000000000000, sa_flags=SA_RESETHAND|SA_NODEFER
SIGQUIT: [libOclCpuBackEnd.so+0x32b940], sa_mask[0]=00000000000000000000000000000000, sa_flags=SA_RESETHAND|SA_NODEFER
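If you want to check for this automatically (for example in a smoke test that captures the output of a run with -Xcheck:jni), the warning line can be picked apart with a small parser. This is only a sketch: the class name HandlerWarning is hypothetical, and the pattern is derived solely from the single warning line quoted above, so other JVM versions may format it differently.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts the libraries named in a -Xcheck:jni handler warning. */
final class HandlerWarning {
    // Assumed format, taken from the one line quoted in this thread:
    // Warning: SIGSEGV handler expected:libjvm.so+0x918480 found:libOclCpuBackEnd.so+0x32b940
    private static final Pattern WARNING = Pattern.compile(
        "Warning: (SIG\\w+) handler expected:(\\S+?)\\+0x[0-9a-fA-F]+ found:(\\S+?)\\+0x[0-9a-fA-F]+");

    final String signal;       // e.g. SIGSEGV
    final String expectedLib;  // library the JVM expected to own the handler
    final String foundLib;     // library that actually owns it

    private HandlerWarning(String signal, String expectedLib, String foundLib) {
        this.signal = signal;
        this.expectedLib = expectedLib;
        this.foundLib = foundLib;
    }

    /** Returns the parsed warning, or empty if the line does not match the assumed format. */
    static Optional<HandlerWarning> parse(String line) {
        Matcher m = WARNING.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new HandlerWarning(m.group(1), m.group(2), m.group(3)));
    }

    /** True when a library other than the expected one has installed the handler. */
    boolean isOverridden() {
        return !expectedLib.equals(foundLib);
    }
}
```

Feeding each line of the captured output to HandlerWarning.parse and checking isOverridden() would flag the replaced SIGSEGV handler shown above.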
This can be fixed by preloading the Java library that performs signal chaining (libjsig.so), like this:
export LD_PRELOAD=<libjvm.so dir>/libjsig.so
(See this link for more info: https://docs.oracle.com/javase/8/docs/technotes/guides/vm/signal-chaining.html)
Not sure why this was not happening with the previous version we had (maybe the signal handling in the driver is something recent?)
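If the JVM that uses OpenCL is started from another Java process (a launcher or a test harness), the same workaround can be applied programmatically instead of exporting the variable in a shell. The following is a minimal sketch under a few assumptions: the class name JsigLauncher and both method names are hypothetical, and libjsig.so is located by searching under the given Java home because its exact directory varies between JDK versions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Stream;

/** Hypothetical helper: starts a child JVM with libjsig.so preloaded. */
final class JsigLauncher {

    /** Searches the given Java home for libjsig.so; the location differs between JDK releases. */
    static Optional<Path> findLibJsig(Path javaHome) throws IOException {
        try (Stream<Path> files = Files.walk(javaHome)) {
            return files
                .filter(p -> p.getFileName() != null
                          && p.getFileName().toString().equals("libjsig.so"))
                .findFirst();
        }
    }

    /** Puts libjsig.so first in LD_PRELOAD of the child process, keeping any existing entries. */
    static ProcessBuilder withSignalChaining(ProcessBuilder pb, Path libJsig) {
        Map<String, String> env = pb.environment();
        String existing = env.get("LD_PRELOAD");
        String value = (existing == null || existing.isEmpty())
            ? libJsig.toString()
            : libJsig + ":" + existing;
        env.put("LD_PRELOAD", value);
        return pb;
    }
}
```

A caller would pass Paths.get(System.getProperty("java.home")) to findLibJsig, then hand the result to withSignalChaining before calling start() on the ProcessBuilder. Only the child process is affected; the JVM that is already running cannot preload the library after the fact.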
Bram
Hi Bram,
I will contact the development team to see if they have any comments on this. Thanks for investigating and reporting this issue!
