I am trying to get my head around a nasty segmentation fault. It occurs in a third party library, only after our program has finished. Running the software in gdb gives the following stacktrace:
#0 0x00007ffff01fbf40 in pthread_mutex_destroy () from /lib64/libpthread.so.0
#1 0x0000000000f30041 in kill_resource ()
#2 0x0000000000f2ffa1 in reentrancy_cleanup ()
#3 0x0000000000f2feae in for__reentrancy_cleanup ()
#4 0x00007ffff15ae8f5 in for__exit_handler () from /u/username/wrk/software/petsc-3.9.3_intel16/lib/libpetsc.so.3.9
#5 0x00000000004135a3 in main ()
#6 0x00007fffefe7dd1d in __libc_start_main () from /lib64/libc.so.6
#7 0x00000000004134a9 in _start ()
We are compiling with ifort 184.108.40.206. However, when compiling the same software (our application, but also libpetsc, the third party library) with ifort 220.127.116.11, we didn't experience this problem. Moreover, when checking the symbols in the library with nm, the older version does not even show routines such as for__reentrancy_cleanup() or kill_resource().
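For anyone who wants to repeat the nm comparison between the two builds, it can be scripted. This is only a minimal sketch: it assumes the plain-text output of nm for each build has already been saved, the helper names (defined_symbols, only_in_new) are made up for illustration, and the sample addresses are invented. Only the two routine names come from the backtrace above.

```python
def defined_symbols(nm_output):
    """Return the names of symbols that nm lists with an address (defined symbols)."""
    names = set()
    for line in nm_output.splitlines():
        fields = line.split()
        # Defined symbols print as "<address> <type> <name>";
        # undefined ones print as "<type> <name>" with no address column.
        if len(fields) == 3 and fields[1] not in ("U", "w", "v"):
            names.add(fields[2])
    return names


def only_in_new(old_nm_output, new_nm_output):
    """Symbols defined in the new build but absent from the old one, sorted by name."""
    return sorted(defined_symbols(new_nm_output) - defined_symbols(old_nm_output))


if __name__ == "__main__":
    # Hypothetical nm excerpts; the addresses are invented for the example.
    old_build = (
        "0000000000401000 T main\n"
        "                 U pthread_mutex_destroy\n"
    )
    new_build = old_build + (
        "0000000000f2fe00 T for__reentrancy_cleanup\n"
        "0000000000f30000 t kill_resource\n"
    )
    print(only_in_new(old_build, new_build))
```

Feeding it the saved nm output of the old and new builds should list exactly the routines that only the newer runtime brings in.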
We have been trying very hard to isolate the error and prepare a minimal example, but so far this wasn't successful. The hints that we have are
- the problem did not occur when building with version 14.0
- it seems to be related to multi-threading/reentrancy
- in particular, dropping the compiler option '-fopenmp' gets rid of the segmentation fault, but ideally we would like to keep it
Does anyone have a clue what is going on here? Has some default changed between 18.104.22.168 and 22.214.171.124 that could explain the above behavior? What generates 'for__reentrancy_cleanup()'?
It would be great if someone can help out here and maybe give some pointers. Also, please let me know if you would need more information.
Thank you for your comments.
In the meantime, I compiled with all OpenMP directives removed from the code. I was hoping that would change things, but unfortunately the segmentation fault is still there. Following Steve's suggestion, I will start investigating this as a possible stack corruption, building the program with parts of the source code excluded.
However, maybe I should add that I have now also built a version with an 'empty main' where no routines are called or code is executed, but with full linking of all dependencies. The problem in this case persists...
To come up with a possible workaround, I would still be very interested in the nature of the routines 'reentrancy_cleanup' and 'for__reentrancy_cleanup'. Does anyone know what controls them? Why are they not present when building with Intel 14.0?
Before starting the code elimination experiments, the first thing I would do is generate a linker map to verify that your data segment size does not reach or exceed 2GB. If it does, you may have issues that need to be resolved first (e.g. by changing large/huge static arrays to dynamic arrays).
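As a quick cross-check alongside the linker map, the data and bss totals reported by the binutils size tool can be compared against the 2GB mark. This is a different and coarser technique than reading the map, and the sketch below rests on assumptions: it expects the default Berkeley-style output of size (a header line followed by one row), and the helper names are made up for illustration.

```python
TWO_GIB = 2 ** 31  # the 2GB threshold mentioned above, in bytes


def static_data_bytes(size_output):
    """Sum the 'data' and 'bss' columns from Berkeley-format output of size."""
    lines = [ln for ln in size_output.splitlines() if ln.strip()]
    header = lines[0].split()   # text data bss dec hex filename
    values = lines[1].split()   # first row only: one executable
    row = dict(zip(header, values))
    return int(row["data"]) + int(row["bss"])


def reaches_two_gib(size_output):
    """True if the statically allocated data reaches or exceeds 2GB."""
    return static_data_bytes(size_output) >= TWO_GIB


if __name__ == "__main__":
    # Hypothetical output of "size ./a.out"; the numbers are invented.
    sample = (
        "   text\t   data\t    bss\t    dec\t    hex\tfilename\n"
        "   1234\t    567\t     89\t   1890\t    762\ta.out\n"
    )
    print(static_data_bytes(sample), reaches_two_gib(sample))
```

If this reports a value anywhere near the limit, the linker map is still the place to look to find out which arrays are responsible.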
Thanks for the tip, Jim. I have now checked the linker map, and there's nothing out of the ordinary as far as I can see.
I continued with the code elimination process. The segmentation fault persisted up to the point where I masked out the last occurrence of a call to the intrinsic random_number subroutine. With that call as the suspect, I have now produced a minimal example which fails when compiled with mpif90 (MPICH2/ifort 16) and -fopenmp, and which succeeds when the -fopenmp flag is dropped or when we use ifort 14. This is the same behavior as we see in the full program. The minimal program is:
program fails
  use petsc
  real :: a
  print *, PETSC_COMM_WORLD
  call random_number(a)
end program
Running it in gdb results in the exact same stacktrace as posted in my first message.
Led by the above information (i.e. knowing that "random_number" should be included in the search query) I found a similar issue posted to this forum some time ago:
discussing a symbol conflict between libintlc.so and libc.so.6. Judging from the final comment by Martyn, it seems that our best option is to investigate moving to compiler version 17. Do you agree that that would be the way to go? (Or has there been any follow-up on that topic that you know of?)
If you can obtain a newer compiler, then do so. There have been numerous improvements (bug fixes and features) that make it worthwhile. I do suggest holding off on V18 (except for development testing), as some issues with it have been reported on this forum.
It is good that you found the name collision issue. Have you experimented with specifically changing the link order to force the correct function to be loaded?
I have been trying to manipulate things by setting the LD_PRELOAD variable with:
LD_PRELOAD=/lib64/libc.so.6:/path/to/intel/compiler/lib/intel64_lin/libintlc.so.5
Unfortunately, without any luck... The only thing that I can think of at this point, is to get the libintlc from v17 and put that in the path, as suggested in that topic by Martyn. That, and the upgrade to 18 for the longer term.
The LD_PRELOAD is post-build. What you need to do is change the library search during build such that the affected symbol is identified with the desired library. IOW the .so files are used during build to identify which library to load at runtime.
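One way to check which library a symbol actually ends up bound to at runtime is to run the program with the glibc loader's LD_DEBUG=bindings setting and search the log it writes. The sketch below is only that, a sketch: it assumes the log was captured to text, it assumes the "binding file ... to ...: normal symbol" line format that glibc prints, and both the helper name (providers) and the symbol names in the sample are hypothetical. The thread does not say which symbol collides.

```python
import re

# Matches glibc loader lines of the (assumed) form:
#   binding file <src> [0] to <dst> [0]: normal symbol `<name>' [VERSION]
BINDING_RE = re.compile(
    r"binding file (?P<src>.+?) \[\d+\] to (?P<dst>.+?) \[\d+\]: "
    r"\w+ symbol `(?P<sym>[^']+)'"
)


def providers(ld_debug_output, symbol):
    """Return the sorted list of libraries that the given symbol was bound to."""
    found = set()
    for line in ld_debug_output.splitlines():
        match = BINDING_RE.search(line)
        if match and match.group("sym") == symbol:
            found.add(match.group("dst"))
    return sorted(found)


if __name__ == "__main__":
    # Hypothetical log excerpt; paths and symbol names are placeholders.
    log = (
        "     12345:\tbinding file ./fails [0] to /lib64/libc.so.6 [0]: "
        "normal symbol `example_sym' [GLIBC_2.2.5]\n"
        "     12345:\tbinding file /path/libpetsc.so.3.9 [0] to /path/libintlc.so.5 [0]: "
        "normal symbol `example_sym'\n"
    )
    print(providers(log, "example_sym"))
```

If the same symbol shows up with more than one provider, or with a different provider than intended, the changed link order did not have the desired effect.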
As a matter of fact, I tried both. First following the suggestion from this thread:
and then also rebuilding the executable and the third-party library with the library search order changed. Unfortunately, neither attempt succeeded.