Intel® Fortran Compiler

Segmentation fault in for__reentrancy_cleanup

Russcher__Martijn

Hi everybody,

I am trying to get my head around a nasty segmentation fault. It occurs in a third-party library, and only after our program has finished. Running the software in gdb gives the following stack trace:

#0  0x00007ffff01fbf40 in pthread_mutex_destroy () from /lib64/libpthread.so.0
#1  0x0000000000f30041 in kill_resource ()
#2  0x0000000000f2ffa1 in reentrancy_cleanup ()
#3  0x0000000000f2feae in for__reentrancy_cleanup ()
#4  0x00007ffff15ae8f5 in for__exit_handler () from /u/username/wrk/software/petsc-3.9.3_intel16/lib/libpetsc.so.3.9
#5  0x00000000004135a3 in main ()
#6  0x00007fffefe7dd1d in __libc_start_main () from /lib64/libc.so.6
#7  0x00000000004134a9 in _start ()
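
One thing a trace like this shows is which module each frame lives in: gdb appends "from <library>" only for frames it resolves in a shared object. Below is a minimal Python sketch for tabulating that. The names `Frame` and `parse_backtrace` are made up for illustration, and the regular expression assumes exactly the frame format shown in this post (no arguments, no source line information).

```python
import re
from typing import List, NamedTuple, Optional


class Frame(NamedTuple):
    index: int
    address: int
    function: str
    module: Optional[str]  # None when gdb prints no "from ..." part


# Matches lines such as:
#   #4  0x00007ffff15ae8f5 in for__exit_handler () from /path/to/lib.so
FRAME_RE = re.compile(
    r"^#(?P<index>\d+)\s+(?P<addr>0x[0-9a-fA-F]+) in (?P<func>\S+) \(\)"
    r"(?: from (?P<module>\S+))?\s*$"
)


def parse_backtrace(text: str) -> List[Frame]:
    """Extract frames from gdb 'bt' output; lines that do not match are skipped."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append(
                Frame(int(m["index"]), int(m["addr"], 16), m["func"], m["module"])
            )
    return frames


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python bt_modules.py < backtrace.txt
    for f in parse_backtrace(sys.stdin.read()):
        print(f"#{f.index:<2} {f.function:<28} {f.module or '(main executable)'}")
```

Applied to the trace above, frames #1 to #3 (kill_resource, reentrancy_cleanup, for__reentrancy_cleanup) come back without a module, meaning gdb attributes them to the executable itself, while for__exit_handler is attributed to libpetsc.so.3.9. That may hint at Fortran runtime code being present in more than one place, although the trace alone does not prove it.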

We are compiling with ifort 16.0.3.210. However, when compiling the same software (our application, but also libpetsc, the third-party library) with ifort 14.0.3.174, we didn't experience this problem. Moreover, when checking the symbols in the library with nm, the build made with the older compiler does not even show routines such as for__reentrancy_cleanup() or kill_resource().
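
The nm check described above can be scripted so that both builds are inspected the same way. The following is a minimal Python sketch, not a definitive tool: `parse_nm`, `find_symbols` and `run_nm` are hypothetical helper names, and the parser assumes the default nm output layout (optional address, one type letter, symbol name).

```python
import re
import subprocess
from typing import List, Tuple


def parse_nm(output: str) -> List[Tuple[str, str]]:
    """Return (type_letter, name) pairs from default-format nm output."""
    symbols = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 3:      # address, type, name
            symbols.append((fields[1], fields[2]))
        elif len(fields) == 2:    # undefined symbols carry no address
            symbols.append((fields[0], fields[1]))
    return symbols


def find_symbols(output: str, pattern: str) -> List[Tuple[str, str]]:
    """Keep only the symbols whose name matches the regular expression."""
    rx = re.compile(pattern)
    return [(kind, name) for kind, name in parse_nm(output) if rx.search(name)]


def run_nm(path: str, dynamic: bool = False) -> str:
    """Run nm on a file; dynamic=True adds -D to list the dynamic symbol table."""
    cmd = ["nm", "-D", path] if dynamic else ["nm", path]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python nm_grep.py /path/to/libpetsc.so
    for kind, name in find_symbols(run_nm(sys.argv[1]), r"reentrancy|kill_resource"):
        print(kind, name)
```

A type letter of T or t means the routine is defined in that file, while U means it is only referenced there and has to be supplied by another library at link or load time.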

We have been trying very hard to isolate the error and prepare a minimal example, but so far without success. The hints that we have are:

  • the problem did not occur when building with version 14.0
  • it seems to be related to multi-threading/reentrancy
  • in particular, dropping the compiler option '-fopenmp' gets rid of the segmentation fault, but ideally we would like to keep it

Does anyone have a clue what is going on here? Has some default changed between 14.0.3.174 and 16.0.3.210 that could explain the above behavior? What generates the 'for__reentrancy_cleanup()'?

It would be great if someone can help out here and maybe give some pointers. Also, please let me know if you would need more information.

Thanks,

Martijn.

Juergen_R_R

I think the full OpenMP Standard v4.0 was introduced in ifort v15, so that really is different.

Steve_Lionel

Symptom sounds like stack corruption to me. Very difficult to track down, especially in a threaded program.

Russcher__Martijn

Thank you for your comments.

In the meantime, I compiled with all OpenMP directives removed from the code. I was hoping that this would change things, but unfortunately the segmentation fault is still there. Following Steve's suggestion, I will start investigating this as a possible stack corruption, building the program with parts of the source code excluded.

However, maybe I should add that I have now also built a version with an 'empty main' where no routines are called or code is executed, but with full linking of all dependencies. The problem in this case persists...

To come up with a possible workaround, I would still be very interested in the nature of the routines 'reentrancy_cleanup' and 'for__reentrancy_cleanup'. Does anyone know what controls them? Why are they not present when building with Intel 14.0?

Regards,
Martijn.

jimdempseyatthecove

Before experimenting with code elimination, the first thing I would do is generate a linker map to verify that your data segment size does not reach or exceed 2 GB. If it does, then you may have issues that need to be resolved (e.g. by changing large/huge static arrays to dynamic arrays).
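
As a rough cross-check alongside the linker map, the output of the binutils `size` command can be tested against the 2 GB mark. This is only a sketch under stated assumptions: the helper names are hypothetical, the parser expects the default Berkeley-style layout of `size` (a header row with text, data and bss columns followed by one row per file), and treating data plus bss as "the data segment" is a simplification.

```python
TWO_GB = 2 ** 31  # threshold mentioned above, in bytes


def static_data_bytes(size_output: str) -> int:
    """Sum the 'data' and 'bss' columns of the first file in `size` output."""
    lines = [ln for ln in size_output.splitlines() if ln.strip()]
    header = lines[0].split()
    values = lines[1].split()
    row = dict(zip(header, values))
    return int(row["data"]) + int(row["bss"])


def reaches_two_gb(size_output: str) -> bool:
    """True when initialized plus zero-initialized static data is >= 2 GB."""
    return static_data_bytes(size_output) >= TWO_GB


if __name__ == "__main__":
    import subprocess
    import sys

    # Usage (hypothetical): python check_size.py ./my_executable
    out = subprocess.run(
        ["size", sys.argv[1]], capture_output=True, text=True, check=True
    ).stdout
    print(static_data_bytes(out), "bytes of static data;",
          "at or over 2 GB" if reaches_two_gb(out) else "below 2 GB")
```

The linker map remains the more detailed source, since it shows which objects contribute the large arrays.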

Jim Dempsey

Russcher__Martijn

Thanks for the tip, Jim. I have now checked the linker map, and there's nothing out of the ordinary as far as I can see.

I continued with the code elimination process. The segmentation fault persisted up to the point where I masked out the last remaining call to the intrinsic random_number subroutine. With that call as the prime suspect, I have now produced a minimal example that fails when compiled with mpif90 (MPICH2/ifort 16) and -fopenmp, and that succeeds when the -fopenmp flag is dropped or when we use ifort 14. This matches the behavior of the full program. The minimal program is:

program fails
  use petsc

  real :: a

  print *, PETSC_COMM_WORLD

  call random_number(a)

end program

Running it in gdb results in the exact same stack trace as posted in my first message.

Led by the above information (i.e. knowing that "random_number" should be included in the search query) I found a similar issue posted to this forum some time ago:

https://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux-and-mac-os-x/topic/707204

discussing a symbol conflict between libintlc.so and libc.so.6. Judging from the final comment by Martyn, it seems that our best option is to investigate moving to compiler version 17. Do you agree that that would be the way to go? (Or has there been any follow-up on that topic that you know of?)
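
To see which names two libraries both define, the dynamic symbol tables can be intersected. The Python sketch below is illustrative only: `defined_symbols` and `shared_definitions` are hypothetical helpers, the input is assumed to be the text printed by `nm -D` for each library, and version suffixes such as `@@GLIBC_2.2.5` are stripped before comparing. An overlap does not by itself prove a harmful conflict; it only narrows down where to look.

```python
from typing import List, Set


def defined_symbols(nm_output: str) -> Set[str]:
    """Names that a library defines, taken from `nm -D` output.

    A line counts as a definition when it has an address and its type
    letter is not 'U' (undefined).
    """
    names = set()
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[1] != "U":
            names.add(fields[2].split("@")[0])  # drop symbol version suffix
    return names


def shared_definitions(nm_a: str, nm_b: str) -> List[str]:
    """Sorted list of symbols defined in both libraries."""
    return sorted(defined_symbols(nm_a) & defined_symbols(nm_b))


if __name__ == "__main__":
    import subprocess
    import sys

    # Usage (hypothetical): python overlap.py /path/to/libintlc.so.5 /lib64/libc.so.6
    outputs = [
        subprocess.run(["nm", "-D", p], capture_output=True, text=True, check=True).stdout
        for p in sys.argv[1:3]
    ]
    for name in shared_definitions(*outputs):
        print(name)
```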

Regards,

Martijn Russcher

jimdempseyatthecove

If you can obtain a newer compiler, then do so. There have been numerous improvements (bug fixes and features) that make it worthwhile. I do suggest holding off on V18 (except for development testing), as some issues with it have been reported on this forum.

It is good that you found the name collision issue. Have you experimented with specifically changing the link order to force the correct function to be loaded?

Jim Dempsey

Russcher__Martijn

Hi Jim,

I have been trying to manipulate things by setting the LD_PRELOAD variable with:

LD_PRELOAD=/lib64/libc.so.6:/path/to/intel/compiler/lib/intel64_lin/libintlc.so.5

Unfortunately, without any luck... The only thing that I can think of at this point, is to get the libintlc from v17 and put that in the path, as suggested in that topic by Martyn. That, and the upgrade to 18 for the longer term.

Regards,
Martijn.

jimdempseyatthecove

LD_PRELOAD only takes effect at run time, after the build. What you need to do is change the library search order during the build, so that the affected symbol is bound to the desired library. In other words, the .so files are used during the build to identify which library to load at run time.

Jim Dempsey

Russcher__Martijn

Hi Jim,

As a matter of fact, I tried both. First following the suggestion from this thread:

https://bugzilla.redhat.com/show_bug.cgi?id=1377895

and then also rebuilding the executable and the third-party library with the library search order changed. Unfortunately, neither attempt succeeded.

Martijn.
