
Intel Ifort 15 + OpenMPI 1.8.4 + OpenMP = instantaneous segfault

schaefer__brandon

Hi,

Using Intel ifort 15, I compiled OpenMPI 1.8.4 with the following configure line:

../configure --prefix=<path to installdir>  --with-openib --with-sge CC=icc FC=ifort CXX=icpc

Unfortunately, compiling our hybrid MPI + OpenMP code with the resulting MPI compiler wrapper produces a binary that segfaults immediately after startup.

Please consider the following example, which shows the same behavior as our larger code:
program test
    use mpi
    integer :: ierr
    real(8) :: a
    call mpi_init(ierr)
    call random_number(a)
    write(*,*)"hello"
    call mpi_finalize(ierr)
end program test


Compiler Versions:
 >mpif90 --version
ifort (IFORT) 15.0.1 20141023
Copyright (C) 1985-2014 Intel Corporation.  All rights reserved.
>icc --version
icc (ICC) 15.0.1 20141023
Copyright (C) 1985-2014 Intel Corporation.  All rights reserved.
>icpc --version
icpc (ICC) 15.0.1 20141023
Copyright (C) 1985-2014 Intel Corporation.  All rights reserved.

How to compile:
>mpif90 -openmp test.f90

Error:
>./a.out
Segmentation fault
 

Please note that this bug has already been reported to the OpenMPI team, and they seem to have come to the conclusion that it is on the ifort side ( http://www.open-mpi.org/community/lists/users/2014/11/25834.php ).
 

Best,
Bastian

Steven_L_Intel1
Employee

I am not seeing any evidence in the thread you linked to that it is a compiler bug, just a "toss it over the wall" speculation. The program seems to work fine with Intel MPI. Do you have anything more concrete to suggest it is a compiler issue? Can you tell where it is getting the segfault?

schaefer__brandon

Steve,

Please find here the specific post http://www.open-mpi.org/community/lists/users/2014/11/25837.php that suggested filing a bug report here.

Debug flags do not reveal any additional information:

>mpif90 -check all -traceback -fp-stack-check -fpe0 -fpe-all=0 -ftrapuv -fstack-protector-all -mp1 -g -no-opt-assume-safe-padding -openmp test.f90
>./a.out
Segmentation fault

However, gdb suggests that it might be a problem in init_resource():

>gdb ./a.out
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-64.el6_5.2)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /kernph/schaba00/programs/openmpi-1.8.4_Intel15/a.out...done.
(gdb) run
Starting program: /kernph/schaba00/programs/openmpi-1.8.4_Intel15/a.out
[Thread debugging using libthread_db enabled]

Program received signal SIGSEGV, Segmentation fault.
0x000000000040b984 in init_resource ()
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.132.el6_5.4.x86_64 libgcc-4.4.7-4.el6.x86_64 libxml2-2.7.6-14.el6_5.2.x86_64 zlib-1.2.3-29.el6.x86_64

Best,
Bastian

 

 

Steven_L_Intel1
Employee

I saw that - but there's no evidence it's a compiler bug. init_resource is not our routine.

By the way, don't bother using -ftrapuv for anything.

jimdempseyatthecove
Honored Contributor III

Out of curiosity, what happens with your #1 test program when you comment out the call random_number and write statements (make no other changes)?

What I am probing at is something that may be related to a different thread, one where the user called omp_get_wtime() before the first parallel region and then issued a fork(), causing the child processes to inherit an initialized OpenMP environment. His symptom was a segfault.

Note that although your little test program does not call any OpenMP library routine, it may be interacting with OpenMP in an unexpected manner. You should consider single-stepping through a disassembly window, first stepping over function calls and then into them, to locate where the error occurs.
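A session along those lines might look like this (a sketch only; the faulting frame, addresses, and the useful breakpoint will differ on your system, and the text after each # is annotation, not gdb input):

>gdb ./a.out
(gdb) run                 # run until the SIGSEGV is raised
(gdb) backtrace           # which frames lead into init_resource?
(gdb) info sharedlibrary  # which .so does each address belong to?
(gdb) disassemble         # instructions around the faulting PC
(gdb) break main          # then re-run and single-step toward the fault
(gdb) run
(gdb) stepi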

I think Steve is on to something with "init_resource is not our routine." You may need to discover where this call is made from.

Jim Dempsey

schaefer__brandon

Hi Jim,

When the call to random_number and the write statement are commented out, the program runs without any error. The segfault only occurs when the call to the random number generator is present.

The code below also produces the segfault (write statement commented out). The corresponding assembly code follows.

program test
    use mpi
    integer :: ierr
    real(8) :: a
    call mpi_init(ierr)
    call random_number(a)
!    write(*,*)"hello"
    call mpi_finalize(ierr)
end program test

# mark_description "Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.1.133 Build 20141023";
	.file "test.f90"
	.text
..TXTST0:
# -- Begin  MAIN__
	.text
# mark_begin;
       .align    16,0x90
	.globl MAIN__
MAIN__:
..B1.1:                         # Preds ..B1.0
..___tag_value_MAIN__.1:                                        #test.f90:1.9
        pushq     %rbp                                          #test.f90:1.9
..___tag_value_MAIN__.3:                                        #
        movq      %rsp, %rbp                                    #test.f90:1.9
..___tag_value_MAIN__.4:                                        #
        andq      $-128, %rsp                                   #test.f90:1.9
        subq      $128, %rsp                                    #test.f90:1.9
        xorl      %esi, %esi                                    #test.f90:1.9
        movl      $3, %edi                                      #test.f90:1.9
        call      __intel_new_feature_proc_init                 #test.f90:1.9
                                # LOE rbx r12 r13 r14 r15
..B1.10:                        # Preds ..B1.1
        stmxcsr   (%rsp)                                        #test.f90:1.9
        movl      $.2.4_2_kmpc_loc_struct_pack.1, %edi          #test.f90:1.9
        xorl      %esi, %esi                                    #test.f90:1.9
        orl       $32832, (%rsp)                                #test.f90:1.9
        xorl      %eax, %eax                                    #test.f90:1.9
        ldmxcsr   (%rsp)                                        #test.f90:1.9
..___tag_value_MAIN__.6:                                        #test.f90:1.9
        call      __kmpc_begin                                  #test.f90:1.9
..___tag_value_MAIN__.7:                                        #
                                # LOE rbx r12 r13 r14 r15
..B1.2:                         # Preds ..B1.10
        movl      $__NLITPACK_0.0.1, %edi                       #test.f90:1.9
        call      for_set_reentrancy                            #test.f90:1.9
                                # LOE rbx r12 r13 r14 r15
..B1.3:                         # Preds ..B1.2
        lea       (%rsp), %rdi                                  #test.f90:5.10
..___tag_value_MAIN__.8:                                        #test.f90:5.10
        call      mpi_init_                                     #test.f90:5.10
..___tag_value_MAIN__.9:                                        #
                                # LOE rbx r12 r13 r14 r15
..B1.4:                         # Preds ..B1.3
        call      for_random_number                             #test.f90:6.10
                                # LOE rbx r12 r13 r14 r15
..B1.5:                         # Preds ..B1.4
        lea       (%rsp), %rdi                                  #test.f90:8.10
..___tag_value_MAIN__.10:                                       #test.f90:8.10
        call      mpi_finalize_                                 #test.f90:8.10
..___tag_value_MAIN__.11:                                       #
                                # LOE rbx r12 r13 r14 r15
..B1.6:                         # Preds ..B1.5
        movl      $.2.4_2_kmpc_loc_struct_pack.12, %edi         #test.f90:9.1
        xorl      %eax, %eax                                    #test.f90:9.1
..___tag_value_MAIN__.12:                                       #test.f90:9.1
        call      __kmpc_end                                    #test.f90:9.1
..___tag_value_MAIN__.13:                                       #
                                # LOE rbx r12 r13 r14 r15
..B1.7:                         # Preds ..B1.6
        movl      $1, %eax                                      #test.f90:9.1
        movq      %rbp, %rsp                                    #test.f90:9.1
        popq      %rbp                                          #test.f90:9.1
..___tag_value_MAIN__.14:                                       #
        ret                                                     #test.f90:9.1
        .align    16,0x90
..___tag_value_MAIN__.16:                                       #
                                # LOE
# mark_end;
	.type	MAIN__,@function
	.size	MAIN__,.-MAIN__
	.data
	.align 4
	.align 4
.2.4_2_kmpc_loc_struct_pack.1:
	.long	0
	.long	2
	.long	0
	.long	0
	.quad	.2.4_2__kmpc_loc_pack.0
	.align 4
.2.4_2__kmpc_loc_pack.0:
	.byte	59
	.byte	117
	.byte	110
	.byte	107
	.byte	110
	.byte	111
	.byte	119
	.byte	110
	.byte	59
	.byte	77
	.byte	65
	.byte	73
	.byte	78
	.byte	95
	.byte	95
	.byte	59
	.byte	49
	.byte	59
	.byte	49
	.byte	59
	.byte	59
	.space 3, 0x00 	# pad
	.align 4
.2.4_2_kmpc_loc_struct_pack.12:
	.long	0
	.long	2
	.long	0
	.long	0
	.quad	.2.4_2__kmpc_loc_pack.11
	.align 4
.2.4_2__kmpc_loc_pack.11:
	.byte	59
	.byte	117
	.byte	110
	.byte	107
	.byte	110
	.byte	111
	.byte	119
	.byte	110
	.byte	59
	.byte	77
	.byte	65
	.byte	73
	.byte	78
	.byte	95
	.byte	95
	.byte	59
	.byte	57
	.byte	59
	.byte	57
	.byte	59
	.byte	59
	.section .rodata, "a"
	.align 8
	.align 8
__NLITPACK_0.0.1:
	.long	0x00000002,0x00000000
	.data
# -- End  MAIN__
	.data
	.comm mpi_fortran_bottom_,4,32
	.comm mpi_fortran_in_place_,4,32
	.comm mpi_fortran_argv_null_,1,32
	.comm mpi_fortran_argvs_null_,1,32
	.comm mpi_fortran_errcodes_ignore_,4,32
	.comm mpi_fortran_status_ignore_,24,32
	.comm mpi_fortran_statuses_ignore_,24,32
	.comm mpi_fortran_unweighted_,4,32
	.comm mpi_fortran_weights_empty_,4,32
	.section .note.GNU-stack, ""
// -- Begin DWARF2 SEGMENT .eh_frame
	.section .eh_frame,"a",@progbits
.eh_frame_seg:
	.align 8
	.4byte 0x00000014
	.8byte 0x7801000100000000
	.8byte 0x0000019008070c10
	.4byte 0x00000000
	.4byte 0x00000034
	.4byte 0x0000001c
	.8byte ..___tag_value_MAIN__.1
	.8byte ..___tag_value_MAIN__.16-..___tag_value_MAIN__.1
	.byte 0x04
	.4byte ..___tag_value_MAIN__.3-..___tag_value_MAIN__.1
	.2byte 0x100e
	.byte 0x04
	.4byte ..___tag_value_MAIN__.4-..___tag_value_MAIN__.3
	.4byte 0x8610060c
	.2byte 0x0402
	.4byte ..___tag_value_MAIN__.14-..___tag_value_MAIN__.4
	.8byte 0x00000000c608070c
	.2byte 0x0000
# End

 

Steven_L_Intel1
Employee

Can you do a WRITE before the call to mpi_init and see the output? My guess is that the segfault is happening inside the call to mpi_init. 

RANDOM_NUMBER uses thread-local storage and maybe there's something in OpenMPI that doesn't like that - just a guess.

schaefer__brandon

The program crashes before mpi_init (at least the "hello" doesn't appear on stdout):

>cat test.f90 
program test
    use mpi
    integer :: ierr
    real(8) :: a
    write(*,*)"hello"
    call mpi_init(ierr)
    call random_number(a)
    call mpi_finalize(ierr)
end program test
>mpif90 -openmp  test.f90 
>./a.out 
Segmentation fault


 

Steven_L_Intel1
Employee

How about a write to unit 10, followed by a CLOSE(10), and then look for fort.10? I'm trying to figure out how far it gets.

schaefer__brandon

The fort.10 file isn't even created:

>cat test.f90
program test
    use mpi
    integer :: ierr
    write(10,*)"hello"
    close(10)
    call mpi_init(ierr)
    call random_number(a)
    call mpi_finalize(ierr)
contains
end program test
>build/install/bin/mpif90 -openmp  test.f90
>./a.out
Segmentation fault
>cat fort.10
cat: fort.10: No such file or directory

 

schaefer__brandon

Do you have an idea why the crash happens so early?

jimdempseyatthecove
Honored Contributor III

Unseen in the program listing above is the hidden call that initializes the Fortran runtime library. A guess at what is going on is that the environment-specified library load path points to an incompatible Fortran runtime library; IOW the .so file(s) are found, but they are incompatible. What's odd about this is that the program works (possibly by accident) when you comment out a few calls.

I seem to recall that there is an option to display the .so library paths as they are loaded; I can't remember what it is or whether it is available on Linux. A sketch of what I have in mind is below.
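On Linux, something along these lines should show which shared libraries are actually being resolved and loaded (a sketch; the exact output depends on your environment):

>ldd ./a.out
>LD_DEBUG=libs ./a.out 2>&1 | less

ldd lists which .so each dependency resolves to, and LD_DEBUG=libs makes the glibc dynamic loader print each library search and load as the program starts, so a wrong or mixed Fortran runtime should stand out.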

Jim Dempsey

schaefer__brandon

I have now tried a newer version of the Intel compiler. This resulted in a somewhat more detailed report:

>mpif90 --version
ifort (IFORT) 15.0.2 20150121
Copyright (C) 1985-2015 Intel Corporation.  All rights reserved.

 >cat test.f90
program test
    use mpi
    integer :: ierr
    real(8) :: a
    call mpi_init(ierr)
    call random_number(a)
    write(*,*)"hello"
    call mpi_finalize(ierr)
end program test

>mpif90 -check all -traceback -fp-stack-check -fpe0 -fpe-all=0 -fstack-protector-all -mp1 -g -no-opt-assume-safe-padding -openmp test.f90
>./a.out 

a.out:21535 terminated with signal 11 at PC=40bb94 SP=7fffc9b45f10.  Backtrace:
./a.out[0x40bb94]
./a.out[0x40bafa]
./a.out(for__reentrancy_init+0x118)[0x40ba18]
/opt/soft/apps/OpenMPI/1.8.4-iccifort-2015.2.164-GCC-4.9.2/lib/libmpi_usempif08.so.0(for_rtl_init_+0x55)[0x2add2343d405]
./a.out[0x403aa9]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2add24ba1d5d]
./a.out[0x403939]

 

John_D_6
New Contributor I

Hi Bastian,

Since you're combining MPI and OpenMP, I'd suggest using MPI_Init_thread with MPI_THREAD_FUNNELED as the required level of thread support.

schaefer__brandon

Hi John,
This doesn't make a difference:

 >cat test.f90 
program test
    use mpi
    integer :: ierr, iprov
    real(8) :: a
    call mpi_init_thread(MPI_THREAD_FUNNELED,iprov,ierr)
    call random_number(a)
    write(*,*)"hello"
    call mpi_finalize(ierr)
end program test
 >mpif90 --version
ifort (IFORT) 15.0.2 20150121
Copyright (C) 1985-2015 Intel Corporation.  All rights reserved.

 >mpif90 -check all -traceback -fp-stack-check -fpe0 -fpe-all=0 -fstack-protector-all -mp1 -g -no-opt-assume-safe-padding -openmp test.f90
 >./a.out 

a.out:79331 terminated with signal 11 at PC=40bba4 SP=7fffeb4ef190.  Backtrace:
./a.out[0x40bba4]
./a.out[0x40bb0a]
./a.out(for__reentrancy_init+0x118)[0x40ba28]
/opt/soft/apps/OpenMPI/1.8.4-iccifort-2015.2.164-GCC-4.9.2/lib/libmpi_usempif08.so.0(for_rtl_init_+0x55)[0x2abceb27e405]
./a.out[0x403aa9]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2abcec9e2d5d]
./a.out[0x403939]

 

TimP
Honored Contributor III

Although the standard specifies the use of mpi_init_thread, we've seen that OpenMPI and Intel MPI accommodate the customer expectation that it isn't required for MPI_THREAD_FUNNELED usage. Anyway, the quoted test case never enters a threaded region.

It used to be that running under gdb (a special gdb is installed with ifort/icc) could produce a more complete backtrace. If you assert that the fault is in icc, you may need to find the location in your compiled code and examine it.

It may still be that the OpenMPI support mailing list could be more helpful. It should be possible to build OpenMPI with gcc/g++/ifort as a check on your belief that icc is at fault; a configure line for such a build is sketched below. There would not necessarily be any advantage in replacing gcc/g++ in the MPI build.
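For example, based on the configure line from the first post (treat this as illustrative only; adjust the prefix and the interconnect/scheduler options for your site):

../configure --prefix=<path to installdir> --with-openib --with-sge CC=gcc CXX=g++ FC=ifort

This keeps ifort for the Fortran bindings while taking icc/icpc out of the picture, which should help narrow down which compiler is actually involved in the failure.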

jimdempseyatthecove
Honored Contributor III

Tim,

I find it curious that for__reentrancy_init and for_rtl_init are not located within the same library.

./a.out(for__reentrancy_init+0x118)[0x40ba28]
/opt/soft/apps/OpenMPI/1.8.4-iccifort-2015.2.164-GCC-4.9.2/lib/libmpi_usempif08.so.0(for_rtl_init_+0x55)[0x2abceb27e405]

Jim Dempsey

TimP
Honored Contributor III

jimdempseyatthecove wrote:

Tim,

I find it curious that for__reentrancy_init and for_rtl_init are not located within the same library.

./a.out(for__reentrancy_init+0x118)[0x40ba28]
/opt/soft/apps/OpenMPI/1.8.4-iccifort-2015.2.164-GCC-4.9.2/lib/libmpi_usempif08.so.0(for_rtl_init_+0x55)[0x2abceb27e405]

Jim Dempsey

I'm not so familiar with OpenMPI, nor would I expect many others on the Intel forums to be. I might guess that OpenMPI has a separate Fortran library for the Fortran 2008 bindings.

Steven_L_Intel1
Employee

This is an interesting observation. It suggests that the OpenMPI .so is including pieces of the Intel Fortran library. This is not good; one way to check is sketched below.
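For instance (a sketch; the library path is taken from the backtrace above), list the dynamic symbols that the OpenMPI Fortran library exports and look for Intel Fortran runtime entry points:

>nm -D /opt/soft/apps/OpenMPI/1.8.4-iccifort-2015.2.164-GCC-4.9.2/lib/libmpi_usempif08.so.0 | grep -i for_

If symbols such as for_rtl_init_ show up as defined (type T) rather than as undefined references (type U), then parts of the Intel Fortran runtime have been linked into that .so.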

jimdempseyatthecove
Honored Contributor III

Possibly an overly presumptuous use (fault) of IPO.

Jim Dempsey

pbkenned1
Employee

It was also reported that a build of  MVAPICH2 2.0 with the Intel 15.0 compilers caused the same issue - instant SEGV on the test case provided here.

I did a debug build from scratch of MVAPICH2-2.0.1 using ifort/icc 15.0.2. Since I don't have a host with InfiniBand, I configured for TCP/shared memory. For reference, my configuration:

./configure --prefix=/usr/local/MVAPICH2-2.0.1 --with-device=ch3:sock CC=icc CXX=icpc F77=ifort FC=ifort --enable-cxx --enable-hybrid --enable-fc --enable-g=all --enable-error-messages=all

Using the test case given in the description, I can't reproduce the issue:

[hacman@starch7]$ ~/bin/mpirun_rsh -hostfile ~/hosts -np 8 ./CQ367787.f90-mvapich.x
 hello
 hello
 hello
 hello
 hello
 hello
 hello
 hello
[hacman@starch7]$

Since this was also configured as a hybrid build, I checked that as well:

[hacman@starch7 LNX]$ ~/bin/mpif90 --version
ifort (IFORT) 15.0.2 20150121
Copyright (C) 1985-2015 Intel Corporation.  All rights reserved.

[hacman@starch7 LNX]$ ~/bin/mpif90 -openmp hello_omp_mpi.f90 -o hello_omp_mpi.f90-mvapich.x
[hacman@starch7 LNX]$ ~/bin/mpirun_rsh -hostfile ~/hosts -np 4 ./hello_omp_mpi.f90-mvapich.x
 MPI Process number            0  of            4  is alive
 MPI Process number            3  of            4  is alive
 MPI Process number            1  of            4  is alive
 MPI Process number            2  of            4  is alive
 Number of OMP threads           12 on process            3
 Number of OMP threads           12 on process            0
 Number of OMP threads           12 on process            2
 Process            3 got an MPI message from process 0
 Process            2 got an MPI message from process 0
 Number of OMP threads           12 on process            1
 Process            1 got an MPI message from process 0
[hacman@starch7 LNX]$

Patrick
