Maggard__Bryan

Assertion Failure, Intel MPI (Linux), 2019

I installed Intel MPI 2019 on Linux and tested it with several MPI programs (built with gcc and gfortran from GCC 8.2) without issues, using the following environment setup:

export I_MPI_DEBUG=5
export I_MPI_LIBRARY_KIND=debug
export I_MPI_OFI_LIBRARY_INTERNAL=1
. ~/intel/compilers_and_libraries_2019.0.117/linux/mpi/intel64/bin/mpivars.sh

I also created a symlink 'mpifort' pointing to mpifc, for compatibility with the wrapper name used by mpich and OpenMPI.

 

I've been trying to get OpenCoarrays-2.2.0 (opencoarrays.org) working with Intel MPI 2019 on Linux, to set up a coarray Fortran (caf) development environment for gfortran (GCC 8.2). Since OpenCoarrays is developed and tested against the mpich MPI implementation, I was optimistic that Intel MPI could work too, based on its mpich ABI compatibility.

The install.sh script used to build OpenCoarrays finds the expected ULFM routines (see fault-tolerance.org) and builds libcaf_mpi.so with the preprocessor definition -DUSE_FAILED_IMAGES:

-- Looking for signal.h - found
-- Looking for SIGKILL
-- Looking for SIGKILL - found
-- Looking for include files mpi.h, mpi-ext.h
-- Looking for include files mpi.h, mpi-ext.h - not found
-- Looking for MPIX_ERR_PROC_FAILED
-- Looking for MPIX_ERR_PROC_FAILED - found
-- Looking for MPIX_ERR_REVOKED
-- Looking for MPIX_ERR_REVOKED - found
-- Looking for MPIX_Comm_failure_ack
-- Looking for MPIX_Comm_failure_ack - found
-- Looking for MPIX_Comm_failure_get_acked
-- Looking for MPIX_Comm_failure_get_acked - found
-- Looking for MPIX_Comm_shrink
-- Looking for MPIX_Comm_shrink - found
-- Looking for MPIX_Comm_agree
-- Looking for MPIX_Comm_agree - found
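
Note that, as far as I understand CMake's "Looking for ..." checks, a "found" result only means the symbol compiles and links. It does not show that the routine is functional at run time. A minimal sketch, assuming the log layout shown above, for listing the MPIX_ symbols reported as found (the helper name found_mpix_symbols is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: list the MPIX_* symbols that a CMake configure log
# reports as found. Assumes lines of the form
# "-- Looking for <symbol> - found", as in the output above.
found_mpix_symbols() {
    sed -n 's/^-- Looking for \(MPIX_[A-Za-z0-9_]*\) - found$/\1/p'
}

# Example: feed a few lines of configure output through the filter.
# Only the line that ends in " - found" and names an MPIX_ symbol is kept.
printf '%s\n' \
    '-- Looking for MPIX_Comm_agree' \
    '-- Looking for MPIX_Comm_agree - found' \
    '-- Looking for include files mpi.h, mpi-ext.h - not found' |
    found_mpix_symbols
```

Piping the real configure log through found_mpix_symbols gives the list of ULFM entry points that the build will assume are usable.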

However, when I run a coarray Fortran program (compiled with mpifc) under mpirun, an assertion fails, as shown below. For this output, I compiled mpi_caf.o and caf_auxiliary.o (the object files that comprise libcaf_mpi.so) with -g and linked them into a coarray Fortran program (built with -fcoarray=lib) that was also compiled with -g, using the other relevant settings obtained from caf -show, mpicc -show, and mpifc -show. See "Assertion failed in file ../../src/mpid/ch4/src/ch4_comm.h at line 89: 0".

[bmaggard@localhost oca]$ mpiexec.hydra -genv I_MPI_DEBUG=5 -gdb -n 1 ./a.out
mpigdb: attaching to 17651 ./a.out localhost.localdomain
[0] (mpigdb) start
[0]     The program being debugged has been started already.
[0]     Start it from the beginning? (y or n) [answered Y; input not from terminal]
[0]     Temporary breakpoint 1 at 0x402737: file pi_caf.f90, line 1.
[0]     Starting program: /home/bmaggard/oca/a.out
[bmaggard@localhost oca]$ [0]   [Thread debugging using libthread_db enabled]
[0]     Using host libthread_db library "/lib64/libthread_db.so.1".
[0]     [New Thread 0x7ffff42bf700 (LWP 17694)]
[0]     [New Thread 0x7ffff3abe700 (LWP 17695)]
[0]     Detaching after fork from child process 17696.
[0]     Assertion failed in file ../../src/mpid/ch4/src/ch4_comm.h at line 89: 0
[0]     /home/bmaggard/intel//compilers_and_libraries_2019.0.117/linux/mpi/intel64/lib/debug/libmpi.so.12(+0xbb298e) [0x7ffff6a9f98e]
[0]     /home/bmaggard/intel//compilers_and_libraries_2019.0.117/linux/mpi/intel64/lib/debug/libmpi.so.12(MPL_backtrace_show+0x18) [0x7ffff6a9fafd]
[0]     /home/bmaggard/intel//compilers_and_libraries_2019.0.117/linux/mpi/intel64/lib/debug/libmpi.so.12(MPIR_Assert_fail+0x5c) [0x7ffff6101e0b]
[0]     /home/bmaggard/intel//compilers_and_libraries_2019.0.117/linux/mpi/intel64/lib/debug/libmpi.so.12(+0x2fc72c) [0x7ffff61e972c]
[0]     /home/bmaggard/intel//compilers_and_libraries_2019.0.117/linux/mpi/intel64/lib/debug/libmpi.so.12(+0x2fc832) [0x7ffff61e9832]
[0]     /home/bmaggard/intel//compilers_and_libraries_2019.0.117/linux/mpi/intel64/lib/debug/libmpi.so.12(MPIX_Comm_agree+0x518) [0x7ffff61ea221]
[0]     /home/bmaggard/oca/a.out() [0x40399b]
[0]     /home/bmaggard/oca/a.out() [0x404083]
[0]     /home/bmaggard/oca/a.out() [0x402716]
[0]     /home/bmaggard/oca/a.out() [0x416fc5]
[0]     /lib64/libc.so.6(__libc_start_main+0x7a) [0x7ffff4f150aa]
[0]     /home/bmaggard/oca
[0]     Abort(1) on node 0: Internal error
[0]     [Thread 0x7ffff3abe700 (LWP 17695) exited]
[0]     [Thread 0x7ffff42bf700 (LWP 17694) exited]
[0]     [Inferior 1 (process 17680) exited with code 01]
[0] (mpigdb) mpigdb: ending..
mpigdb: kill 17651
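
The backtrace above mixes resolved and unresolved frames. A small filter, sketched here under the assumption that frames follow the "path(symbol+0xoffset) [address]" layout shown, prints only the resolved symbol names (the helper name named_frames is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: print only the resolved symbol names from a backtrace.
# Assumes frames look like "path(symbol+0xoffset) [address]", as above.
# Frames such as "lib(+0x2fc832)" or "a.out()" carry no name and are skipped.
named_frames() {
    sed -n 's/.*(\([A-Za-z_][A-Za-z0-9_]*\)+0x[0-9a-fA-F]*).*/\1/p'
}

# Example with three frames in the same layout (paths shortened here).
printf '%s\n' \
    '[0]     /path/to/libmpi.so.12(+0x2fc832) [0x7ffff61e9832]' \
    '[0]     /path/to/libmpi.so.12(MPIX_Comm_agree+0x518) [0x7ffff61ea221]' \
    '[0]     /home/user/a.out() [0x40399b]' |
    named_frames
```

For the example input this prints MPIX_Comm_agree, the last named MPI entry point below the assertion in the trace above.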

The same assertion failure was observed under Win64, and there is a bit more information indicating where to look:

[0] MPI startup(): libfabric version: 1.6.1a1-impi
[0] MPI startup(): libfabric provider: sockets
[0] MPI startup(): Rank    Pid      Node name      Pin cpu
[0] MPI startup(): 0       8364     pe-mgr-laptop  {0,1,2,3,4,5,6,7}
Assertion failed in file c:\iusers\jenkins\workspace\ch4-build-windows\impi-ch4-build-windows-builder\\src\mpid\ch4\src\ch4_comm.h at line 89: 0
No backtrace info available
Abort(1) on node 0: Internal error

Inspecting the mpich source code (https://github.com/pmodels/mpich, tag v3.3b2) of src/mpid/ch4/src/ch4_comm.h shows the following (lines 88-97).

MPL_STATIC_INLINE_PREFIX int MPID_Comm_revoke(MPIR_Comm * comm_ptr, int is_remote)
{
    MPIR_FUNC_VERBOSE_STATE_DECL(MPID_STATE_MPID_COMM_REVOKE);
    MPIR_FUNC_VERBOSE_ENTER(MPID_STATE_MPID_COMM_REVOKE);

    MPIR_Assert(0);

    MPIR_FUNC_VERBOSE_EXIT(MPID_STATE_MPID_COMM_REVOKE);
    return 0;
}

If I comment out the part of the OpenCoarrays-2.2.0 build system (in src/mpi/CMakeLists.txt) that adds the -DUSE_FAILED_IMAGES definition when building libcaf_mpi.so, then 44 of the first 51 OpenCoarrays-2.2.0 test cases (none of which use failed images) pass with Intel MPI 2019, which shows as a proof of concept that Intel MPI could work.  All 78 tests (including those using failed images) pass with mpich-3.3b3, as expected, since mpich is the MPI that OpenCoarrays is developed against.

I would like to learn more about this assertion failure, and how the '#ifdef USE_FAILED_IMAGES' blocks in OpenCoarrays-2.2.0/src/mpi/mpi_caf.c interact with Intel MPI to trigger it.  I also wanted to bring this to the attention of the Intel MPI developers as they work toward the release of 2019, Update 1.

Update (Oct. 24, 2018):  I have confirmed that with Intel MPI 2018, Update 1, the combination of OpenCoarrays-2.2.0, GCC 8.2 (gcc/gfortran 8.2.1-3 from msys2.org), and Intel MPI works with -DUSE_FAILED_IMAGES on caf code that does not use failed-images features. There is no assertion failure, which makes sense, since the assertion appears to be in the ch4 part of mpich.

Update (Oct. 25, 2018): I have confirmed that Intel MPI 2019 with I_MPI_LIBRARY_KIND=release behaves similarly to the debug version: a SEGFAULT occurs when libcaf_mpi.a is built with -DUSE_FAILED_IMAGES defined, and the same OpenCoarrays-2.2.0 test cases pass when it is not defined.

Update (Oct. 28, 2018), on Win64: With Intel MPI 2018, Update 1, the first 69/69 tests pass, and the tests that fail are all in the "failed_image_" category.  With Intel MPI 2019, Initial Release, 44 of the first 52 tests pass, similar to Linux.  In this case, I had to kill hung test processes, each of which was occupying a full core, so that ctest could proceed to the next test.

 

Izaak_Beekman

Hi Bryan,

MPI-4 did not settle on a ULFM implementation and completely scuttled the existing proposals in favor of lower-level APIs, so as not to lock in or dictate particular implementations in MPI-based languages, libraries, etc. As a result, we have decided to disable failed images support by default in the next (forthcoming) OpenCoarrays release.

I'm curious what the status is of using Intel MPI on Linux (and any other platform) when you configure OpenCoarrays with `-DCAF_ENABLE_FAILED_IMAGES:BOOL=OFF`, even when the MPIX functions are found. Please let us know of any test failures you encounter when failed images support is off and you are using IMPI. Feel free to post on https://github.com/sourceryinstitute/OpenCoarrays/issues/new
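
A minimal sketch of such a configure step, assuming a plain CMake build run from a separate build directory. Only the `-DCAF_ENABLE_FAILED_IMAGES:BOOL=OFF` option is taken from this post. The helper name and the source path are hypothetical, and your compiler settings may call for further options:

```shell
#!/bin/sh
# Sketch: configure OpenCoarrays with failed-images support switched off.
# The helper name configure_opencoarrays and the SRC_DIR default are
# hypothetical; only the CAF_ENABLE_FAILED_IMAGES option comes from the post.
configure_opencoarrays() {
    # $1: path to the OpenCoarrays source tree.
    # Dry run by default: print the command. Set RUN=1 to execute cmake.
    if [ "${RUN:-0}" = "1" ]; then
        cmake -DCAF_ENABLE_FAILED_IMAGES:BOOL=OFF "$1"
    else
        printf 'cmake -DCAF_ENABLE_FAILED_IMAGES:BOOL=OFF %s\n' "$1"
    fi
}

# Run from an empty build directory; ".." assumes it sits inside the source tree.
configure_opencoarrays "${SRC_DIR:-..}"
```

The dry-run default makes it easy to check the command line before running it with RUN=1 in an environment where the Intel MPI wrappers are on the PATH.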

Thanks!
