Problem running Intel MPI w/ IB

Sylvain_Korzennik · 06-04-2015

The same code, submitted to the queue (SGE) on our cluster, crashes right away some of the time (25% of the cases?) with the following error message:

libmpifort.so.12   00002AC615FAD9BC  Unknown               Unknown  Unknown
magic.exe          00000000004D4B01  step_time_mod_mp_         335  m_step_time.F90
magic.exe          00000000004FA8EA  MAIN__                    301  magic.F90
magic.exe          00000000004042CE  Unknown               Unknown  Unknown
libc.so.6          0000003C23C1D994  Unknown               Unknown  Unknown
magic.exe          00000000004041E9  Unknown               Unknown  Unknown
[mpiexec@compute-8-21.local] control_cb (../../pm/pmiserv/pmiserv_cb.c:764): assert (!closed) failed
[mpiexec@compute-8-21.local] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@compute-8-21.local] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:480): error waiting for event
[mpiexec@compute-8-21.local] main (../../ui/mpich/mpiexec.c:945): process manager error waiting for completion

Line 335 is a 'call mpi_barrier()', hence the libmpifort.so.12 frame, I presume.

Since we use InfiniBand (I_MPI_FABRICS="shm:ofa"), I checked that the IB is working with the exact same host list, using a trivial ring-passing test program (in C and in F90). The ring-passing programs complete fine, every time. Any clue how to investigate this?

The 'magic.exe' program (a third-party, large scientific simulation code) produces the following warning(s), although it continues running when it starts OK; this could be unrelated.

[95] ERROR - handle_read_individual(): Get one packet, but need to be packetized  10628, 1 4604, 12280
[95] ERROR - handle_read_individual(): Get one packet, but need to be packetized  10628, 1 4604, 12280

Any help appreciated.

Sylvain,

BTW:

% ldd magic.exe
        linux-vdso.so.1 =>  (0x00007ffff37ef000)
        libmpifort.so.12 => /software/intel_2015/impi/5.0.1.035/intel64/lib/libmpifort.so.12 (0x00002ab829b40000)
        libmpi.so.12 => /software/intel_2015/impi/5.0.1.035/intel64/lib/debug/libmpi.so.12 (0x00002ab829dcd000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003e4de00000)
        librt.so.1 => /lib64/librt.so.1 (0x0000003e4ea00000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003e4e200000)
        libm.so.6 => /lib64/libm.so.6 (0x0000003e4da00000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003e4d600000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003e5c800000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003e4d200000)
and mpirun is aliased to /software/intel_2015/impi/5.0.1.035/bin64/mpirun, 
so it should not be a problem of mixing MPI implementations (we do support 
Intel, PGI and GNU).

Steve_H_Intel1 · 07-07-2015

Sylvain,

Some questions:

1) Is this a hybrid parallel programming application that uses both MPI and, say, OpenMP?

2) Can you run the MPI application outside of the scheduler?

3) Have you tried running the application with "mpiexec -check_mpi -n ..."?

4) How many MPI ranks are you using?

5) Is this symptom reproducible with 1 MPI rank?

6) In lieu of the environment variable setting I_MPI_FABRICS="shm:ofa", could you please try using "-genv I_MPI_FABRICS shm:tcp"?
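
For items 3 and 6, the invocations might look roughly like the sketch below. The rank count (NP) and the executable path (EXE) are placeholders assumed for illustration, not values taken from this thread, and -check_mpi needs the Intel correctness-checking library to be available. By default the script only prints the commands; set RUN=1 to execute them.

```shell
#!/bin/sh
# Sketch only: NP and EXE are hypothetical placeholders; adjust to the real job.
NP=4
EXE=./magic.exe

# Item 3: run under the Intel MPI correctness checker.
CHECK_CMD="mpiexec -check_mpi -n $NP $EXE"

# Item 6: keep shared memory within a node, but use TCP instead of OFA between nodes.
TCP_CMD="mpiexec -genv I_MPI_FABRICS shm:tcp -n $NP $EXE"

# Print each command; execute it only when RUN=1 is set in the environment.
for cmd in "$CHECK_CMD" "$TCP_CMD"; do
    echo "$cmd"
    if [ "${RUN:-0}" = "1" ]; then
        $cmd
    fi
done
```

If the crash disappears with shm:tcp but returns with shm:ofa, that would point toward the OFA/InfiniBand path rather than the application itself.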