Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

mpirun: unexpected disconnect completion event

Vaziri__Ali
Beginner

 

Hi,

I've been running on 5 distributed-memory nodes (each with 20 processors) using mpirun -n 5 -ppn 1 -hosts nd1,nd2,nd3,nd4,nd5.

Sometimes it works, sometimes it gives inaccurate results, and sometimes it crashes with the error:

"[0:nd1] unexpected disconnect completion event from [35:nd2] Fatal error in PMPI_Comm_dup: Internal MPI error!, error stack ...". 

Any suggestions for fixing this communication error when running on multiple nodes with Intel MPI (2017 Update 2)?

I already set the stack size to unlimited in my .rc file. I tested this with two different applications (one is the well-known distributed-memory solver MUMPS) and see the same issue with both. This is not a very memory-demanding job. mpirun works perfectly on 1 node; the problem only appears on multiple nodes (even 2).
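For what it's worth, this is roughly the line in my shell startup file (a minimal sketch, assuming bash; the exact file name depends on your setup):

    # remove the default per-process stack limit before launching MPI jobs
    ulimit -s unlimited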

Thanks

 

1 Reply
Vaziri__Ali
Beginner

It seems that setting the environment variable I_MPI_FABRICS=shm:tcp solves the problem. Intel MPI then uses the shm and tcp data transfer modes for intra- and inter-node communication, respectively. By default it was picking dapl for inter-node communication, which is problematic on this cluster.
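In case it helps anyone else, this is roughly how I launch the job now (a minimal sketch; the host names nd1..nd5 come from my setup and ./my_app is a placeholder for the actual executable):

    # use shared memory inside a node and TCP between nodes,
    # instead of the dapl fabric Intel MPI was selecting on its own
    export I_MPI_FABRICS=shm:tcp
    mpirun -n 5 -ppn 1 -hosts nd1,nd2,nd3,nd4,nd5 ./my_app

To confirm which fabric is actually selected, setting I_MPI_DEBUG to 2 or higher should make Intel MPI print the chosen data transfer modes at startup.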
