I've been running on 5 distributed-memory nodes (20 processors each) using mpirun -n 5 -ppn 1 -hosts nd1,nd2,nd3,nd4,nd5.
Sometimes it works, sometimes it gives inaccurate results, and sometimes it crashes with the error:
"[0:nd1] unexpected disconnect completion event from [35:nd2] Fatal error in PMPI_Comm_dup: Internal MPI error!, error stack ...".
Any suggestions for fixing this communication error when running on multiple nodes with Intel MPI (2017 Update 2)?
I already set the stack size to unlimited in my .rc file. I tested this with two different applications (one is the well-known distributed-memory solver MUMPS) and have the same issue with both. This is not a very memory-demanding job. mpirun works perfectly on 1 node; the problem only appears on multiple nodes (even 2).
It seems that setting the environment variable "I_MPI_FABRICS=shm:tcp" solves the problem. Intel MPI then uses the shm and tcp data transfer modes for intra- and inter-node communication, respectively. Without this setting it had picked dapl for inter-node communication, which is problematic on the cluster here.
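
For anyone hitting the same error, the workaround can be sketched as a short shell session. This is only a sketch under assumptions: ./my_app is a hypothetical executable name, the host list is the one from the command above, and the I_MPI_DEBUG line is an optional addition (not part of the original fix) that should make Intel MPI report the selected fabric at startup.

```shell
# Use shared memory within a node and TCP between nodes
# (the setting reported above as fixing the crashes).
export I_MPI_FABRICS=shm:tcp

# Optional, assumed helpful: a higher debug level makes Intel MPI print
# startup information, including the data transfer modes it selected.
export I_MPI_DEBUG=5

# ./my_app is a placeholder for the real application binary.
# The launch is guarded so the snippet is harmless where mpirun or the binary is missing.
if command -v mpirun >/dev/null 2>&1 && [ -x ./my_app ]; then
    mpirun -n 5 -ppn 1 -hosts nd1,nd2,nd3,nd4,nd5 ./my_app
else
    echo "Not launching: mpirun or ./my_app not found (I_MPI_FABRICS=$I_MPI_FABRICS)"
fi
```

Putting the export line in the same shell startup file as the stack size setting would make it apply to every run. TCP is usually slower than a working dapl fabric, so this trades some inter-node performance for stability.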