mcs-systems
Beginner

MPI errors on large OPA fabric

Hello,

We're getting MPI communication errors with Intel MPI on our Omni-Path cluster. The failing job uses 931 nodes; smaller runs on 600 nodes execute properly.

Other details:

We're using Intel Parallel Studio 2017 update 4 (compilers_and_libraries_2017.4.196).

There are 1024 nodes on the fabric in total, and we would like to run jobs utilizing the entire cluster.

This is an HPL run using Intel l_mklb_p_2017.3.017.

This is an example of the errors we see. What is interesting is that the receive buffer and the received message are the same size, yet the error states the message is truncated. Is there normally a header the target buffer needs to have space for?

Fatal error in MPI_Recv: Message truncated, error stack:
MPI_Recv(224)................: MPI_Recv(buf=0x2b1ee8401840, count=1455, MPI_DOUBLE, src=17, tag=10001, comm=0x84000002, status=0x7ffef5ddfe50) failed
MPID_nem_tmi_handle_rreq(738): Message from rank 17 and tag 10001 truncated; 11640 bytes received but buffer size is 11640
Fatal error in MPI_Sendrecv: Message truncated, error stack:
MPI_Sendrecv(259)............: MPI_Sendrecv(sbuf=0x2b93ba000000, scount=1164, MPI_DOUBLE, dest=13, stag=10001, rbuf=0x2b93ba002460, rcount=1746, MPI_DOUBLE, src=13, rtag=10001, comm=0x84000002, status=0x7ffcec3f3f50) failed
MPID_nem_tmi_handle_rreq(738): Message from rank 13 and tag 10001 truncated; 13968 bytes received but buffer size is 13968
Fatal error in MPI_Sendrecv: Message truncated, error stack:
MPI_Sendrecv(259)............: MPI_Sendrecv(sbuf=0x2b30f5880808, scount=24576, MPI_DOUBLE, dest=16, stag=10001, rbuf=0x2b30ef400000, rcount=1164, MPI_DOUBLE, src=16, rtag=10001, comm=0x84000002, status=0x7ffc4278ec10) failed
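For reference, the byte counts in the log do line up exactly with the posted receive counts: each reported message size equals count × 8 bytes (the size of MPI_DOUBLE), with no extra header. A quick check of the two complete entries above (values taken straight from the log; this only verifies the arithmetic, not the cause of the error):

```python
# Each (count, bytes_received) pair is copied from the error log above.
SIZEOF_MPI_DOUBLE = 8  # bytes

log_entries = [
    (1455, 11640),  # MPI_Recv, message from rank 17
    (1746, 13968),  # MPI_Sendrecv, message from rank 13
]

for count, bytes_received in log_entries:
    buffer_bytes = count * SIZEOF_MPI_DOUBLE
    # The posted buffer exactly matches the incoming message size,
    # so no header space appears to be missing on the user side.
    print(count, buffer_bytes, bytes_received, buffer_bytes == bytes_received)
```

Both entries print True, which is consistent with the observation that the buffer is not actually too small.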

 

TimP
Black Belt

This question seems more appropriate for the Clusters and HPC forum, particularly if you could quote Intel Cluster Checker diagnoses.
