Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Crash or deadlock in MPI_Allgather

Craig_Artley
Beginner
Attached is a C program which demonstrates the Intel MPI failure we observe here. The very simple program calls MPI_Allgather to gather one int from every host. Depending on the MPI ring configuration and number of processes, the program can succeed, deadlock, or crash.

Our cluster has several different kinds of nodes. For this experiment, you should know that the nodes
q01-12 are: Intel Xeon CPU E5320 @ 1.86GHz
q13-24 are: Intel Xeon CPU X5355 @ 2.66GHz
w01-48 are: Intel Xeon CPU E5620 @ 2.40GHz

They all run the same OS.

If I run 2-4 processes, homogeneous or mixed, it runs.
If I run 6 processes all on w-nodes, no problem.
If I run 6 processes on a mix of q-nodes, no problem.
If I run 6 processes on a mix of q-nodes and even older "a" and "e" nodes, no problem.
If I run 6 processes on a mix of q and w nodes, the program will crash or deadlock.

When it deadlocks, we see that some of the ranks have returned from MPI_Allgather, but others are still stuck in it.

I've attached logs for a crash, running six processes on a mix of q and w nodes.
The second run is with I_MPI_DEBUG=100.

You will see that it complains about a truncated message.

Fatal error in PMPI_Allgather: Message truncated, error stack:
PMPI_Allgather(1671)..............: MPI_Allgather(sbuf=0x1b683ef0, scount=1, MPI_INT, rbuf=0x1b67fd10, rcount=1, MPI_INT, MPI_COMM_WORLD) failed
MPIR_Allgather(520)...............:
MPIDI_CH3U_Receive_data_found(129): Message from rank 0 and tag 7 truncated; 16 bytes received but buffer size is 8

---------------
/*
 * Unit test for Intel MPI failure.
 * mpicc -mt_mpi -o lgcGab lgcGab.c
 */

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main (int argc, char** argv)
{
    int provided;
    MPI_Comm comm;
    int rank;
    int size;
    int* sendbuf;
    int* recvbuf;

    /* Request full thread support; abort if the library cannot provide it. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        printf("Need MPI_THREAD_MULTIPLE=%d but got %d\n",
               MPI_THREAD_MULTIPLE, provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    //MPI_Comm_dup(MPI_COMM_WORLD, &comm);
    comm = MPI_COMM_WORLD;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Each rank contributes one int; the receive buffer holds one int per rank. */
    sendbuf = (int*)malloc(sizeof(int));
    recvbuf = (int*)malloc(size*sizeof(int));
    sendbuf[0] = rank;

    //MPI_Gather(sendbuf,1,MPI_INT, recvbuf,1,MPI_INT, 0,comm);

    MPI_Allgather(sendbuf,1,MPI_INT, recvbuf,1,MPI_INT, comm);

    //MPI_Barrier(comm);

    MPI_Finalize();

    return 0;
}
---------------
Here are the logs:
Last login: Sat Feb 19 12:43:48 2011 from 134.132.140.63
h $ ssh q12
Last login: Sat Feb 19 12:43:51 2011 from h

q12 $ . /panfs/pan1/data/cartley/impi/intel64/bin/mpivars.sh

q12 $ cd /pan/data/cartley/ssdev/main/prowess/nodist/src/org/hpjava/mpiJava/tests/ccl

q12 $ mpdboot.py --file=machinesast.txt -n 6

q12 $ mpdtrace
q12
q14
w02
w01
q13
w03

# q12 is: Intel Xeon CPU E5320 @ 1.86GHz
# q13-14 are: Intel Xeon CPU X5355 @ 2.66GHz
# w01-03 are: Intel Xeon CPU E5620 @ 2.40GHz

q12 $ mpirun -machinefile machinesast.txt -n 6 ./lgcGab
Fatal error in PMPI_Allgather: Message truncated, error stack:
PMPI_Allgather(1671)..............: MPI_Allgather(sbuf=0x1b683ef0, scount=1, MPI_INT, rbuf=0x1b67fd10, rcount=1, MPI_INT, MPI_COMM_WORLD) failed
MPIR_Allgather(520)...............:
MPIDI_CH3U_Receive_data_found(129): Message from rank 0 and tag 7 truncated; 16 bytes received but buffer size is 8
Fatal error in PMPI_Allgather: Other MPI error, error stack:
PMPI_Allgather(1671)...........: MPI_Allgather(sbuf=0xca43d70, scount=1, MPI_INT, rbuf=0xca43da0, rcount=1, MPI_INT, MPI_COMM_WORLD) failed
MPIR_Allgather(210)............:
MPIC_Sendrecv(172).............:
MPIC_Wait(416).................:
MPIDI_CH3I_Progress(401).......:
MPID_nem_tcp_poll(2332)........:
MPID_nem_tcp_connpoll(2582)....:
state_commrdy_handler(2208)....:
MPID_nem_tcp_recv_handler(2081): socket closed
Fatal error in PMPI_Allgather: Other MPI error, error stack:
PMPI_Allgather(1671)...........: MPI_Allgather(sbuf=0x1f826d10, scount=1, MPI_INT, rbuf=0x1f826d40, rcount=1, MPI_INT, MPI_COMM_WORLD) failed
MPIR_Allgather(210)............:
MPIC_Sendrecv(172).............:
MPIC_Wait(416).................:
MPIDI_CH3I_Progress(401).......:
MPID_nem_tcp_poll(2332)........:
MPID_nem_tcp_connpoll(2582)....:
state_commrdy_handler(2208)....:
MPID_nem_tcp_recv_handler(2081): socket closed
Fatal error in PMPI_Allgather: Other MPI error, error stack:
PMPI_Allgather(1671)...........: MPI_Allgather(sbuf=0x1a95bef0, scount=1, MPI_INT, rbuf=0x1a957d10, rcount=1, MPI_INT, MPI_COMM_WORLD) failed
MPIR_Allgather(520)............:
MPIC_Sendrecv(172).............:
MPIC_Wait(416).................:
MPIDI_CH3I_Progress(401).......:
MPID_nem_tcp_poll(2332)........:
MPID_nem_tcp_connpoll(2582)....:
state_commrdy_handler(2208)....:
MPID_nem_tcp_recv_handler(2081): socket closed
Fatal error in PMPI_Allgather: Other MPI error, error stack:
PMPI_Allgather(1671)...........: MPI_Allgather(sbuf=0x31ebd10, scount=1, MPI_INT, rbuf=0x31ebd40, rcount=1, MPI_INT, MPI_COMM_WORLD) failed
MPIR_Allgather(288)............:
MPIC_Recv(87)..................:
MPIC_Wait(416).................:
MPIDI_CH3I_Progress(401).......:
MPID_nem_tcp_poll(2332)........:
MPID_nem_tcp_connpoll(2582)....:
state_commrdy_handler(2208)....:
MPID_nem_tcp_recv_handler(2081): socket closed
Fatal error in PMPI_Allgather: Other MPI error, error stack:
PMPI_Allgather(1671)...........: MPI_Allgather(sbuf=0x1771bef0, scount=1, MPI_INT, rbuf=0x17717d10, rcount=1, MPI_INT, MPI_COMM_WORLD) failed
MPIR_Allgather(520)............:
MPIC_Sendrecv(172).............:
MPIC_Wait(416).................:
MPIDI_CH3I_Progress(401).......:
MPID_nem_tcp_poll(2332)........:
MPID_nem_tcp_connpoll(2582)....:
state_commrdy_handler(2208)....:
MPID_nem_tcp_recv_handler(2081): socket closed
rank 5 in job 1 q12_38214 caused collective abort of all ranks
exit status of rank 5: return code 1
rank 4 in job 1 q12_38214 caused collective abort of all ranks
exit status of rank 4: return code 1
rank 1 in job 1 q12_38214 caused collective abort of all ranks
exit status of rank 1: return code 1
rank 3 in job 1 q12_38214 caused collective abort of all ranks
exit status of rank 3: return code 1
rank 0 in job 1 q12_38214 caused collective abort of all ranks
exit status of rank 0: return code 1


Log with I_MPI_DEBUG=100

q12 $ mpirun -machinefile machinesast.txt -n 6 -env I_MPI_DEBUG 100 ./lgcGab
[0] MPI startup(): Intel MPI Library, Version 4.0 Update 1 Build 20100910
[0] MPI startup(): Copyright (C) 2003-2010 Intel Corporation. All rights reserved.
[3] my_dlopen(): trying to dlopen: libdat.so
[0] my_dlopen(): trying to dlopen: libdat.so
[4] my_dlopen(): trying to dlopen: libdat.so
[3] MPI startup(): cannot open dynamic library libdat.so
[1] my_dlopen(): trying to dlopen: libdat.so
[4] MPI startup(): cannot open dynamic library libdat.so
[1] MPI startup(): cannot open dynamic library libdat.so
[3] my_dlopen(): Look for library libdat.so in /panfs/pan1/data/cartley/impi/intel64/libinclude ld.so.conf.d/*.conf,,/lib,/usr/lib
[3] my_dlopen(): dlopen failed: libdat.so: cannot open shared object file: No such file or directory
[3] my_dlopen(): trying to dlopen: libdat2.so
[3] MPI startup(): cannot open dynamic library libdat2.so
[4] my_dlopen(): Look for library libdat.so in /panfs/pan1/data/cartley/impi/intel64/libinclude ld.so.conf.d/*.conf,,/lib,/usr/lib
[4] my_dlopen(): dlopen failed: libdat.so: cannot open shared object file: No such file or directory
[4] my_dlopen(): trying to dlopen: libdat2.so
[0] MPI startup(): cannot open dynamic library libdat.so
[0] my_dlopen(): Look for library libdat.so in /panfs/pan1/data/cartley/impi/intel64/libinclude ld.so.conf.d/*.conf,,/lib,/usr/lib
[0] my_dlopen(): dlopen failed: libdat.so: cannot open shared object file: No such file or directory
[0] my_dlopen(): trying to dlopen: libdat2.so
[0] MPI startup(): cannot open dynamic library libdat2.so
[0] my_dlopen(): Look for library libdat2.so in /panfs/pan1/data/cartley/impi/intel64/libinclude ld.so.conf.d/*.conf,,/lib,/usr/lib
[0] my_dlopen(): dlopen failed: libdat2.so: cannot open shared object file: No such file or directory
[1] my_dlopen(): Look for library libdat.so in /panfs/pan1/data/cartley/impi/intel64/libinclude ld.so.conf.d/*.conf,,/lib,/usr/lib
[1] my_dlopen(): dlopen failed: libdat.so: cannot open shared object file: No such file or directory
[1] my_dlopen(): trying to dlopen: libdat2.so
[1] MPI startup(): cannot open dynamic library libdat2.so
[3] my_dlopen(): Look for library libdat2.so in /panfs/pan1/data/cartley/impi/intel64/libinclude ld.so.conf.d/*.conf,,/lib,/usr/lib
[3] my_dlopen(): dlopen failed: libdat2.so: cannot open shared object file: No such file or directory
[4] MPI startup(): cannot open dynamic library libdat2.so
[4] my_dlopen(): Look for library libdat2.so in /panfs/pan1/data/cartley/impi/intel64/libinclude ld.so.conf.d/*.conf,,/lib,/usr/lib
[4] my_dlopen(): dlopen failed: libdat2.so: cannot open shared object file: No such file or directory
[1] my_dlopen(): Look for library libdat2.so in /panfs/pan1/data/cartley/impi/intel64/libinclude ld.so.conf.d/*.conf,,/lib,/usr/lib
[1] my_dlopen(): dlopen failed: libdat2.so: cannot open shared object file: No such file or directory
[5] my_dlopen(): trying to dlopen: libdat.so
[5] MPI startup(): cannot open dynamic library libdat.so
[2] my_dlopen(): trying to dlopen: libdat.so
[5] my_dlopen(): Look for library libdat.so in /panfs/pan1/data/cartley/impi/intel64/libinclude ld.so.conf.d/*.conf,,/lib,/usr/lib
[5] my_dlopen(): dlopen failed: libdat.so: cannot open shared object file: No such file or directory
[5] my_dlopen(): trying to dlopen: libdat2.so
[2] MPI startup(): cannot open dynamic library libdat.so
[2] my_dlopen(): Look for library libdat.so in /panfs/pan1/data/cartley/impi/intel64/libinclude ld.so.conf.d/*.conf,,/lib,/usr/lib
[2] my_dlopen(): dlopen failed: libdat.so: cannot open shared object file: No such file or directory
[2] my_dlopen(): trying to dlopen: libdat2.so
[5] MPI startup(): cannot open dynamic library libdat2.so
[5] my_dlopen(): Look for library libdat2.so in /panfs/pan1/data/cartley/impi/intel64/libinclude ld.so.conf.d/*.conf,,/lib,/usr/lib
[5] my_dlopen(): dlopen failed: libdat2.so: cannot open shared object file: No such file or directory
[2] MPI startup(): cannot open dynamic library libdat2.so
[2] my_dlopen(): Look for library libdat2.so in /panfs/pan1/data/cartley/impi/intel64/libinclude ld.so.conf.d/*.conf,,/lib,/usr/lib
[2] my_dlopen(): dlopen failed: libdat2.so: cannot open shared object file: No such file or directory
[0] MPI startup(): fabric dapl failed: will try use tcp fabric
[0] MPI startup(): tcp data transfer mode
[1] MPI startup(): fabric dapl failed: will try use tcp fabric
[2] MPI startup(): fabric dapl failed: will try use tcp fabric
[1] MPI startup(): tcp data transfer mode
[3] MPI startup(): fabric dapl failed: will try use tcp fabric
[3] MPI startup(): tcp data transfer mode
[4] MPI startup(): fabric dapl failed: will try use tcp fabric
[4] MPI startup(): tcp data transfer mode
[2] MPI startup(): tcp data transfer mode
[5] MPI startup(): fabric dapl failed: will try use tcp fabric
[5] MPI startup(): tcp data transfer mode
[1] MPI startup(): set domain to {0,1,2,3,4,5,6,7} on node q13
[1] MPI startup(): Recognition level=1. Platform code=1. Device=4
[1] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=0)
[2] MPI startup(): set domain to {0,1,2,3,4,5,6,7} on node q14
[2] MPI startup(): Recognition level=1. Platform code=1. Device=4
[2] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=0)
[0] MPI startup(): set domain to {0,1,2,3,4,5,6,7} on node q12
[0] MPI startup(): Recognition level=1. Platform code=1. Device=4
[0] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=0)
Device_reset_idx=0
[0] MPI startup(): Allgather: 1: 0-4096 & 3-2147483647
[0] MPI startup(): Allgather: 1: 16385-2147483647 & 3-4
[0] MPI startup(): Allgather: 1: 131073-2147483647 & 17-2147483647
[0] MPI startup(): Allgather: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Allgatherv: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Allreduce: 1: 0-1024 & 0-16
[0] MPI startup(): Allreduce: 1: 0-16384 & 0-4
[0] MPI startup(): Allreduce: 3: 32769-262144 & 3-4
[0] MPI startup(): Allreduce: 4: 32769-2147483647 & 5-8
[0] MPI startup(): Allreduce: 6: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoall: 1: 0-32 & 17-2147483647
[0] MPI startup(): Alltoall: 2: 0-262144 & 3-16
[0] MPI startup(): Alltoall: 2: 524289-2147483647 & 3-4
[0] MPI startup(): Alltoall: 2: 33-16384 & 17-2147483647
[3] MPI startup(): set domain to {0,1,2,3,4,5,6,7} on node w01
[3] MPI startup(): Recognition level=1. Platform code=2. Device=4
[4] MPI startup(): set domain to {0,1,2,3,4,5,6,7} on node w02
[4] MPI startup(): Recognition level=1. Platform code=2. Device=4
[3] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=2 ppn_idx=0)
[4] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=2 ppn_idx=0)
[0] MPI startup(): Alltoall: 4: 262145-2147483647 & 5-16
[0] MPI startup(): Alltoall: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoallv: 1: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoallw: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Barrier: 1: 0-2147483647 & 0-2
[0] MPI startup(): Barrier: 2: 0-2147483647 & 3-4
[0] MPI startup(): Barrier: 3: 0-2147483647 & 5-16
[0] MPI startup(): Barrier: 4: 0-2147483647 & 0-2147483647
[0] MPI startup(): Bcast: 1: 0-2147483647 & 0-2
[0] MPI startup(): Bcast: 1: 0-8192 & 0-89
[0] MPI startup(): Bcast: 7: 0-2147483647 & 0-2147483647
[0] MPI startup(): Exscan: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Gather: 1: 0-512 & 0-2
[0] MPI startup(): Gather: 2: 65537-262144 & 3-8
[0] MPI startup(): Gather: 2: 131073-524288 & 9-32
[0] MPI startup(): Gather: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Gatherv: 1: 0-2147483647 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 0: 0-4 & 0-3
[0] MPI startup(): Reduce_scatter: 5: 257-512 & 9-16
[5] MPI startup(): set domain to {0,1,2,3,4,5,6,7} on node w03
[5] MPI startup(): Recognition level=1. Platform code=2. Device=4
[5] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=2 ppn_idx=0)
[0] MPI startup(): Reduce_scatter: 1: 0-32768 & 3-2147483647
[0] MPI startup(): Reduce_scatter: 1: 262145-524288 & 9-16
[0] MPI startup(): Reduce_scatter: 1: 524289-1048576 & 17-32
[0] MPI startup(): Reduce_scatter: 2: 0-2147483647 & 0-2147483647
[0] MPI startup(): Reduce: 2: 129-256 & 0-2
[0] MPI startup(): Reduce: 1: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scan: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scatter: 1: 0-16384 & 0-2
[0] MPI startup(): Scatter: 2: 16385-2147483647 & 0-2
[0] MPI startup(): Scatter: 2: 1025-2147483647 & 3-2147483647
[0] MPI startup(): Scatter: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scatterv: 1: 0-2147483647 & 0-2147483647
[0] Rank Pid Node name Pin cpu
[0] 0 1697 q12 {0,1,2,3,4,5,6,7}
[0] 1 21886 q13 {0,1,2,3,4,5,6,7}
[0] 2 14206 q14 {0,1,2,3,4,5,6,7}
[0] 3 24667 w01 {0,1,2,3,4,5,6,7}
[0] 4 3848 w02 {0,1,2,3,4,5,6,7}
[0] 5 3444 w03 {0,1,2,3,4,5,6,7}
[0] MPI startup(): I_MPI_DEBUG=100
[0] MPI startup(): I_MPI_INFO_BRAND=Intel Xeon
[0] MPI startup(): I_MPI_INFO_CACHE1=0,4,1,5,2,6,3,7
[0] MPI startup(): I_MPI_INFO_CACHE2=0,2,0,2,1,3,1,3
[0] MPI startup(): I_MPI_INFO_CACHE3=0,2,0,2,1,3,1,3
[0] MPI startup(): I_MPI_INFO_CACHES=2
[0] MPI startup(): I_MPI_INFO_CACHE_SHARE=1,2
[0] MPI startup(): I_MPI_INFO_CACHE_SIZE=32768,4194304
[0] MPI startup(): I_MPI_INFO_CORE=0,0,1,1,2,2,3,3
[0] MPI startup(): I_MPI_INFO_C_NAME=Clovertown
[0] MPI startup(): I_MPI_INFO_DESC=1342182600
[0] MPI startup(): I_MPI_INFO_FLGC=320445
[0] MPI startup(): I_MPI_INFO_FLGD=-1075053569
[0] MPI startup(): I_MPI_INFO_LCPU=8
[0] MPI startup(): I_MPI_INFO_MODE=259
[0] MPI startup(): I_MPI_INFO_PACK=0,1,0,1,0,1,0,1
[0] MPI startup(): I_MPI_INFO_SERIAL=E5320
[0] MPI startup(): I_MPI_INFO_SIGN=1783
[0] MPI startup(): I_MPI_INFO_STATE=ok
[0] MPI startup(): I_MPI_INFO_THREAD=0,0,0,0,0,0,0,0
[0] MPI startup(): I_MPI_INFO_VEND=1
[0] MPI startup(): I_MPI_PIN_DOM=0,1,2,3,4,5,6,7
[0] MPI startup(): I_MPI_PIN_INFO=x0,1,2,3,4,5,6,7
[0] MPI startup(): I_MPI_PIN_MAP=0 0
[0] MPI startup(): I_MPI_PIN_MAP_SIZE=1
[0] MPI startup(): MPICH_INTERFACE_HOSTNAME=34.239.17.212
Fatal error in PMPI_Allgather: Message truncated, error stack:
PMPI_Allgather(1671)..............: MPI_Allgather(sbuf=0x14bd6520, scount=1, MPI_INT, rbuf=0x14bda3e0, rcount=1, MPI_INT, MPI_COMM_WORLD) failed
MPIR_Allgather(520)...............:
MPIDI_CH3U_Receive_data_found(129): Message from rank 0 and tag 7 truncated; 16 bytes received but buffer size is 8
Fatal error in PMPI_Allgather: Other MPI error, error stack:
PMPI_Allgather(1671)...........: MPI_Allgather(sbuf=0x1301520, scount=1, MPI_INT, rbuf=0x130dcf0, rcount=1, MPI_INT, MPI_COMM_WORLD) failed
MPIR_Allgather(210)............:
MPIC_Sendrecv(172).............:
MPIC_Wait(416).................:
MPIDI_CH3I_Progress(401).......:
MPID_nem_tcp_poll(2332)........:
MPID_nem_tcp_connpoll(2582)....:
state_commrdy_handler(2208)....:
MPID_nem_tcp_recv_handler(2081): socket closed
Fatal error in PMPI_Allgather: Other MPI error, error stack:
PMPI_Allgather(1671)...........: MPI_Allgather(sbuf=0x19e8c520, scount=1, MPI_INT, rbuf=0x19e903b0, rcount=1, MPI_INT, MPI_COMM_WORLD) failed
MPIR_Allgather(288)............:
MPIC_Recv(87)..................:
MPIC_Wait(416).................:
MPIDI_CH3I_Progress(401).......:
MPID_nem_tcp_poll(2332)........:
MPID_nem_tcp_connpoll(2582)....:
state_commrdy_handler(2208)....:
MPID_nem_tcp_recv_handler(2081): socket closed
Fatal error in PMPI_Allgather: Other MPI error, error stack:
PMPI_Allgather(1671)...........: MPI_Allgather(sbuf=0xfb6b520, scount=1, MPI_INT, rbuf=0xfb6f3b0, rcount=1, MPI_INT, MPI_COMM_WORLD) failed
MPIR_Allgather(210)............:
MPIC_Sendrecv(172).............:
MPIC_Wait(416).................:
MPIDI_CH3I_Progress(401).......:
MPID_nem_tcp_poll(2332)........:
MPID_nem_tcp_connpoll(2582)....:
state_commrdy_handler(2208)....:
MPID_nem_tcp_recv_handler(2081): socket closed
Fatal error in PMPI_Allgather: Other MPI error, error stack:
PMPI_Allgather(1671)...........: MPI_Allgather(sbuf=0x17774520, scount=1, MPI_INT, rbuf=0x177783e0, rcount=1, MPI_INT, MPI_COMM_WORLD) failed
MPIR_Allgather(520)............:
MPIC_Sendrecv(172).............:
MPIC_Wait(416).................:
MPIDI_CH3I_Progress(401).......:
MPID_nem_tcp_poll(2332)........:
MPID_nem_tcp_connpoll(2582)....:
state_commrdy_handler(2208)....:
MPID_nem_tcp_recv_handler(2081): socket closed
Fatal error in PMPI_Allgather: Other MPI error, error stack:
PMPI_Allgather(1671)...........: MPI_Allgather(sbuf=0x156ac520, scount=1, MPI_INT, rbuf=0x156b03e0, rcount=1, MPI_INT, MPI_COMM_WORLD) failed
MPIR_Allgather(520)............:
MPIC_Sendrecv(172).............:
MPIC_Wait(416).................:
MPIDI_CH3I_Progress(401).......:
MPID_nem_tcp_poll(2332)........:
MPID_nem_tcp_connpoll(2582)....:
state_commrdy_handler(2208)....:
MPID_nem_tcp_recv_handler(2081): socket closed
rank 5 in job 1 q12_47899 caused collective abort of all ranks
exit status of rank 5: return code 1
rank 4 in job 1 q12_47899 caused collective abort of all ranks
exit status of rank 4: return code 1
rank 3 in job 1 q12_47899 caused collective abort of all ranks
exit status of rank 3: return code 1
rank 1 in job 1 q12_47899 caused collective abort of all ranks
exit status of rank 1: return code 1
rank 0 in job 1 q12_47899 caused collective abort of all ranks
exit status of rank 0: return code 1

3 Replies
Dmitry_K_Intel2
Employee
Hi Craig,

In the debug output you can see that the hardware used on different nodes is different:
[0] MPI startup(): Recognition level=1. Platform code=1. Device=4
[3] MPI startup(): Recognition level=1. Platform code=2. Device=4

This means that different algorithms can be selected on different nodes, and as a result the application may hang.
In such cases you can use the environment variable I_MPI_PLATFORM:
export I_MPI_PLATFORM=auto

The startup phase will be a bit longer, but the application will work.
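For example (a sketch assuming the same machine file and binary from the logs above), you can either export the variable before launching or pass it on the mpirun command line, the same way I_MPI_DEBUG was passed:

# export once for the session
export I_MPI_PLATFORM=auto
mpirun -machinefile machinesast.txt -n 6 ./lgcGab

# or pass it per run without exporting
mpirun -machinefile machinesast.txt -n 6 -env I_MPI_PLATFORM auto ./lgcGab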

Regards!
Dmitry
Craig_Artley
Beginner
That's great! Thanks very much. That fixes the test program, and more importantly, it fixes our product's entire test suite.

I could not find this switch in the Reference Manual (4.0.1.007). Is it documented anywhere? I have to say that it seems like an odd default to fail in this way.

Regards,
-craig
Dmitry_K_Intel2
Employee
This is a new option that appeared in 4.0.1. We haven't agreed on the final naming yet, which is why it was not added to the Reference Manual.

Regards!
Dmitry