<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Crash or deadlock in MPI_Allgather in Intel® MPI Library</title>
    <link>https://community.intel.com/t5/Intel-MPI-Library/Crash-or-deadlock-in-MPI-Allgather/m-p/797044#M739</link>
    <description>This is a new option that appeared in 4.0.1. We haven't agreed on the final naming yet, which is why it was not added to the Reference Manual.&lt;BR /&gt;&lt;BR /&gt;Regards!&lt;BR /&gt; Dmitry&lt;BR /&gt;</description>
    <pubDate>Thu, 24 Feb 2011 12:37:50 GMT</pubDate>
    <dc:creator>Dmitry_K_Intel2</dc:creator>
    <dc:date>2011-02-24T12:37:50Z</dc:date>
    <item>
      <title>Crash or deadlock in MPI_Allgather</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Crash-or-deadlock-in-MPI-Allgather/m-p/797041#M736</link>
      <description>Attached is a C program which demonstrates the Intel MPI failure we observe here. The very simple program calls MPI_Allgather to gather one int from every host. Depending on the MPI ring configuration and number of processes, the program can succeed, deadlock, or crash. &lt;BR /&gt;&lt;BR /&gt;Our cluster has several different kinds of nodes. For this experiment, you should know that the nodes &lt;BR /&gt;q01-12 are: Intel Xeon CPU E5320 @ 1.86GHz &lt;BR /&gt;q13-24 are: Intel Xeon CPU X5355 @ 2.66GHz &lt;BR /&gt;w01-48 are: Intel Xeon CPU E5620 @ 2.40GHz &lt;BR /&gt;&lt;BR /&gt;They all have the same OS. &lt;BR /&gt;&lt;BR /&gt;If I run 2-4 processes, homogeneous or mixed, it runs. &lt;BR /&gt;If I run 6 processes all on w-nodes, no problem. &lt;BR /&gt;If I run 6 processes on a mix of q-nodes, no problem. &lt;BR /&gt;If I run 6 processes on a mix of q-nodes and even older "a" and "e" nodes, no problem. &lt;BR /&gt;If I run 6 processes on a mix of q and w nodes, the program will crash or deadlock. &lt;BR /&gt;&lt;BR /&gt;When it deadlocks, we see that some of the hosts have been released from the MPI_Allgather, but others are still stuck there. &lt;BR /&gt;&lt;BR /&gt;I've attached logs for a crash, running six processes on a mix of q and w nodes. &lt;BR /&gt;The second run is with I_MPI_DEBUG=100. &lt;BR /&gt;&lt;BR /&gt;You will see that it complains about a truncated message. 
&lt;BR /&gt;&lt;BR /&gt;Fatal error in PMPI_Allgather: Message truncated, error stack: &lt;BR /&gt;PMPI_Allgather(1671)..............: MPI_Allgather(sbuf=0x1b683ef0, scount=1, MPI_INT, rbuf=0x1b67fd10, rcount=1, MPI_INT, MPI_COMM_WORLD) failed &lt;BR /&gt;MPIR_Allgather(520)...............: &lt;BR /&gt;MPIDI_CH3U_Receive_data_found(129): Message from rank 0 and tag 7 truncated; 16 bytes received but buffer size is 8 &lt;BR /&gt;&lt;BR /&gt;---------------&lt;BR /&gt;/*&lt;BR /&gt;* Unit test for Intel MPI failure.&lt;BR /&gt;* mpicc -mt_mpi -o lgcGab lgcGab.c&lt;BR /&gt;*/&lt;BR /&gt;&lt;BR /&gt;#include &amp;lt;mpi.h&amp;gt;&lt;BR /&gt;#include &amp;lt;stdio.h&amp;gt;&lt;BR /&gt;#include &amp;lt;stdlib.h&amp;gt;&lt;BR /&gt;&lt;BR /&gt;int main (int argc, char** argv)&lt;BR /&gt;{&lt;BR /&gt; int provided;&lt;BR /&gt; MPI_Comm comm;&lt;BR /&gt; int rank;&lt;BR /&gt; int size;&lt;BR /&gt; int* sendbuf;&lt;BR /&gt; int* recvbuf;&lt;BR /&gt;&lt;BR /&gt; MPI_Init_thread(&amp;amp;argc,&amp;amp;argv,MPI_THREAD_MULTIPLE,&amp;amp;provided);&lt;BR /&gt; if (provided &amp;lt; MPI_THREAD_MULTIPLE) {&lt;BR /&gt; printf("Need MPI_THREAD_MULTIPLE=%d but got %d\n",&lt;BR /&gt; MPI_THREAD_MULTIPLE,provided);&lt;BR /&gt; MPI_Abort(MPI_COMM_WORLD,1);&lt;BR /&gt; }&lt;BR /&gt;&lt;BR /&gt; //MPI_Comm_dup(MPI_COMM_WORLD,&amp;amp;comm);&lt;BR /&gt; comm = MPI_COMM_WORLD;&lt;BR /&gt; MPI_Comm_rank(comm,&amp;amp;rank);&lt;BR /&gt; MPI_Comm_size(comm,&amp;amp;size);&lt;BR /&gt;&lt;BR /&gt; sendbuf = (int*)malloc(sizeof(int));&lt;BR /&gt; recvbuf = (int*)malloc(size*sizeof(int));&lt;BR /&gt; sendbuf[0] = rank;&lt;BR /&gt;&lt;BR /&gt; //MPI_Gather(sendbuf,1,MPI_INT, recvbuf,1,MPI_INT, 0,comm);&lt;BR /&gt;&lt;BR /&gt; MPI_Allgather(sendbuf,1,MPI_INT, recvbuf,1,MPI_INT, comm);&lt;BR /&gt;&lt;BR /&gt; //MPI_Barrier(comm);&lt;BR /&gt;&lt;BR /&gt; MPI_Finalize();&lt;BR /&gt;&lt;BR /&gt; return 0;&lt;BR /&gt;}&lt;BR /&gt;---------------&lt;BR /&gt;Here are the logs:&lt;BR /&gt;Last login: Sat Feb 19 12:43:48 
2011 from 134.132.140.63&lt;BR /&gt;h $ ssh q12&lt;BR /&gt;Last login: Sat Feb 19 12:43:51 2011 from h&lt;BR /&gt;&lt;BR /&gt;q12 $ . /panfs/pan1/data/cartley/impi/intel64/bin/mpivars.sh&lt;BR /&gt;&lt;BR /&gt;q12 $ cd /pan/data/cartley/ssdev/main/prowess/nodist/src/org/hpjava/mpiJava/tests/ccl&lt;BR /&gt;&lt;BR /&gt;q12 $ mpdboot.py --file=machinesast.txt -n 6&lt;BR /&gt;&lt;BR /&gt;q12 $ mpdtrace&lt;BR /&gt;q12&lt;BR /&gt;q14&lt;BR /&gt;w02&lt;BR /&gt;w01&lt;BR /&gt;q13&lt;BR /&gt;w03&lt;BR /&gt;&lt;BR /&gt;# q12 is: Intel Xeon CPU E5320 @ 1.86GHz&lt;BR /&gt;# q13-14 are: Intel Xeon CPU X5355 @ 2.66GHz&lt;BR /&gt;# w01-03 are: Intel Xeon CPU E5620 @ 2.40GHz&lt;BR /&gt;&lt;BR /&gt;q12 $ mpirun -machinefile machinesast.txt -n 6 ./lgcGab&lt;BR /&gt;Fatal error in PMPI_Allgather: Message truncated, error stack:&lt;BR /&gt;PMPI_Allgather(1671)..............: MPI_Allgather(sbuf=0x1b683ef0, scount=1, MPI_INT, rbuf=0x1b67fd10, rcount=1, MPI_INT, MPI_COMM_WORLD) failed&lt;BR /&gt;MPIR_Allgather(520)...............: &lt;BR /&gt;MPIDI_CH3U_Receive_data_found(129): Message from rank 0 and tag 7 truncated; 16 bytes received but buffer size is 8&lt;BR /&gt;Fatal error in PMPI_Allgather: Other MPI error, error stack:&lt;BR /&gt;PMPI_Allgather(1671)...........: MPI_Allgather(sbuf=0xca43d70, scount=1, MPI_INT, rbuf=0xca43da0, rcount=1, MPI_INT, MPI_COMM_WORLD) failed&lt;BR /&gt;MPIR_Allgather(210)............: &lt;BR /&gt;MPIC_Sendrecv(172).............: &lt;BR /&gt;MPIC_Wait(416).................: &lt;BR /&gt;MPIDI_CH3I_Progress(401).......: &lt;BR /&gt;MPID_nem_tcp_poll(2332)........: &lt;BR /&gt;MPID_nem_tcp_connpoll(2582)....: &lt;BR /&gt;state_commrdy_handler(2208)....: &lt;BR /&gt;MPID_nem_tcp_recv_handler(2081): socket closed&lt;BR /&gt;Fatal error in PMPI_Allgather: Other MPI error, error stack:&lt;BR /&gt;PMPI_Allgather(1671)...........: MPI_Allgather(sbuf=0x1f826d10, scount=1, MPI_INT, rbuf=0x1f826d40, rcount=1, MPI_INT, MPI_COMM_WORLD) failed&lt;BR 
/&gt;MPIR_Allgather(210)............: &lt;BR /&gt;MPIC_Sendrecv(172).............: &lt;BR /&gt;MPIC_Wait(416).................: &lt;BR /&gt;MPIDI_CH3I_Progress(401).......: &lt;BR /&gt;MPID_nem_tcp_poll(2332)........: &lt;BR /&gt;MPID_nem_tcp_connpoll(2582)....: &lt;BR /&gt;state_commrdy_handler(2208)....: &lt;BR /&gt;MPID_nem_tcp_recv_handler(2081): socket closed&lt;BR /&gt;Fatal error in PMPI_Allgather: Other MPI error, error stack:&lt;BR /&gt;PMPI_Allgather(1671)...........: MPI_Allgather(sbuf=0x1a95bef0, scount=1, MPI_INT, rbuf=0x1a957d10, rcount=1, MPI_INT, MPI_COMM_WORLD) failed&lt;BR /&gt;MPIR_Allgather(520)............: &lt;BR /&gt;MPIC_Sendrecv(172).............: &lt;BR /&gt;MPIC_Wait(416).................: &lt;BR /&gt;MPIDI_CH3I_Progress(401).......: &lt;BR /&gt;MPID_nem_tcp_poll(2332)........: &lt;BR /&gt;MPID_nem_tcp_connpoll(2582)....: &lt;BR /&gt;state_commrdy_handler(2208)....: &lt;BR /&gt;MPID_nem_tcp_recv_handler(2081): socket closed&lt;BR /&gt;Fatal error in PMPI_Allgather: Other MPI error, error stack:&lt;BR /&gt;PMPI_Allgather(1671)...........: MPI_Allgather(sbuf=0x31ebd10, scount=1, MPI_INT, rbuf=0x31ebd40, rcount=1, MPI_INT, MPI_COMM_WORLD) failed&lt;BR /&gt;MPIR_Allgather(288)............: &lt;BR /&gt;MPIC_Recv(87)..................: &lt;BR /&gt;MPIC_Wait(416).................: &lt;BR /&gt;MPIDI_CH3I_Progress(401).......: &lt;BR /&gt;MPID_nem_tcp_poll(2332)........: &lt;BR /&gt;MPID_nem_tcp_connpoll(2582)....: &lt;BR /&gt;state_commrdy_handler(2208)....: &lt;BR /&gt;MPID_nem_tcp_recv_handler(2081): socket closed&lt;BR /&gt;Fatal error in PMPI_Allgather: Other MPI error, error stack:&lt;BR /&gt;PMPI_Allgather(1671)...........: MPI_Allgather(sbuf=0x1771bef0, scount=1, MPI_INT, rbuf=0x17717d10, rcount=1, MPI_INT, MPI_COMM_WORLD) failed&lt;BR /&gt;MPIR_Allgather(520)............: &lt;BR /&gt;MPIC_Sendrecv(172).............: &lt;BR /&gt;MPIC_Wait(416).................: &lt;BR /&gt;MPIDI_CH3I_Progress(401).......: &lt;BR 
/&gt;MPID_nem_tcp_poll(2332)........: &lt;BR /&gt;MPID_nem_tcp_connpoll(2582)....: &lt;BR /&gt;state_commrdy_handler(2208)....: &lt;BR /&gt;MPID_nem_tcp_recv_handler(2081): socket closed&lt;BR /&gt;rank 5 in job 1 q12_38214 caused collective abort of all ranks&lt;BR /&gt; exit status of rank 5: return code 1 &lt;BR /&gt;rank 4 in job 1 q12_38214 caused collective abort of all ranks&lt;BR /&gt; exit status of rank 4: return code 1 &lt;BR /&gt;rank 1 in job 1 q12_38214 caused collective abort of all ranks&lt;BR /&gt; exit status of rank 1: return code 1 &lt;BR /&gt;rank 3 in job 1 q12_38214 caused collective abort of all ranks&lt;BR /&gt; exit status of rank 3: return code 1 &lt;BR /&gt;rank 0 in job 1 q12_38214 caused collective abort of all ranks&lt;BR /&gt; exit status of rank 0: return code 1 &lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Log with I_MPI_DEBUG=100&lt;BR /&gt;&lt;BR /&gt;q12 $ mpirun -machinefile machinesast.txt -n 6 -env I_MPI_DEBUG 100 ./lgcGab&lt;BR /&gt;[0] MPI startup(): Intel MPI Library, Version 4.0 Update 1 Build 20100910&lt;BR /&gt;[0] MPI startup(): Copyright (C) 2003-2010 Intel Corporation. 
All rights reserved.&lt;BR /&gt;[3] my_dlopen(): trying to dlopen: libdat.so&lt;BR /&gt;[0] my_dlopen(): trying to dlopen: libdat.so&lt;BR /&gt;[4] my_dlopen(): trying to dlopen: libdat.so&lt;BR /&gt;[3] MPI startup(): cannot open dynamic library libdat.so&lt;BR /&gt;[1] my_dlopen(): trying to dlopen: libdat.so&lt;BR /&gt;[4] MPI startup(): cannot open dynamic library libdat.so&lt;BR /&gt;[1] MPI startup(): cannot open dynamic library libdat.so&lt;BR /&gt;[3] my_dlopen(): Look for library libdat.so in /panfs/pan1/data/cartley/impi/intel64/libinclude ld.so.conf.d/*.conf,,/lib,/usr/lib&lt;BR /&gt;[3] my_dlopen(): dlopen failed: libdat.so: cannot open shared object file: No such file or directory&lt;BR /&gt;[3] my_dlopen(): trying to dlopen: libdat2.so&lt;BR /&gt;[3] MPI startup(): cannot open dynamic library libdat2.so&lt;BR /&gt;[4] my_dlopen(): Look for library libdat.so in /panfs/pan1/data/cartley/impi/intel64/libinclude ld.so.conf.d/*.conf,,/lib,/usr/lib&lt;BR /&gt;[4] my_dlopen(): dlopen failed: libdat.so: cannot open shared object file: No such file or directory&lt;BR /&gt;[4] my_dlopen(): trying to dlopen: libdat2.so&lt;BR /&gt;[0] MPI startup(): cannot open dynamic library libdat.so&lt;BR /&gt;[0] my_dlopen(): Look for library libdat.so in /panfs/pan1/data/cartley/impi/intel64/libinclude ld.so.conf.d/*.conf,,/lib,/usr/lib&lt;BR /&gt;[0] my_dlopen(): dlopen failed: libdat.so: cannot open shared object file: No such file or directory&lt;BR /&gt;[0] my_dlopen(): trying to dlopen: libdat2.so&lt;BR /&gt;[0] MPI startup(): cannot open dynamic library libdat2.so&lt;BR /&gt;[0] my_dlopen(): Look for library libdat2.so in /panfs/pan1/data/cartley/impi/intel64/libinclude ld.so.conf.d/*.conf,,/lib,/usr/lib&lt;BR /&gt;[0] my_dlopen(): dlopen failed: libdat2.so: cannot open shared object file: No such file or directory&lt;BR /&gt;[1] my_dlopen(): Look for library libdat.so in /panfs/pan1/data/cartley/impi/intel64/libinclude ld.so.conf.d/*.conf,,/lib,/usr/lib&lt;BR 
/&gt;[1] my_dlopen(): dlopen failed: libdat.so: cannot open shared object file: No such file or directory&lt;BR /&gt;[1] my_dlopen(): trying to dlopen: libdat2.so&lt;BR /&gt;[1] MPI startup(): cannot open dynamic library libdat2.so&lt;BR /&gt;[3] my_dlopen(): Look for library libdat2.so in /panfs/pan1/data/cartley/impi/intel64/libinclude ld.so.conf.d/*.conf,,/lib,/usr/lib&lt;BR /&gt;[3] my_dlopen(): dlopen failed: libdat2.so: cannot open shared object file: No such file or directory&lt;BR /&gt;[4] MPI startup(): cannot open dynamic library libdat2.so&lt;BR /&gt;[4] my_dlopen(): Look for library libdat2.so in /panfs/pan1/data/cartley/impi/intel64/libinclude ld.so.conf.d/*.conf,,/lib,/usr/lib&lt;BR /&gt;[4] my_dlopen(): dlopen failed: libdat2.so: cannot open shared object file: No such file or directory&lt;BR /&gt;[1] my_dlopen(): Look for library libdat2.so in /panfs/pan1/data/cartley/impi/intel64/libinclude ld.so.conf.d/*.conf,,/lib,/usr/lib&lt;BR /&gt;[1] my_dlopen(): dlopen failed: libdat2.so: cannot open shared object file: No such file or directory&lt;BR /&gt;[5] my_dlopen(): trying to dlopen: libdat.so&lt;BR /&gt;[5] MPI startup(): cannot open dynamic library libdat.so&lt;BR /&gt;[2] my_dlopen(): trying to dlopen: libdat.so&lt;BR /&gt;[5] my_dlopen(): Look for library libdat.so in /panfs/pan1/data/cartley/impi/intel64/libinclude ld.so.conf.d/*.conf,,/lib,/usr/lib&lt;BR /&gt;[5] my_dlopen(): dlopen failed: libdat.so: cannot open shared object file: No such file or directory&lt;BR /&gt;[5] my_dlopen(): trying to dlopen: libdat2.so&lt;BR /&gt;[2] MPI startup(): cannot open dynamic library libdat.so&lt;BR /&gt;[2] my_dlopen(): Look for library libdat.so in /panfs/pan1/data/cartley/impi/intel64/libinclude ld.so.conf.d/*.conf,,/lib,/usr/lib&lt;BR /&gt;[2] my_dlopen(): dlopen failed: libdat.so: cannot open shared object file: No such file or directory&lt;BR /&gt;[2] my_dlopen(): trying to dlopen: libdat2.so&lt;BR /&gt;[5] MPI startup(): cannot open dynamic library 
libdat2.so&lt;BR /&gt;[5] my_dlopen(): Look for library libdat2.so in /panfs/pan1/data/cartley/impi/intel64/libinclude ld.so.conf.d/*.conf,,/lib,/usr/lib&lt;BR /&gt;[5] my_dlopen(): dlopen failed: libdat2.so: cannot open shared object file: No such file or directory&lt;BR /&gt;[2] MPI startup(): cannot open dynamic library libdat2.so&lt;BR /&gt;[2] my_dlopen(): Look for library libdat2.so in /panfs/pan1/data/cartley/impi/intel64/libinclude ld.so.conf.d/*.conf,,/lib,/usr/lib&lt;BR /&gt;[2] my_dlopen(): dlopen failed: libdat2.so: cannot open shared object file: No such file or directory&lt;BR /&gt;[0] MPI startup(): fabric dapl failed: will try use tcp fabric&lt;BR /&gt;[0] MPI startup(): tcp data transfer mode&lt;BR /&gt;[1] MPI startup(): fabric dapl failed: will try use tcp fabric&lt;BR /&gt;[2] MPI startup(): fabric dapl failed: will try use tcp fabric&lt;BR /&gt;[1] MPI startup(): tcp data transfer mode&lt;BR /&gt;[3] MPI startup(): fabric dapl failed: will try use tcp fabric&lt;BR /&gt;[3] MPI startup(): tcp data transfer mode&lt;BR /&gt;[4] MPI startup(): fabric dapl failed: will try use tcp fabric&lt;BR /&gt;[4] MPI startup(): tcp data transfer mode&lt;BR /&gt;[2] MPI startup(): tcp data transfer mode&lt;BR /&gt;[5] MPI startup(): fabric dapl failed: will try use tcp fabric&lt;BR /&gt;[5] MPI startup(): tcp data transfer mode&lt;BR /&gt;[1] MPI startup(): set domain to {0,1,2,3,4,5,6,7} on node q13&lt;BR /&gt;[1] MPI startup(): Recognition level=1. Platform code=1. Device=4&lt;BR /&gt;[1] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=0)&lt;BR /&gt;[2] MPI startup(): set domain to {0,1,2,3,4,5,6,7} on node q14&lt;BR /&gt;[2] MPI startup(): Recognition level=1. Platform code=1. Device=4&lt;BR /&gt;[2] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=0)&lt;BR /&gt;[0] MPI startup(): set domain to {0,1,2,3,4,5,6,7} on node q12&lt;BR /&gt;[0] MPI startup(): Recognition level=1. Platform code=1. 
Device=4&lt;BR /&gt;[0] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=0)&lt;BR /&gt;Device_reset_idx=0&lt;BR /&gt;[0] MPI startup(): Allgather: 1: 0-4096 &amp;amp; 3-2147483647&lt;BR /&gt;[0] MPI startup(): Allgather: 1: 16385-2147483647 &amp;amp; 3-4&lt;BR /&gt;[0] MPI startup(): Allgather: 1: 131073-2147483647 &amp;amp; 17-2147483647&lt;BR /&gt;[0] MPI startup(): Allgather: 3: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;[0] MPI startup(): Allgatherv: 0: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;[0] MPI startup(): Allreduce: 1: 0-1024 &amp;amp; 0-16&lt;BR /&gt;[0] MPI startup(): Allreduce: 1: 0-16384 &amp;amp; 0-4&lt;BR /&gt;[0] MPI startup(): Allreduce: 3: 32769-262144 &amp;amp; 3-4&lt;BR /&gt;[0] MPI startup(): Allreduce: 4: 32769-2147483647 &amp;amp; 5-8&lt;BR /&gt;[0] MPI startup(): Allreduce: 6: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;[0] MPI startup(): Alltoall: 1: 0-32 &amp;amp; 17-2147483647&lt;BR /&gt;[0] MPI startup(): Alltoall: 2: 0-262144 &amp;amp; 3-16&lt;BR /&gt;[0] MPI startup(): Alltoall: 2: 524289-2147483647 &amp;amp; 3-4&lt;BR /&gt;[0] MPI startup(): Alltoall: 2: 33-16384 &amp;amp; 17-2147483647&lt;BR /&gt;[0] MPI star[3] MPI startup(): set domain to {0,1,2,3,4,5,6,7} on node w01&lt;BR /&gt;[3] MPI startup(): Recognition level=1. Platform code=2. Device=4&lt;BR /&gt;[4] MPI startup(): set domain to {0,1,2,3,4,5,6,7} on node w02&lt;BR /&gt;[4] MPI startup(): Recognition level=1. Platform code=2. 
Device=4&lt;BR /&gt;[3] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=2 ppn_idx=0)&lt;BR /&gt;[4] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=2 ppn_idx=0)&lt;BR /&gt;tup(): Alltoall: 4: 262145-2147483647 &amp;amp; 5-16&lt;BR /&gt;[0] MPI startup(): Alltoall: 3: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;[0] MPI startup(): Alltoallv: 1: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;[0] MPI startup(): Alltoallw: 0: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;[0] MPI startup(): Barrier: 1: 0-2147483647 &amp;amp; 0-2&lt;BR /&gt;[0] MPI startup(): Barrier: 2: 0-2147483647 &amp;amp; 3-4&lt;BR /&gt;[0] MPI startup(): Barrier: 3: 0-2147483647 &amp;amp; 5-16&lt;BR /&gt;[0] MPI startup(): Barrier: 4: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;[0] MPI startup(): Bcast: 1: 0-2147483647 &amp;amp; 0-2&lt;BR /&gt;[0] MPI startup(): Bcast: 1: 0-8192 &amp;amp; 0-89&lt;BR /&gt;[0] MPI startup(): Bcast: 7: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;[0] MPI startup(): Exscan: 0: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;[0] MPI startup(): Gather: 1: 0-512 &amp;amp; 0-2&lt;BR /&gt;[0] MPI startup(): Gather: 2: 65537-262144 &amp;amp; 3-8&lt;BR /&gt;[0] MPI startup(): Gather: 2: 131073-524288 &amp;amp; 9-32&lt;BR /&gt;[0] MPI startup(): Gather: 3: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;[0] MPI startup(): Gatherv: 1: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;[0] MPI startup(): Reduce_scatter: 0: 0-4 &amp;amp; 0-3&lt;BR /&gt;[0] MPI startup(): Reduce_scatter: 5: 257-512 &amp;amp; 9-16&lt;BR /&gt;[0] MPI startup(): Re[5] MPI startup(): set domain to {0,1,2,3,4,5,6,7} on node w03&lt;BR /&gt;[5] MPI startup(): Recognition level=1. Platform code=2. 
Device=4&lt;BR /&gt;[5] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=2 ppn_idx=0)&lt;BR /&gt;duce_scatter: 1: 0-32768 &amp;amp; 3-2147483647&lt;BR /&gt;[0] MPI startup(): Reduce_scatter: 1: 262145-524288 &amp;amp; 9-16&lt;BR /&gt;[0] MPI startup(): Reduce_scatter: 1: 524289-1048576 &amp;amp; 17-32&lt;BR /&gt;[0] MPI startup(): Reduce_scatter: 2: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;[0] MPI startup(): Reduce: 2: 129-256 &amp;amp; 0-2&lt;BR /&gt;[0] MPI startup(): Reduce: 1: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;[0] MPI startup(): Scan: 0: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;[0] MPI startup(): Scatter: 1: 0-16384 &amp;amp; 0-2&lt;BR /&gt;[0] MPI startup(): Scatter: 2: 16385-2147483647 &amp;amp; 0-2&lt;BR /&gt;[0] MPI startup(): Scatter: 2: 1025-2147483647 &amp;amp; 3-2147483647&lt;BR /&gt;[0] MPI startup(): Scatter: 3: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;[0] MPI startup(): Scatterv: 1: 0-2147483647 &amp;amp; 0-2147483647&lt;BR /&gt;[0] Rank Pid Node name Pin cpu&lt;BR /&gt;[0] 0 1697 q12 {0,1,2,3,4,5,6,7}&lt;BR /&gt;[0] 1 21886 q13 {0,1,2,3,4,5,6,7}&lt;BR /&gt;[0] 2 14206 q14 {0,1,2,3,4,5,6,7}&lt;BR /&gt;[0] 3 24667 w01 {0,1,2,3,4,5,6,7}&lt;BR /&gt;[0] 4 3848 w02 {0,1,2,3,4,5,6,7}&lt;BR /&gt;[0] 5 3444 w03 {0,1,2,3,4,5,6,7}&lt;BR /&gt;[0] MPI startup(): I_MPI_DEBUG=100&lt;BR /&gt;[0] MPI startup(): I_MPI_INFO_BRAND=Intel Xeon &lt;BR /&gt;[0] MPI startup(): I_MPI_INFO_CACHE1=0,4,1,5,2,6,3,7&lt;BR /&gt;[0] MPI startup(): I_MPI_INFO_CACHE2=0,2,0,2,1,3,1,3&lt;BR /&gt;[0] MPI startup(): I_MPI_INFO_CACHE3=0,2,0,2,1,3,1,3&lt;BR /&gt;[0] MPI startup(): I_MPI_INFO_CACHES=2&lt;BR /&gt;[0] MPI startup(): I_MPI_INFO_CACHE_SHARE=1,2&lt;BR /&gt;[0] MPI startup(): I_MPI_INFO_CACHE_SIZE=32768,4194304&lt;BR /&gt;[0] MPI startup(): I_MPI_INFO_CORE=0,0,1,1,2,2,3,3&lt;BR /&gt;[0] MPI startup(): I_MPI_INFO_C_NAME=Clovertown&lt;BR /&gt;[0] MPI startup(): I_MPI_INFO_DESC=1342182600&lt;BR /&gt;[0] MPI startup(): 
I_MPI_INFO_FLGC=320445&lt;BR /&gt;[0] MPI startup(): I_MPI_INFO_FLGD=-1075053569&lt;BR /&gt;[0] MPI startup(): I_MPI_INFO_LCPU=8&lt;BR /&gt;[0] MPI startup(): I_MPI_INFO_MODE=259&lt;BR /&gt;[0] MPI startup(): I_MPI_INFO_PACK=0,1,0,1,0,1,0,1&lt;BR /&gt;[0] MPI startup(): I_MPI_INFO_SERIAL=E5320 &lt;BR /&gt;[0] MPI startup(): I_MPI_INFO_SIGN=1783&lt;BR /&gt;[0] MPI startup(): I_MPI_INFO_STATE=ok&lt;BR /&gt;[0] MPI startup(): I_MPI_INFO_THREAD=0,0,0,0,0,0,0,0&lt;BR /&gt;[0] MPI startup(): I_MPI_INFO_VEND=1&lt;BR /&gt;[0] MPI startup(): I_MPI_PIN_DOM=0,1,2,3,4,5,6,7&lt;BR /&gt;[0] MPI startup(): I_MPI_PIN_INFO=x0,1,2,3,4,5,6,7&lt;BR /&gt;[0] MPI startup(): I_MPI_PIN_MAP=0 0&lt;BR /&gt;[0] MPI startup(): I_MPI_PIN_MAP_SIZE=1&lt;BR /&gt;[0] MPI startup(): MPICH_INTERFACE_HOSTNAME=34.239.17.212&lt;BR /&gt;Fatal error in PMPI_Allgather: Message truncated, error stack:&lt;BR /&gt;PMPI_Allgather(1671)..............: MPI_Allgather(sbuf=0x14bd6520, scount=1, MPI_INT, rbuf=0x14bda3e0, rcount=1, MPI_INT, MPI_COMM_WORLD) failed&lt;BR /&gt;MPIR_Allgather(520)...............: &lt;BR /&gt;MPIDI_CH3U_Receive_data_found(129): Message from rank 0 and tag 7 truncated; 16 bytes received but buffer size is 8&lt;BR /&gt;Fatal error in PMPI_Allgather: Other MPI error, error stack:&lt;BR /&gt;PMPI_Allgather(1671)...........: MPI_Allgather(sbuf=0x1301520, scount=1, MPI_INT, rbuf=0x130dcf0, rcount=1, MPI_INT, MPI_COMM_WORLD) failed&lt;BR /&gt;MPIR_Allgather(210)............: &lt;BR /&gt;MPIC_Sendrecv(172).............: &lt;BR /&gt;MPIC_Wait(416).................: &lt;BR /&gt;MPIDI_CH3I_Progress(401).......: &lt;BR /&gt;MPID_nem_tcp_poll(2332)........: &lt;BR /&gt;MPID_nem_tcp_connpoll(2582)....: &lt;BR /&gt;state_commrdy_handler(2208)....: &lt;BR /&gt;MPID_nem_tcp_recv_handler(2081): socket closed&lt;BR /&gt;Fatal error in PMPI_Allgather: Other MPI error, error stack:&lt;BR /&gt;PMPI_Allgather(1671)...........: MPI_Allgather(sbuf=0x19e8c520, scount=1, MPI_INT, rbuf=0x19e903b0, rcount=1, 
MPI_INT, MPI_COMM_WORLD) failed&lt;BR /&gt;MPIR_Allgather(288)............: &lt;BR /&gt;MPIC_Recv(87)..................: &lt;BR /&gt;MPIC_Wait(416).................: &lt;BR /&gt;MPIDI_CH3I_Progress(401).......: &lt;BR /&gt;MPID_nem_tcp_poll(2332)........: &lt;BR /&gt;MPID_nem_tcp_connpoll(2582)....: &lt;BR /&gt;state_commrdy_handler(2208)....: &lt;BR /&gt;MPID_nem_tcp_recv_handler(2081): socket closed&lt;BR /&gt;Fatal error in PMPI_Allgather: Other MPI error, error stack:&lt;BR /&gt;PMPI_Allgather(1671)...........: MPI_Allgather(sbuf=0xfb6b520, scount=1, MPI_INT, rbuf=0xfb6f3b0, rcount=1, MPI_INT, MPI_COMM_WORLD) failed&lt;BR /&gt;MPIR_Allgather(210)............: &lt;BR /&gt;MPIC_Sendrecv(172).............: &lt;BR /&gt;MPIC_Wait(416).................: &lt;BR /&gt;MPIDI_CH3I_Progress(401).......: &lt;BR /&gt;MPID_nem_tcp_poll(2332)........: &lt;BR /&gt;MPID_nem_tcp_connpoll(2582)....: &lt;BR /&gt;state_commrdy_handler(2208)....: &lt;BR /&gt;MPID_nem_tcp_recv_handler(2081): socket closed&lt;BR /&gt;Fatal error in PMPI_Allgather: Other MPI error, error stack:&lt;BR /&gt;PMPI_Allgather(1671)...........: MPI_Allgather(sbuf=0x17774520, scount=1, MPI_INT, rbuf=0x177783e0, rcount=1, MPI_INT, MPI_COMM_WORLD) failed&lt;BR /&gt;MPIR_Allgather(520)............: &lt;BR /&gt;MPIC_Sendrecv(172).............: &lt;BR /&gt;MPIC_Wait(416).................: &lt;BR /&gt;MPIDI_CH3I_Progress(401).......: &lt;BR /&gt;MPID_nem_tcp_poll(2332)........: &lt;BR /&gt;MPID_nem_tcp_connpoll(2582)....: &lt;BR /&gt;state_commrdy_handler(2208)....: &lt;BR /&gt;MPID_nem_tcp_recv_handler(2081): socket closed&lt;BR /&gt;Fatal error in PMPI_Allgather: Other MPI error, error stack:&lt;BR /&gt;PMPI_Allgather(1671)...........: MPI_Allgather(sbuf=0x156ac520, scount=1, MPI_INT, rbuf=0x156b03e0, rcount=1, MPI_INT, MPI_COMM_WORLD) failed&lt;BR /&gt;MPIR_Allgather(520)............: &lt;BR /&gt;MPIC_Sendrecv(172).............: &lt;BR /&gt;MPIC_Wait(416).................: &lt;BR 
/&gt;MPIDI_CH3I_Progress(401).......: &lt;BR /&gt;MPID_nem_tcp_poll(2332)........: &lt;BR /&gt;MPID_nem_tcp_connpoll(2582)....: &lt;BR /&gt;state_commrdy_handler(2208)....: &lt;BR /&gt;MPID_nem_tcp_recv_handler(2081): socket closed&lt;BR /&gt;rank 5 in job 1 q12_47899 caused collective abort of all ranks&lt;BR /&gt; exit status of rank 5: return code 1 &lt;BR /&gt;rank 4 in job 1 q12_47899 caused collective abort of all ranks&lt;BR /&gt; exit status of rank 4: return code 1 &lt;BR /&gt;rank 3 in job 1 q12_47899 caused collective abort of all ranks&lt;BR /&gt; exit status of rank 3: return code 1 &lt;BR /&gt;rank 1 in job 1 q12_47899 caused collective abort of all ranks&lt;BR /&gt; exit status of rank 1: return code 1 &lt;BR /&gt;rank 0 in job 1 q12_47899 caused collective abort of all ranks&lt;BR /&gt; exit status of rank 0: return code 1 &lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Mon, 21 Feb 2011 18:16:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Crash-or-deadlock-in-MPI-Allgather/m-p/797041#M736</guid>
      <dc:creator>Craig_Artley</dc:creator>
      <dc:date>2011-02-21T18:16:15Z</dc:date>
    </item>
    <item>
      <title>Crash or deadlock in MPI_Allgather</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Crash-or-deadlock-in-MPI-Allgather/m-p/797042#M737</link>
      <description>Hi Craig,&lt;BR /&gt;&lt;BR /&gt;In the debug output you can see that the hardware detected on different nodes is different:&lt;BR /&gt;[0] MPI startup(): Recognition level=1. Platform code=1. Device=4&lt;BR /&gt;[3] MPI startup(): Recognition level=1. Platform code=2. Device=4&lt;BR /&gt;&lt;BR /&gt;This means that different algorithms can be used on different nodes, and as a result an application may hang.&lt;BR /&gt;In such cases you can set the environment variable I_MPI_PLATFORM:&lt;BR /&gt; export I_MPI_PLATFORM=auto&lt;BR /&gt;&lt;BR /&gt;The startup phase will be a bit longer, but the application will work.&lt;BR /&gt;&lt;BR /&gt;Regards!&lt;BR /&gt; Dmitry&lt;BR /&gt;</description>
      <pubDate>Tue, 22 Feb 2011 07:41:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Crash-or-deadlock-in-MPI-Allgather/m-p/797042#M737</guid>
      <dc:creator>Dmitry_K_Intel2</dc:creator>
      <dc:date>2011-02-22T07:41:43Z</dc:date>
    </item>
    <item>
      <title>Crash or deadlock in MPI_Allgather</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Crash-or-deadlock-in-MPI-Allgather/m-p/797043#M738</link>
      <description>That's great! Thanks very much. That fixes the test program, and more importantly, it fixes our product's entire test suite. &lt;BR /&gt;&lt;BR /&gt;I could not find this switch in the Reference Manual (4.0.1.007). Is it documented anywhere? I have to say that it seems like an odd default to fail in this way.&lt;BR /&gt;&lt;BR /&gt;Regards,&lt;BR /&gt; -craig</description>
      <pubDate>Tue, 22 Feb 2011 16:28:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Crash-or-deadlock-in-MPI-Allgather/m-p/797043#M738</guid>
      <dc:creator>Craig_Artley</dc:creator>
      <dc:date>2011-02-22T16:28:28Z</dc:date>
    </item>
    <item>
      <title>Crash or deadlock in MPI_Allgather</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Crash-or-deadlock-in-MPI-Allgather/m-p/797044#M739</link>
      <description>This is a new option that appeared in 4.0.1. We haven't agreed on the final naming yet, which is why it was not added to the Reference Manual.&lt;BR /&gt;&lt;BR /&gt;Regards!&lt;BR /&gt; Dmitry&lt;BR /&gt;</description>
      <pubDate>Thu, 24 Feb 2011 12:37:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Crash-or-deadlock-in-MPI-Allgather/m-p/797044#M739</guid>
      <dc:creator>Dmitry_K_Intel2</dc:creator>
      <dc:date>2011-02-24T12:37:50Z</dc:date>
    </item>
  </channel>
</rss>

