<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic MPI errors on large OPA fabric in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/MPI-errors-on-large-OPA-fabric/m-p/1117226#M7491</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;

&lt;P&gt;We're getting MPI communication errors with Intel MPI on our Omni-Path cluster. The failing job uses 931 nodes; smaller runs on 600 nodes execute properly.&lt;/P&gt;

&lt;P&gt;Other details:&lt;/P&gt;

&lt;P&gt;We're using Intel Parallel Studio 2017 update 4 (compilers_and_libraries_2017.4.196).&lt;/P&gt;

&lt;P&gt;There are 1024 nodes in total on the fabric, and we would like to run jobs that use the entire cluster.&lt;/P&gt;

&lt;P&gt;This is an HPL run using Intel l_mklb_p_2017.3.017.&lt;/P&gt;

&lt;P&gt;Here is an example of the errors we see. Interestingly, the number of bytes received and the buffer size are reported as equal, yet the error states that the message was truncated. Is there normally a header that the target buffer needs to have space for?&lt;/P&gt;

&lt;P&gt;Fatal error in MPI_Recv: Message truncated, error stack:&lt;BR /&gt;
	MPI_Recv(224)................: MPI_Recv(buf=0x2b1ee8401840, count=1455, MPI_DOUBLE, src=17, tag=10001, comm=0x84000002, status=0x7ffef5ddfe50) failed&lt;BR /&gt;
	MPID_nem_tmi_handle_rreq(738): Message from rank 17 and tag 10001 truncated; 11640 bytes received but buffer size is 11640&lt;BR /&gt;
	Fatal error in MPI_Sendrecv: Message truncated, error stack:&lt;BR /&gt;
	MPI_Sendrecv(259)............: MPI_Sendrecv(sbuf=0x2b93ba000000, scount=1164, MPI_DOUBLE, dest=13, stag=10001, rbuf=0x2b93ba002460, rcount=1746, MPI_DOUBLE, src=13, rtag=10001, comm=0x84000002, status=0x7ffcec3f3f50) failed&lt;BR /&gt;
	MPID_nem_tmi_handle_rreq(738): Message from rank 13 and tag 10001 truncated; 13968 bytes received but buffer size is 13968&lt;BR /&gt;
	Fatal error in MPI_Sendrecv: Message truncated, error stack:&lt;BR /&gt;
	MPI_Sendrecv(259)............: MPI_Sendrecv(sbuf=0x2b30f5880808, scount=24576, MPI_DOUBLE, dest=16, stag=10001, rbuf=0x2b30ef400000, rcount=1164, MPI_DOUBLE, src=16, rtag=10001, comm=0x84000002, status=0x7ffc4278ec10) failed&lt;/P&gt;

</description>
    <pubDate>Sat, 27 May 2017 14:41:03 GMT</pubDate>
    <dc:creator>mcs-systems</dc:creator>
    <dc:date>2017-05-27T14:41:03Z</dc:date>
    <item>
      <title>MPI errors on large OPA fabric</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/MPI-errors-on-large-OPA-fabric/m-p/1117226#M7491</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;

&lt;P&gt;We're getting MPI communication errors with Intel MPI on our Omni-Path cluster. The failing job uses 931 nodes; smaller runs on 600 nodes execute properly.&lt;/P&gt;

&lt;P&gt;Other details:&lt;/P&gt;

&lt;P&gt;We're using Intel Parallel Studio 2017 update 4 (compilers_and_libraries_2017.4.196).&lt;/P&gt;

&lt;P&gt;There are 1024 nodes in total on the fabric, and we would like to run jobs that use the entire cluster.&lt;/P&gt;

&lt;P&gt;This is an HPL run using Intel l_mklb_p_2017.3.017.&lt;/P&gt;

&lt;P&gt;Here is an example of the errors we see. Interestingly, the number of bytes received and the buffer size are reported as equal, yet the error states that the message was truncated. Is there normally a header that the target buffer needs to have space for?&lt;/P&gt;

&lt;P&gt;Fatal error in MPI_Recv: Message truncated, error stack:&lt;BR /&gt;
	MPI_Recv(224)................: MPI_Recv(buf=0x2b1ee8401840, count=1455, MPI_DOUBLE, src=17, tag=10001, comm=0x84000002, status=0x7ffef5ddfe50) failed&lt;BR /&gt;
	MPID_nem_tmi_handle_rreq(738): Message from rank 17 and tag 10001 truncated; 11640 bytes received but buffer size is 11640&lt;BR /&gt;
	Fatal error in MPI_Sendrecv: Message truncated, error stack:&lt;BR /&gt;
	MPI_Sendrecv(259)............: MPI_Sendrecv(sbuf=0x2b93ba000000, scount=1164, MPI_DOUBLE, dest=13, stag=10001, rbuf=0x2b93ba002460, rcount=1746, MPI_DOUBLE, src=13, rtag=10001, comm=0x84000002, status=0x7ffcec3f3f50) failed&lt;BR /&gt;
	MPID_nem_tmi_handle_rreq(738): Message from rank 13 and tag 10001 truncated; 13968 bytes received but buffer size is 13968&lt;BR /&gt;
	Fatal error in MPI_Sendrecv: Message truncated, error stack:&lt;BR /&gt;
	MPI_Sendrecv(259)............: MPI_Sendrecv(sbuf=0x2b30f5880808, scount=24576, MPI_DOUBLE, dest=16, stag=10001, rbuf=0x2b30ef400000, rcount=1164, MPI_DOUBLE, src=16, rtag=10001, comm=0x84000002, status=0x7ffc4278ec10) failed&lt;/P&gt;

</description>
      <pubDate>Sat, 27 May 2017 14:41:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/MPI-errors-on-large-OPA-fabric/m-p/1117226#M7491</guid>
      <dc:creator>mcs-systems</dc:creator>
      <dc:date>2017-05-27T14:41:03Z</dc:date>
    </item>
    <item>
      <title>The question seems more</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/MPI-errors-on-large-OPA-fabric/m-p/1117227#M7492</link>
      <description>&lt;P&gt;This question seems more appropriate for the cluster HPC forum, where it would help if you could quote the Intel Cluster Checker diagnostics.&lt;/P&gt;</description>
      <pubDate>Sat, 27 May 2017 18:16:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/MPI-errors-on-large-OPA-fabric/m-p/1117227#M7492</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2017-05-27T18:16:27Z</dc:date>
    </item>
  </channel>
</rss>

