<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Application crashes when run on 2 nodes (caused collective abort of all ranks, killed by signal 9) in Intel® MPI Library</title>
    <link>https://community.intel.com/t5/Intel-MPI-Library/Application-crashes-when-run-on-2-nodes-caused-collective-abort/m-p/777011#M299</link>
    <description>Hi Kunal,&lt;BR /&gt;&lt;BR /&gt;With only information about the MPI library, it's hardly possible to say anything about this issue.&lt;BR /&gt;It could be an incorrect buffer allocation, a lack of memory, an unstable connection... anything.&lt;BR /&gt;As a first step, could you run your application with the "-check_mpi" option? Just run: "mpirun -check_mpi ...."&lt;BR /&gt;Do you see the same issue using fewer cores? Is your issue absolutely reproducible?&lt;BR /&gt;BTW: when using "mpirun" you don't need to have an mpd ring - "mpirun" creates a new mpd ring, starts the application, and then stops the mpd ring it created.&lt;BR /&gt;Also, by compiling your application with '-g' and running with I_MPI_DEBUG=5 (or higher), you'll get additional information which may help you understand the issue.&lt;BR /&gt;&lt;BR /&gt;Regards!&lt;BR /&gt;---Dmitry&lt;BR /&gt;</description>
    <pubDate>Tue, 13 Dec 2011 06:16:57 GMT</pubDate>
    <dc:creator>Dmitry_K_Intel2</dc:creator>
    <dc:date>2011-12-13T06:16:57Z</dc:date>
    <item>
      <title>Application crashes when run on 2 nodes (caused collective abort of all ranks, killed by signal 9)</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Application-crashes-when-run-on-2-nodes-caused-collective-abort/m-p/777010#M298</link>
      <description>Hi,&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;We have a huge HPC application compiled with the Intel compiler that uses the Intel MPI Library. It works fine when run on a single node (with multiple processes) but crashes when run on 2 nodes (with multiple processes) with the following message:&lt;BR /&gt;&lt;BR /&gt; -------------&lt;BR /&gt; rank 63 in job 1 blade4_34649 caused collective abort of all ranks&lt;BR /&gt; exit status of rank 63: killed by signal 9&lt;BR /&gt;&lt;BR /&gt; ---&lt;BR /&gt; ---------------&lt;BR /&gt;&lt;BR /&gt; I'm not sure if it is an Intel MPI related error or an error in the application.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt; Some info about the Intel MPI version we are using and the mpd ring consisting of 2 nodes:&lt;BR /&gt;&lt;BR /&gt; -------------------&lt;BR /&gt; [kunal@GPUBlade exp]$ which mpirun&lt;BR /&gt; /opt/intel/impi/4.0.1.007/intel64/bin/mpirun&lt;BR /&gt;&lt;BR /&gt; [kunal@GPUBlade exp]$ mpirun --version&lt;BR /&gt; Intel MPI Library for Linux, 64-bit applications, Version 4.0 Update 1 Build 20100910&lt;BR /&gt; Copyright (C) 2003-2010 Intel Corporation. All rights reserved.&lt;BR /&gt;&lt;BR /&gt; [kunal@GPUBlade exp]$ mpdtrace -l&lt;BR /&gt; GPUBlade_37085 (GPUBlade)&lt;BR /&gt; blade4_57372 (192.168.1.102)&lt;BR /&gt;&lt;BR /&gt;-------------------&lt;BR /&gt;&lt;BR /&gt; Any suggestions on how I should go about debugging this error?&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;Thanks &amp;amp; Regards,&lt;BR /&gt;Kunal&lt;/DIV&gt;</description>
      <pubDate>Tue, 13 Dec 2011 04:48:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Application-crashes-when-run-on-2-nodes-caused-collective-abort/m-p/777010#M298</guid>
      <dc:creator>Kunal_Rao</dc:creator>
      <dc:date>2011-12-13T04:48:07Z</dc:date>
    </item>
    <item>
      <title>Application crashes when run on 2 nodes (caused collective abort)</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Application-crashes-when-run-on-2-nodes-caused-collective-abort/m-p/777011#M299</link>
      <description>Hi Kunal,&lt;BR /&gt;&lt;BR /&gt;With only information about the MPI library, it's hardly possible to say anything about this issue.&lt;BR /&gt;It could be an incorrect buffer allocation, a lack of memory, an unstable connection... anything.&lt;BR /&gt;As a first step, could you run your application with the "-check_mpi" option? Just run: "mpirun -check_mpi ...."&lt;BR /&gt;Do you see the same issue using fewer cores? Is your issue absolutely reproducible?&lt;BR /&gt;BTW: when using "mpirun" you don't need to have an mpd ring - "mpirun" creates a new mpd ring, starts the application, and then stops the mpd ring it created.&lt;BR /&gt;Also, by compiling your application with '-g' and running with I_MPI_DEBUG=5 (or higher), you'll get additional information which may help you understand the issue.&lt;BR /&gt;&lt;BR /&gt;Regards!&lt;BR /&gt;---Dmitry&lt;BR /&gt;</description>
      <pubDate>Tue, 13 Dec 2011 06:16:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Application-crashes-when-run-on-2-nodes-caused-collective-abort/m-p/777011#M299</guid>
      <dc:creator>Dmitry_K_Intel2</dc:creator>
      <dc:date>2011-12-13T06:16:57Z</dc:date>
    </item>
    <item>
      <title>Application crashes when run on 2 nodes (caused collective abort)</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Application-crashes-when-run-on-2-nodes-caused-collective-abort/m-p/777012#M300</link>
      <description>Thanks Dmitry for your reply. Your suggestions were helpful. I was able to run with those extra debugging flags and got some more insight into the problem.&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;The application crashes with the following message in the mpi_comm_dup_ MPI call in the application:&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;----------&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;[0] ERROR: LOCAL:MPI:CALL_FAILED: error&lt;BR /&gt;[0] ERROR:  Invalid communicator.&lt;BR /&gt;[0] ERROR:  Error occurred at:&lt;BR /&gt;[0] ERROR:    mpi_comm_dup_(comm=0xffffffffc4000000 &amp;lt;&amp;lt;INVALID&amp;gt;&amp;gt;, *newcomm=0x3d930e0, *ierr=0x7fffd2afbddc)&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;---------&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;I'll look more into it. Let me know if you have any further suggestions.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;Thanks &amp;amp; Regards,&lt;/DIV&gt;&lt;DIV&gt;Kunal&lt;/DIV&gt;</description>
      <pubDate>Thu, 15 Dec 2011 04:02:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Application-crashes-when-run-on-2-nodes-caused-collective-abort/m-p/777012#M300</guid>
      <dc:creator>Kunal_Rao</dc:creator>
      <dc:date>2011-12-15T04:02:35Z</dc:date>
    </item>
    <item>
      <title>Application crashes when run on 2 nodes (caused collective abort)</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Application-crashes-when-run-on-2-nodes-caused-collective-abort/m-p/777013#M301</link>
      <description>Kunal,&lt;BR /&gt;&lt;BR /&gt;It looks like the first argument of the MPI_COMM_DUP function is incorrect.&lt;BR /&gt;For example: MPI_COMM_DUP(MPI_COMM_WORLD, new_comm, ierr)&lt;BR /&gt;The communicator argument should be an INTEGER.&lt;BR /&gt;&lt;BR /&gt;Regards!&lt;BR /&gt; Dmitry&lt;BR /&gt;</description>
      <pubDate>Thu, 15 Dec 2011 06:16:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Application-crashes-when-run-on-2-nodes-caused-collective-abort/m-p/777013#M301</guid>
      <dc:creator>Dmitry_K_Intel2</dc:creator>
      <dc:date>2011-12-15T06:16:13Z</dc:date>
    </item>
    <item>
      <title>Hi,</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Application-crashes-when-run-on-2-nodes-caused-collective-abort/m-p/777014#M302</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;I have compiled espresso with Intel MPI and the MKL library, but I am getting a "Failure during collective" error, whereas it works fine with OpenMPI.&lt;/P&gt;

&lt;P&gt;Is there a problem with Intel MPI?&lt;/P&gt;

&lt;P&gt;&lt;BR /&gt;
	Fatal error in PMPI_Bcast: Other MPI error, error stack:&lt;BR /&gt;
	PMPI_Bcast(2112)........: MPI_Bcast(buf=0x516f460, count=96, MPI_DOUBLE_PRECISION, root=4, comm=0x84000004) failed&lt;BR /&gt;
	MPIR_Bcast_impl(1670)...:&lt;BR /&gt;
	I_MPIR_Bcast_intra(1887): Failure during collective&lt;BR /&gt;
	MPIR_Bcast_intra(1524)..: Failure during collective&lt;BR /&gt;
	Fatal error in PMPI_Bcast: Other MPI error, error stack:&lt;BR /&gt;
	PMPI_Bcast(2112)........: MPI_Bcast(buf=0x5300310, count=96, MPI_DOUBLE_PRECISION, root=4, comm=0x84000004) failed&lt;BR /&gt;
	MPIR_Bcast_impl(1670)...:&lt;BR /&gt;
	I_MPIR_Bcast_intra(1887): Failure during collective&lt;BR /&gt;
	MPIR_Bcast_intra(1524)..: Failure during collective&lt;BR /&gt;
	Fatal error in PMPI_Bcast: Other MPI error, error stack:&lt;BR /&gt;
	PMPI_Bcast(2112)........: MPI_Bcast(buf=0x6b295c0, count=96, MPI_DOUBLE_PRECISION, root=4, comm=0x84000004) failed&lt;BR /&gt;
	MPIR_Bcast_impl(1670)...:&lt;BR /&gt;
	I_MPIR_Bcast_intra(1887): Failure during collective&lt;BR /&gt;
	MPIR_Bcast_intra(1524)..: Failure during collective&lt;BR /&gt;
	Fatal error in PMPI_Bcast: Other MPI error, error stack:&lt;BR /&gt;
	PMPI_Bcast(2112)........: MPI_Bcast(buf=0x67183d0, count=96, MPI_DOUBLE_PRECISION, root=4, comm=0x84000004) failed&lt;BR /&gt;
	MPIR_Bcast_impl(1670)...:&lt;BR /&gt;
	I_MPIR_Bcast_intra(1887): Failure during collective&lt;BR /&gt;
	MPIR_Bcast_intra(1524)..: Failure during collective&lt;BR /&gt;
	Fatal error in PMPI_Bcast: Other MPI error, error stack:&lt;BR /&gt;
	PMPI_Bcast(2112)........: MPI_Bcast(buf=0x4f794c0, count=96, MPI_DOUBLE_PRECISION, root=4, comm=0x84000004) failed&lt;BR /&gt;
	MPIR_Bcast_impl(1670)...:&lt;BR /&gt;
	I_MPIR_Bcast_intra(1887): Failure during collective&lt;BR /&gt;
	MPIR_Bcast_intra(1524)..: Failure during collective&lt;BR /&gt;
	[0:n125] unexpected disconnect completion event from [22:n122]&lt;BR /&gt;
	Assertion failed in file ../../dapl_conn_rc.c at line 1128: 0&lt;BR /&gt;
	internal ABORT - process 0&lt;BR /&gt;
	Fatal error in PMPI_Bcast: Other MPI error, error stack:&lt;BR /&gt;
	PMPI_Bcast(2112)........: MPI_Bcast(buf=0x56bfe30, count=96, MPI_DOUBLE_PRECISION, root=4, comm=0x84000004) failed&lt;BR /&gt;
	MPIR_Bcast_impl(1670)...:&lt;BR /&gt;
	I_MPIR_Bcast_intra(1887): Failure during collective&lt;BR /&gt;
	MPIR_Bcast_intra(1524)..: Failure during collective&lt;BR /&gt;
	/var/spool/PBS/mom_priv/epilogue: line 30: kill: (5089) - No such process&lt;/P&gt;

&lt;P&gt;&lt;BR /&gt;
	Kindly help us resolve this.&lt;/P&gt;

&lt;P&gt;&lt;BR /&gt;
	Thanks&lt;BR /&gt;
	sanjiv&lt;/P&gt;</description>
      <pubDate>Wed, 03 Jun 2015 09:32:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Application-crashes-when-run-on-2-nodes-caused-collective-abort/m-p/777014#M302</guid>
      <dc:creator>Sanjiv_T_</dc:creator>
      <dc:date>2015-06-03T09:32:52Z</dc:date>
    </item>
  </channel>
</rss>

