<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic MPI program behavior on node crash ... in Intel® MPI Library</title>
    <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-program-behavior-on-node-crash/m-p/776909#M287</link>
    <description>&lt;P&gt;In the production environment, it happens that some nodes crash once in a while. What is the behavior of Intel's MPI when an MPI program loses contact with some of its processes? Would there be any difference if the crashed node contains rank 0? Is there any option in Intel's MPI to control the behavior in such a situation, so that the program is cleaned up in case one of the MPI processes is lost? Thank you very much, Tofu&lt;/P&gt;</description>
    <pubDate>Mon, 30 Jul 2012 01:38:00 GMT</pubDate>
    <dc:creator>Tofu</dc:creator>
    <dc:date>2012-07-30T01:38:00Z</dc:date>
    <item>
      <title>MPI program behavior on node crash ...</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-program-behavior-on-node-crash/m-p/776909#M287</link>
      <description>&lt;P&gt;In the production environment, it happens that some nodes crash once in a while. What is the behavior of Intel's MPI when an MPI program loses contact with some of its processes? Would there be any difference if the crashed node contains rank 0? Is there any option in Intel's MPI to control the behavior in such a situation, so that the program is cleaned up in case one of the MPI processes is lost? Thank you very much, Tofu&lt;/P&gt;</description>
      <pubDate>Mon, 30 Jul 2012 01:38:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-program-behavior-on-node-crash/m-p/776909#M287</guid>
      <dc:creator>Tofu</dc:creator>
      <dc:date>2012-07-30T01:38:00Z</dc:date>
    </item>
    <item>
      <title>MPI program behavior on node crash ...</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-program-behavior-on-node-crash/m-p/776910#M288</link>
      <description>Hi Tofu,&lt;BR /&gt;&lt;BR /&gt;If a node containing a process crashes, the entire job will end. You can use the -cleanup option (or I_MPI_HYDRA_CLEANUP) to create a temporary file that will list the PID of each process, and the mpicleanup utility will use this file to clean the environment if the job does not end correctly. You can also use I_MPI_MPIRUN_CLEANUP if you are using MPD instead of Hydra.&lt;BR /&gt;&lt;BR /&gt;Sincerely,&lt;BR /&gt;James Tullos&lt;BR /&gt;Technical Consulting Engineer&lt;BR /&gt;Intel Cluster Tools</description>
      <pubDate>Mon, 30 Jul 2012 14:34:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-program-behavior-on-node-crash/m-p/776910#M288</guid>
      <dc:creator>James_T_Intel</dc:creator>
      <dc:date>2012-07-30T14:34:40Z</dc:date>
    </item>
    <item>
      <title>Hi,</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-program-behavior-on-node-crash/m-p/776911#M289</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;I found a similar situation in which the mpirun command does not terminate even when some of the processes do not start up properly. Here I have two nodes, p01 and p02, running the program /opt/intel/impi/4.1.0.024/test/test.f90. Here is what I've done:&lt;/P&gt;

&lt;P&gt;cp /opt/intel/impi/4.1.0.024/test/test.f90 /path/to/shared/storage&lt;/P&gt;

&lt;P&gt;cd /path/to/shared/storage&lt;/P&gt;

&lt;P&gt;mpiifort test.f90&lt;/P&gt;

&lt;P&gt;mpirun -hosts p01,p02 -n 32 ./a.out&lt;/P&gt;

&lt;P&gt;Hello world: rank 0 of 32 running on p01&lt;BR /&gt;
	Hello world: rank 1 of 32 running on p01&lt;BR /&gt;
	Hello world: rank 2 of 32 running on p01&lt;BR /&gt;
	Hello world: rank 3 of 32 running on p01&lt;BR /&gt;
	Hello world: rank 4 of 32 running on p01&lt;BR /&gt;
	Hello world: rank 5 of 32 running on p01&lt;BR /&gt;
	Hello world: rank 6 of 32 running on p01&lt;BR /&gt;
	Hello world: rank 7 of 32 running on p01&lt;BR /&gt;
	Hello world: rank 8 of 32 running on p01&lt;BR /&gt;
	Hello world: rank 9 of 32 running on p01&lt;BR /&gt;
	Hello world: rank 10 of 32 running on p01&lt;BR /&gt;
	Hello world: rank 11 of 32 running on p01&lt;BR /&gt;
	Hello world: rank 12 of 32 running on p01&lt;BR /&gt;
	Hello world: rank 13 of 32 running on p01&lt;BR /&gt;
	Hello world: rank 14 of 32 running on p01&lt;BR /&gt;
	Hello world: rank 15 of 32 running on p01&lt;BR /&gt;
	Hello world: rank 16 of 32 running on p02&lt;BR /&gt;
	Hello world: rank 17 of 32 running on p02&lt;BR /&gt;
	Hello world: rank 18 of 32 running on p02&lt;BR /&gt;
	Hello world: rank 19 of 32 running on p02&lt;BR /&gt;
	Hello world: rank 20 of 32 running on p02&lt;BR /&gt;
	Hello world: rank 21 of 32 running on p02&lt;BR /&gt;
	Hello world: rank 22 of 32 running on p02&lt;BR /&gt;
	Hello world: rank 23 of 32 running on p02&lt;BR /&gt;
	Hello world: rank 24 of 32 running on p02&lt;BR /&gt;
	Hello world: rank 25 of 32 running on p02&lt;BR /&gt;
	Hello world: rank 26 of 32 running on p02&lt;BR /&gt;
	Hello world: rank 27 of 32 running on p02&lt;BR /&gt;
	Hello world: rank 28 of 32 running on p02&lt;BR /&gt;
	Hello world: rank 29 of 32 running on p02&lt;BR /&gt;
	Hello world: rank 30 of 32 running on p02&lt;BR /&gt;
	Hello world: rank 31 of 32 running on p02&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Now, on p02, I umount the shared storage and then issue the command again:&lt;/P&gt;

&lt;P&gt;mpirun -hosts p01,p02 -n 32 ./a.out&lt;/P&gt;

&lt;P&gt;[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)&lt;BR /&gt;
	[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)&lt;BR /&gt;
	[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)&lt;BR /&gt;
	[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)&lt;BR /&gt;
	[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)&lt;BR /&gt;
	[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)&lt;BR /&gt;
	[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)&lt;BR /&gt;
	[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)&lt;BR /&gt;
	[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)&lt;BR /&gt;
	[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)&lt;BR /&gt;
	[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)&lt;BR /&gt;
	[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)&lt;BR /&gt;
	[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)&lt;BR /&gt;
	[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)&lt;BR /&gt;
	[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)&lt;BR /&gt;
	[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)&lt;/P&gt;

&lt;P&gt;However, the mpirun process does not terminate, and the ps tree shows the following:&lt;/P&gt;

&lt;P&gt;100776 pts/14 &amp;nbsp; S &amp;nbsp; &amp;nbsp; &amp;nbsp;0:00 &amp;nbsp; &amp;nbsp; &amp;nbsp;\_ /bin/sh /opt/intel/impi/4.1.0.024/intel64/bin/mpirun -hosts p01,p02 -ppn 1 -n 2 ./a.out&lt;BR /&gt;
	100781 pts/14 &amp;nbsp; S &amp;nbsp; &amp;nbsp; &amp;nbsp;0:00 &amp;nbsp; &amp;nbsp; &amp;nbsp;| &amp;nbsp; \_ mpiexec.hydra -hosts p01 p02 -ppn 1 -n 2 ./a.out&lt;BR /&gt;
	100782 pts/14 &amp;nbsp; S &amp;nbsp; &amp;nbsp; &amp;nbsp;0:00 &amp;nbsp; &amp;nbsp; &amp;nbsp;| &amp;nbsp; &amp;nbsp; &amp;nbsp; \_ /usr/bin/ssh -x -q p01 /opt/intel/impi/4.1.0.024/intel64/bin/pmi_proxy --control-port metro:36671 --pmi-connect lazy-cache --pmi-aggregate -s 0 --rmk slurm --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --proxy-id 0&lt;BR /&gt;
	100783 pts/14 &amp;nbsp; Z &amp;nbsp; &amp;nbsp; &amp;nbsp;0:00 &amp;nbsp; &amp;nbsp; &amp;nbsp;| &amp;nbsp; &amp;nbsp; &amp;nbsp; \_ [ssh] &amp;lt;defunct&amp;gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Just wondering if there is any option that can help in this situation, so that mpirun can terminate properly instead of hanging.&lt;/P&gt;
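
&lt;P&gt;For reference, the only related knobs we are aware of are the cleanup ones mentioned earlier in this thread; if I read them correctly, they only tidy up leftover processes after the fact rather than making mpirun itself return. The usage sketch below is just how we understand it, so please correct me if this is not the intended use:&lt;/P&gt;

&lt;P&gt;export I_MPI_HYDRA_CLEANUP=1&amp;nbsp;&amp;nbsp;# or pass -cleanup to mpirun; records the PID of each process in a temporary file&lt;BR /&gt;
	mpirun -hosts p01,p02 -n 32 ./a.out&lt;BR /&gt;
	mpicleanup&amp;nbsp;&amp;nbsp;# afterwards, cleans the environment using that PID file if the job did not end correctly&lt;/P&gt;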

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;regards,&lt;/P&gt;

&lt;P&gt;C. Bean&lt;/P&gt;</description>
      <pubDate>Mon, 02 Dec 2013 07:44:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-program-behavior-on-node-crash/m-p/776911#M289</guid>
      <dc:creator>YY_C_</dc:creator>
      <dc:date>2013-12-02T07:44:00Z</dc:date>
    </item>
    <item>
      <title>Could you use mpdallexit?</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-program-behavior-on-node-crash/m-p/776912#M290</link>
      <description>&lt;P&gt;Could you use mpdallexit?&lt;/P&gt;</description>
      <pubDate>Mon, 02 Dec 2013 18:23:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-program-behavior-on-node-crash/m-p/776912#M290</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2013-12-02T18:23:28Z</dc:date>
    </item>
    <item>
      <title>We have corrected some</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-program-behavior-on-node-crash/m-p/776913#M291</link>
      <description>&lt;P&gt;We have corrected some problems related to ranks not exiting correctly.&amp;nbsp; Please try with Version 4.1 Update 2 and see if this resolves the problem.&lt;/P&gt;</description>
      <pubDate>Mon, 02 Dec 2013 22:09:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-program-behavior-on-node-crash/m-p/776913#M291</guid>
      <dc:creator>James_T_Intel</dc:creator>
      <dc:date>2013-12-02T22:09:56Z</dc:date>
    </item>
    <item>
      <title>Hi,</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-program-behavior-on-node-crash/m-p/776914#M292</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;Our situation is slightly different, but we still encountered a similar problem, even though we're using version 4.1 update 2. We started running the HPL benchmark and one of the nodes crashed in the middle of the run. However, the mpiexec.hydra does not terminate:&lt;/P&gt;

&lt;P&gt;&amp;nbsp;28351 pts/4 &amp;nbsp; &amp;nbsp;Ss &amp;nbsp; &amp;nbsp; 0:00 &amp;nbsp;\_ /bin/bash&lt;BR /&gt;
	&amp;nbsp;32295 pts/4 &amp;nbsp; &amp;nbsp;S+ &amp;nbsp; &amp;nbsp; 0:00 &amp;nbsp; &amp;nbsp; &amp;nbsp;\_ /bin/sh /opt/intel/impi/4.1.2.040/intel64/bin/mpirun -hosts node107,node213 -n 32 ./xhpl_intel64_dynamic&lt;BR /&gt;
	&amp;nbsp;32300 pts/4 &amp;nbsp; &amp;nbsp;S+ &amp;nbsp; &amp;nbsp; 0:00 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;\_ mpiexec.hydra -hosts node107 node213 -n 32 ./xhpl_intel64_dynamic&lt;BR /&gt;
	&amp;nbsp;32301 pts/4 &amp;nbsp; &amp;nbsp;Z &amp;nbsp; &amp;nbsp; &amp;nbsp;0:00 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;\_ [ssh] &amp;lt;defunct&amp;gt;&lt;BR /&gt;
	&amp;nbsp;32302 pts/4 &amp;nbsp; &amp;nbsp;S &amp;nbsp; &amp;nbsp; &amp;nbsp;0:00 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;\_ /usr/bin/ssh -x -q node213 /opt/intel/impi/4.1.2.040/intel64/bin/pmi_proxy --control-port master:49817 --pmi-connect lazy-cache --pmi-aggregate -s 0 --rmk slurm --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1138594473 --proxy-id 1&lt;/P&gt;
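
&lt;P&gt;Presumably we could kill the hung launcher by hand, something like the following (PIDs taken from the listing above), but that is not workable for unattended production jobs:&lt;/P&gt;

&lt;P&gt;kill 32300&amp;nbsp;&amp;nbsp;# the hung mpiexec.hydra&lt;BR /&gt;
	kill 32295&amp;nbsp;&amp;nbsp;# the mpirun wrapper shell, if it does not exit on its own&lt;/P&gt;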

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Any clue?&lt;/P&gt;

&lt;P&gt;regards,&lt;/P&gt;

&lt;P&gt;Tofu&lt;/P&gt;</description>
      <pubDate>Tue, 03 Dec 2013 03:50:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-program-behavior-on-node-crash/m-p/776914#M292</guid>
      <dc:creator>Tofu</dc:creator>
      <dc:date>2013-12-03T03:50:23Z</dc:date>
    </item>
    <item>
      <title>The 4.1 update 2 Intel MPI</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-program-behavior-on-node-crash/m-p/776915#M293</link>
      <description>&lt;P&gt;Intel MPI 4.1 update 2 works fine for the initial start-up issue; i.e., if a node does not mount the shared storage, mpirun terminates properly.&lt;/P&gt;

&lt;P&gt;We also tried unplugging a compute node in the middle of a run and found that mpirun hangs with [ssh] &amp;lt;defunct&amp;gt;. Is there any way to make mpirun terminate in such a situation?&lt;/P&gt;
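
&lt;P&gt;Is something like I_MPI_JOB_TIMEOUT the intended workaround here? As far as I can tell from the reference manual, it only sets a hard wall-clock limit, for example:&lt;/P&gt;

&lt;P&gt;export I_MPI_JOB_TIMEOUT=3600&amp;nbsp;&amp;nbsp;# terminate the job after 3600 seconds (value here is just an example)&lt;BR /&gt;
	mpirun -hosts p01,p02 -n 32 ./a.out&lt;/P&gt;

&lt;P&gt;which would cap how long the job can hang but would not actually detect the dead node.&lt;/P&gt;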

&lt;P&gt;regards,&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;C. Bean&lt;/P&gt;</description>
      <pubDate>Wed, 04 Dec 2013 07:48:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-program-behavior-on-node-crash/m-p/776915#M293</guid>
      <dc:creator>YY_C_</dc:creator>
      <dc:date>2013-12-04T07:48:43Z</dc:date>
    </item>
    <item>
      <title>Any update on this issue?  We</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-program-behavior-on-node-crash/m-p/776916#M294</link>
      <description>&lt;P&gt;Any update on this issue? We tried compiling the application with MVAPICH2 and running it with its mpiexec.hydra, and the whole application is terminated whenever a compute node goes down.&lt;/P&gt;

&lt;P&gt;regards,&lt;/P&gt;

&lt;P&gt;tofu&lt;/P&gt;</description>
      <pubDate>Tue, 10 Dec 2013 03:12:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-program-behavior-on-node-crash/m-p/776916#M294</guid>
      <dc:creator>Tofu</dc:creator>
      <dc:date>2013-12-10T03:12:57Z</dc:date>
    </item>
  </channel>
</rss>

