<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How does Intel MPI handle network failures in Intel® MPI Library</title>
    <link>https://community.intel.com/t5/Intel-MPI-Library/How-does-Intel-MPI-handle-network-failures/m-p/901548#M2214</link>
    <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/449238"&gt;dludick&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt;Hi all, &lt;BR /&gt;&lt;BR /&gt;I am new to the forum and have a question regarding network failures and MPI applications (specifically using the Intel MPI binding).&lt;BR /&gt;&lt;BR /&gt;What happens if I have a a number of processes running on a cluster, and someone unplugs a network cable? As far as I have read, the MPI processes gets terminated immediately. How can I circumvent this, say by using some sort of a WAIT or TIMEOUT command if a network fault is detected, so that they can see if maybe they can again recover after anumber of (set) seconds?&lt;BR /&gt;&lt;BR /&gt;Any help would be very much appreciated!&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;BR /&gt;Hi Dludick,&lt;BR /&gt;&lt;BR /&gt;You are quite right: there is no way to restore a connection after an unexpected network problem.&lt;BR /&gt;But you could try to implement a form of fault tolerance by setting the error handler to MPI_ERRORS_RETURN. The application should then test the return code of each MPI call and execute suitable recovery code when a call was unsuccessful. But this depends on the device driver.&lt;BR /&gt;If you call MPI_Recv and there is no connection, the application is more likely to hang than to return an error.&lt;BR /&gt;&lt;BR /&gt;Intel MPI Library 4.0 will have a fault tolerance implementation, but you need to design your application so that it is able to recover after a network fault.&lt;BR /&gt;&lt;BR /&gt;Best wishes,&lt;BR /&gt; Dmitry&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;</description>
    <pubDate>Tue, 27 Oct 2009 12:30:48 GMT</pubDate>
    <dc:creator>Dmitry_K_Intel2</dc:creator>
    <dc:date>2009-10-27T12:30:48Z</dc:date>
    <item>
      <title>How does Intel MPI handle network failures</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/How-does-Intel-MPI-handle-network-failures/m-p/901547#M2213</link>
      <description>Hi all, &lt;BR /&gt;&lt;BR /&gt;I am new to the forum and have a question regarding network failures and MPI applications (specifically using the Intel MPI binding).&lt;BR /&gt;&lt;BR /&gt;What happens if I have a number of processes running on a cluster and someone unplugs a network cable? As far as I have read, the MPI processes get terminated immediately. How can I circumvent this, say by using some sort of WAIT or TIMEOUT command when a network fault is detected, so that the processes can check whether they can recover after a set number of seconds?&lt;BR /&gt;&lt;BR /&gt;Any help would be very much appreciated!</description>
      <pubDate>Tue, 27 Oct 2009 10:53:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/How-does-Intel-MPI-handle-network-failures/m-p/901547#M2213</guid>
      <dc:creator>dludick</dc:creator>
      <dc:date>2009-10-27T10:53:40Z</dc:date>
    </item>
    <item>
      <title>Re: How does Intel MPI handle network failures</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/How-does-Intel-MPI-handle-network-failures/m-p/901548#M2214</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/449238"&gt;dludick&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt;Hi all, &lt;BR /&gt;&lt;BR /&gt;I am new to the forum and have a question regarding network failures and MPI applications (specifically using the Intel MPI binding).&lt;BR /&gt;&lt;BR /&gt;What happens if I have a a number of processes running on a cluster, and someone unplugs a network cable? As far as I have read, the MPI processes gets terminated immediately. How can I circumvent this, say by using some sort of a WAIT or TIMEOUT command if a network fault is detected, so that they can see if maybe they can again recover after anumber of (set) seconds?&lt;BR /&gt;&lt;BR /&gt;Any help would be very much appreciated!&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;BR /&gt;Hi Dludick,&lt;BR /&gt;&lt;BR /&gt;You are quite right: there is no way to restore a connection after an unexpected network problem.&lt;BR /&gt;But you could try to implement a form of fault tolerance by setting the error handler to MPI_ERRORS_RETURN. The application should then test the return code of each MPI call and execute suitable recovery code when a call was unsuccessful. But this depends on the device driver.&lt;BR /&gt;If you call MPI_Recv and there is no connection, the application is more likely to hang than to return an error.&lt;BR /&gt;&lt;BR /&gt;Intel MPI Library 4.0 will have a fault tolerance implementation, but you need to design your application so that it is able to recover after a network fault.&lt;BR /&gt;&lt;BR /&gt;Best wishes,&lt;BR /&gt; Dmitry&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Tue, 27 Oct 2009 12:30:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/How-does-Intel-MPI-handle-network-failures/m-p/901548#M2214</guid>
      <dc:creator>Dmitry_K_Intel2</dc:creator>
      <dc:date>2009-10-27T12:30:48Z</dc:date>
    </item>
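    <!-- Editorial sketch, not part of the original thread.

The reply above notes that a blocking receive tends to hang when the connection is gone, while the original question asks for some kind of timeout. One common pattern is to post a nonblocking receive and poll it against a deadline. The helper name recv_int_with_timeout and its return convention are hypothetical, chosen only for illustration; the MPI calls themselves are standard. Whether a lost link is ever reported as an error still depends on the MPI implementation and network driver, as the reply says.

```c
#include <mpi.h>

/* Hypothetical helper: waits up to timeout_s seconds for one int.
 * Returns 1 if a message was received, 0 on timeout, -1 on an MPI error.
 * Error codes are only returned if MPI_ERRORS_RETURN is set on comm. */
int recv_int_with_timeout(int *value, int source, int tag,
                          MPI_Comm comm, double timeout_s)
{
    MPI_Request req;
    MPI_Status status;
    int done = 0;

    if (MPI_Irecv(value, 1, MPI_INT, source, tag, comm, &req) != MPI_SUCCESS)
        return -1;

    double deadline = MPI_Wtime() + timeout_s;
    while (!done) {
        /* Busy polling; a real application would do useful work here. */
        if (MPI_Test(&req, &done, &status) != MPI_SUCCESS)
            return -1;
        if (!done && MPI_Wtime() >= deadline) {
            int cancelled = 0;
            /* Mark the receive for cancellation, then complete it. */
            MPI_Cancel(&req);
            MPI_Wait(&req, &status);
            MPI_Test_cancelled(&status, &cancelled);
            /* If the cancel lost the race, the message did arrive. */
            return cancelled ? 0 : 1;
        }
    }
    return 1;
}
```

A caller could treat a return value of 0 as "peer possibly unreachable" and decide whether to retry or start its own recovery logic.
    -->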
    <item>
      <title>Re: How does Intel MPI handle network failures</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/How-does-Intel-MPI-handle-network-failures/m-p/901549#M2215</link>
      <description>&lt;B&gt;&lt;A rel="/en-us/services/profile/quick_profile.php?is_paid=&amp;amp;user_id=423452" class="basic" href="https://community.intel.com/../profile/423452/"&gt;Dmitry Kuzmin &lt;/A&gt;&lt;/B&gt;, &lt;BR /&gt;&lt;B&gt;&lt;BR /&gt;I'm working on a fault-tolerant MPI program. &lt;BR /&gt;&lt;BR /&gt;Suppose this situation: you have two nodes, A and B, and A sends messages to B. At some point B fails after a message from A to B has already been sent. In this case, TCP retries the transmission a limited number of times, 15 by default. I had to increase this number, because once the 15 retries are exhausted, TCP raises an error and passes it to the MPI layer. I noticed that Intel MPI then aborts the application. When I increased the tcp_retries2 variable on Linux to 60000, my fault-tolerance mechanism worked until the end of the application. This works for a small number of processes. For 128 processes (divided across 16 nodes) and 256 processes (divided across 32 nodes).&lt;BR /&gt;&lt;BR /&gt;So, my question is: do you know if there is any way to make Intel MPI not abort when it receives a TCP error because a connection was closed? &lt;BR /&gt;&lt;BR /&gt;The error is: &lt;/B&gt;Assertion failed in file ../../socksm.c at line 2573: (it_plfd-&amp;gt;revents &amp;amp; 0x008) == 0&lt;BR /&gt;internal ABORT - process 170&lt;BR /&gt;&lt;B&gt;&lt;BR /&gt;Another question: do you know a good way to clean all the communication channels in MPI? I wrote a cleanup procedure that calls MPI_Probe and receives the messages that had not been received earlier because of a failed node. &lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Hugs,&lt;BR /&gt;matheusbersot.&lt;/B&gt;</description>
      <pubDate>Fri, 22 Apr 2011 13:54:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/How-does-Intel-MPI-handle-network-failures/m-p/901549#M2215</guid>
      <dc:creator>matheusbersot</dc:creator>
      <dc:date>2011-04-22T13:54:23Z</dc:date>
    </item>
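    <!-- Editorial sketch, not part of the original thread.

To put the tcp_retries2 values from the post above into perspective, the retry count can be converted into an approximate give-up time. The sketch assumes the model described in the Linux kernel documentation: a retransmission timeout that starts at 200 ms, doubles after each attempt, and is capped at 120 s. The function name is hypothetical, and real connections start from a measured timeout, so the result is only a lower bound.

```c
#include <assert.h>

/* Hypothetical helper, not part of any library: estimates how long Linux TCP
 * keeps retransmitting before reporting an error, in milliseconds. */
long long tcp_retries2_timeout_ms(int retries)
{
    const long long rto_min_ms = 200;     /* assumed minimum timeout */
    const long long rto_max_ms = 120000;  /* assumed maximum timeout */
    long long rto = rto_min_ms;
    long long total = 0;

    assert(retries >= 0);

    /* One wait after the original send plus one per retransmission. */
    for (int i = 0; i <= retries; i++) {
        total += rto;
        if (rto < rto_max_ms) {
            rto *= 2;               /* exponential backoff */
            if (rto > rto_max_ms)
                rto = rto_max_ms;   /* clamp to the cap */
        }
    }
    return total;
}
```

Under these assumptions the default of 15 retries gives 924600 ms, roughly 15 minutes, before TCP hands the error to the MPI layer. Raising the value postpones that error rather than handling it.
    -->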
    <item>
      <title>Re: How does Intel MPI handle network failures</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/How-does-Intel-MPI-handle-network-failures/m-p/901550#M2216</link>
      <description>Hi &lt;B&gt;matheusbersot,&lt;BR /&gt;&lt;/B&gt;&lt;BR /&gt;Have you tried using '-env &lt;B&gt;I_MPI_FAULT_CONTINUE&lt;/B&gt;=on'? &lt;BR /&gt;If you set this environment variable and use the &lt;B&gt;MPI_ERRORS_RETURN&lt;/B&gt; error handler, you should see this message:&lt;BR /&gt;&lt;B&gt;The error is: &lt;/B&gt;Assertion failed in file ../../socksm.c at line 2573: (it_plfd-&amp;gt;revents &amp;amp; 0x008) == 0&lt;BR /&gt;internal ABORT - process 170&lt;BR /&gt;The error state is returned in &lt;B&gt;error_string&lt;/B&gt;, and program execution can continue.&lt;BR /&gt;&lt;BR /&gt;Regards!&lt;BR /&gt; Dmitry&lt;BR /&gt;</description>
      <pubDate>Mon, 25 Apr 2011 07:08:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/How-does-Intel-MPI-handle-network-failures/m-p/901550#M2216</guid>
      <dc:creator>Dmitry_K_Intel2</dc:creator>
      <dc:date>2011-04-25T07:08:30Z</dc:date>
    </item>
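    <!-- Editorial sketch, not part of the original thread.

The replies in this thread suggest installing the MPI_ERRORS_RETURN error handler and inspecting return codes. A minimal program along those lines might look like the following. The message layout (one int from rank 0 to rank 1) and the recovery placeholders are assumptions made for illustration. The program would be launched with I_MPI_FAULT_CONTINUE enabled as described in the reply above; whether a given network failure surfaces as a return code at all depends on the library version and fabric.

```c
#include <mpi.h>
#include <stdio.h>

/* Prints a readable description of an MPI error code. */
static void report(int rank, const char *what, int err)
{
    char msg[MPI_MAX_ERROR_STRING];
    int len = 0;
    MPI_Error_string(err, msg, &len);
    fprintf(stderr, "rank %d: %s failed: %s\n", rank, what, msg);
}

int main(int argc, char **argv)
{
    int rank = 0, size = 0, err;

    MPI_Init(&argc, &argv);

    /* Replace the default MPI_ERRORS_ARE_FATAL handler so that failed
     * calls return an error code instead of aborting the job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2 && rank == 0) {
        int payload = 42;
        err = MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        if (err != MPI_SUCCESS)
            report(rank, "MPI_Send", err);  /* recovery code would go here */
    } else if (rank == 1) {
        int payload = 0;
        err = MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
        if (err != MPI_SUCCESS)
            report(rank, "MPI_Recv", err);  /* recovery code would go here */
    }

    MPI_Finalize();
    return 0;
}
```

Checking every return value is verbose, so applications often wrap MPI calls in a small macro or helper that reports the error and triggers their own recovery path.
    -->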
  </channel>
</rss>

