<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Analyzing mpd Ring Failures in Intel® MPI Library</title>
    <link>https://community.intel.com/t5/Intel-MPI-Library/Analyzing-mpd-Ring-Failures/m-p/799716#M798</link>
    <description>Hello,&lt;BR /&gt;&lt;BR /&gt;We're using Intel MPI in an SGE cluster (tight integration). For some nodes, the jobs consistently fail with messages similar to these:&lt;BR /&gt;&lt;BR /&gt;startmpich2.sh: check for mpd daemons (2 of 10)&lt;BR /&gt;startmpich2.sh: got all 24 of 24 nodes&lt;BR /&gt;node26-05_46554 (handle_rhs_input 2425): connection with the right neighboring mpd daemon was lost; attempting to re-enter the mpd ring&lt;BR /&gt;...&lt;BR /&gt;node26-22_42619: connection error in connect_lhs call: Connection refused&lt;BR /&gt;node26-22_42619 (connect_lhs 777): failed to connect to the left neighboring daemon at node26-23 40826&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Is there a good way to debug this and find out where the actual problem is? There seems to be a communication problem, but I do not know where.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Thanks for any insight,&lt;BR /&gt;&lt;BR /&gt;A.&lt;BR /&gt;</description>
    <pubDate>Mon, 14 Jun 2010 16:01:55 GMT</pubDate>
    <dc:creator>Ansgar_Esztermann</dc:creator>
    <dc:date>2010-06-14T16:01:55Z</dc:date>
    <item>
      <title>Analyzing mpd Ring Failures</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Analyzing-mpd-Ring-Failures/m-p/799716#M798</link>
      <description>Hello,&lt;BR /&gt;&lt;BR /&gt;We're using Intel MPI in an SGE cluster (tight integration). For some nodes, the jobs consistently fail with messages similar to these:&lt;BR /&gt;&lt;BR /&gt;startmpich2.sh: check for mpd daemons (2 of 10)&lt;BR /&gt;startmpich2.sh: got all 24 of 24 nodes&lt;BR /&gt;node26-05_46554 (handle_rhs_input 2425): connection with the right neighboring mpd daemon was lost; attempting to re-enter the mpd ring&lt;BR /&gt;...&lt;BR /&gt;node26-22_42619: connection error in connect_lhs call: Connection refused&lt;BR /&gt;node26-22_42619 (connect_lhs 777): failed to connect to the left neighboring daemon at node26-23 40826&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Is there a good way to debug this and find out where the actual problem is? There seems to be a communication problem, but I do not know where.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Thanks for any insight,&lt;BR /&gt;&lt;BR /&gt;A.&lt;BR /&gt;</description>
      <pubDate>Mon, 14 Jun 2010 16:01:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Analyzing-mpd-Ring-Failures/m-p/799716#M798</guid>
      <dc:creator>Ansgar_Esztermann</dc:creator>
      <dc:date>2010-06-14T16:01:55Z</dc:date>
    </item>
    <item>
      <title>Analyzing mpd Ring Failures</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Analyzing-mpd-Ring-Failures/m-p/799717#M799</link>
      <description>&lt;DIV id="tiny_quote"&gt;
                &lt;DIV style="margin-left: 2px; margin-right: 2px;"&gt;Quoting &lt;A rel="/en-us/services/profile/quick_profile.php?is_paid=&amp;amp;user_id=481516" class="basic" href="https://community.intel.com/en-us/profile/481516/"&gt;Ansgar Esztermann&lt;/A&gt;&lt;/DIV&gt;
                &lt;DIV style="border: 1px inset; padding: 5px; background-color: #e5e5e5; margin-left: 2px; margin-right: 2px;"&gt;&lt;I&gt;Hello,&lt;BR /&gt;&lt;BR /&gt;we're using IntelMPI in an SGE cluster (tight integration). For some nodes, the jobs consistently fail with message similar to these:&lt;BR /&gt;&lt;BR /&gt;startmpich2.sh: check for mpd daemons (2 of 10)startmpich2.sh: got all 24 of 24 nodes&lt;BR /&gt;node26-05_46554 (handle_rhs_input 2425): connection with the right neighboring mpd daemon was lost; attempting to re-enter the mpd ring&lt;BR /&gt;...&lt;BR /&gt;node26-22_42619: connection error in connect_lhs call: Connection refused&lt;BR /&gt;node26-22_42619 (connect_lhs 777): failed to connect to the left neighboring daemon at node26-23 40826&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Is there any promising way to debug this and find out where the actual problem is? There seems to be some communications problem, but I do not know where.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Thanks for any insight,&lt;BR /&gt;&lt;BR /&gt;A.&lt;BR /&gt;&lt;/I&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&lt;/P&gt;Hi,&lt;BR /&gt;&lt;BR /&gt;is this happening immediately or only after some runtime of the job? You are using Ethernet?&lt;BR /&gt;&lt;BR /&gt;-- Reuti</description>
      <pubDate>Wed, 16 Jun 2010 09:04:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Analyzing-mpd-Ring-Failures/m-p/799717#M799</guid>
      <dc:creator>reuti_at_intel</dc:creator>
      <dc:date>2010-06-16T09:04:12Z</dc:date>
    </item>
    <item>
      <title>Analyzing mpd Ring Failures</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Analyzing-mpd-Ring-Failures/m-p/799718#M800</link>
      <description>Hi Ansgar,&lt;BR /&gt;&lt;BR /&gt;You are right: this looks like a communication problem. The mpd running on node26-05 lost its connection to the ring.&lt;BR /&gt;&lt;BR /&gt;Which MPI version are you using? How often does this happen? Have you seen this problem on 4, 8, or 12 nodes?&lt;BR /&gt;&lt;BR /&gt;You can take a look at the /tmp/mpd2.logfile_username_xxxxx file on each node. You will probably find some information there.&lt;BR /&gt;&lt;BR /&gt;There is also a --verbose option; try it out.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Regards!&lt;BR /&gt; Dmitry&lt;BR /&gt;</description>
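      As a rough illustration of the log check suggested above, the following sketch lists the mpd log files for one user on the local node and prints the tail of each. The function name list_mpd_logs and the 20-line tail are assumptions made for this example, and the trailing part of the file name is matched with a wildcard because the post leaves it unspecified. Running it on every node is left to whatever remote execution mechanism the cluster permits.

```shell
# Hypothetical helper (not part of Intel MPI or SGE): show the end of
# each mpd log file found for a given user in a given directory.
list_mpd_logs() {
    dir="$1"
    user="$2"
    found=0
    # File name pattern taken from the path quoted in the post; the
    # suffix is unknown here, so it is matched with a wildcard.
    for f in "$dir"/mpd2.logfile_"$user"_*; do
        if [ -f "$f" ]; then
            found=$((found + 1))
            echo "== $f =="
            tail -n 20 "$f"
        fi
    done
    echo "log files found: $found"
}

# Check the default location for the current user on this node.
list_mpd_logs /tmp "${USER:-$(id -un)}"
```

      If no file matches, the loop body is skipped by the -f test and the function only reports a count of zero.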
      <pubDate>Wed, 16 Jun 2010 10:34:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Analyzing-mpd-Ring-Failures/m-p/799718#M800</guid>
      <dc:creator>Dmitry_K_Intel2</dc:creator>
      <dc:date>2010-06-16T10:34:34Z</dc:date>
    </item>
    <item>
      <title>Analyzing mpd Ring Failures</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Analyzing-mpd-Ring-Failures/m-p/799719#M801</link>
      <description>Hi Reuti,&lt;BR /&gt;&lt;BR /&gt;This is happening more or less immediately -- it takes no more than, say, 30 seconds from the start of the job to the first error messages.&lt;BR /&gt;Yes, we are using Ethernet. For some reason, InfiniBand jobs are working fine.&lt;BR /&gt;&lt;BR /&gt;I have learned one more thing since posting the question: something tries to make node-to-node ssh connections shortly after a job is started. These fail (we have disabled ssh for non-admin users); however, no such thing happens for IB jobs.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;A.</description>
      <pubDate>Wed, 16 Jun 2010 12:43:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Analyzing-mpd-Ring-Failures/m-p/799719#M801</guid>
      <dc:creator>Ansgar_Esztermann</dc:creator>
      <dc:date>2010-06-16T12:43:37Z</dc:date>
    </item>
    <item>
      <title>Analyzing mpd Ring Failures</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Analyzing-mpd-Ring-Failures/m-p/799720#M802</link>
      <description>Okay. As the ring is initially created fine, the problem seems to occur inside the jobscript. Do you set:&lt;BR /&gt;&lt;BR /&gt;export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"&lt;BR /&gt;&lt;BR /&gt;in the jobscript, so that the job-specific console can be reached?&lt;BR /&gt;&lt;BR /&gt;I wonder whether there is any "reconnection" facility built into `mpiexec` that I'm not aware of.&lt;BR /&gt;&lt;BR /&gt;Are you testing with the small `mpihello` program or with your final application?&lt;BR /&gt;&lt;BR /&gt;-- Reuti</description>
      <pubDate>Wed, 16 Jun 2010 13:23:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Analyzing-mpd-Ring-Failures/m-p/799720#M802</guid>
      <dc:creator>reuti_at_intel</dc:creator>
      <dc:date>2010-06-16T13:23:18Z</dc:date>
    </item>
    <item>
      <title>Analyzing mpd Ring Failures</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Analyzing-mpd-Ring-Failures/m-p/799721#M803</link>
      <description>Yes, MPD_CON_EXT is set. I am testing both with a real-world application and a simple sleep (no MPI commands or even mpiexec involved).&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;A.&lt;BR /&gt;</description>
      <pubDate>Wed, 16 Jun 2010 13:54:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Analyzing-mpd-Ring-Failures/m-p/799721#M803</guid>
      <dc:creator>Ansgar_Esztermann</dc:creator>
      <dc:date>2010-06-16T13:54:17Z</dc:date>
    </item>
    <item>
      <title>Analyzing mpd Ring Failures</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Analyzing-mpd-Ring-Failures/m-p/799722#M804</link>
      <description>At the time the Howto was written, the main console was always placed in /tmp (this was hardcoded back then). Nowadays it can be placed anywhere by means of an environment variable or a command-line option to several of the mpd/mpi* commands.&lt;BR /&gt;&lt;BR /&gt;I don't know whether something is removing the socket on the master node of the parallel job and thereby breaking the ring, but we could try relocating the main console to the job-specific directory:&lt;BR /&gt;&lt;BR /&gt;export MPD_TMPDIR=$TMPDIR&lt;BR /&gt;&lt;BR /&gt;This line has to be added to startmpich2.sh, stopmpich2.sh, and your job script (just below the line where MPD_CON_EXT is set). Maybe it will help.&lt;BR /&gt;&lt;BR /&gt;-- Reuti&lt;BR /&gt;</description>
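      The two settings discussed in this thread could sit together in the job script roughly as follows. This is only a sketch: JOB_ID, SGE_TASK_ID and TMPDIR are normally provided by SGE inside a job, and the fallback values used here are placeholders added so the snippet also runs outside one. The same MPD_TMPDIR line would go into startmpich2.sh and stopmpich2.sh, as described in the post.

```shell
# Placeholder defaults (assumptions for this example only); inside an
# SGE job these variables are already set by the scheduler.
JOB_ID="${JOB_ID:-0}"
SGE_TASK_ID="${SGE_TASK_ID:-undefined}"
TMPDIR="${TMPDIR:-/tmp}"

# Job-specific console name, as quoted earlier in the thread.
export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"

# Relocate the mpd console from /tmp to the job-specific directory.
export MPD_TMPDIR="$TMPDIR"

# Print the values so they show up in the job output for debugging.
echo "MPD_CON_EXT=$MPD_CON_EXT"
echo "MPD_TMPDIR=$MPD_TMPDIR"
```

      Echoing both values into the job output makes it easy to confirm that all three scripts agree on the console location.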
      <pubDate>Wed, 16 Jun 2010 14:38:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Analyzing-mpd-Ring-Failures/m-p/799722#M804</guid>
      <dc:creator>reuti_at_intel</dc:creator>
      <dc:date>2010-06-16T14:38:04Z</dc:date>
    </item>
    <item>
      <title>Analyzing mpd Ring Failures</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Analyzing-mpd-Ring-Failures/m-p/799723#M805</link>
      <description>Hi Dmitry,&lt;BR /&gt;&lt;BR /&gt;This is with MPI 3.2.1.009. It happens almost every time with large jobs (32 nodes with 4 cores each) and sometimes (roughly every second run) with smaller jobs (2x4 cores, 4x4 cores).&lt;BR /&gt;&lt;BR /&gt;Regards,&lt;BR /&gt;&lt;BR /&gt;A.&lt;BR /&gt;&lt;BR /&gt;[Edit: mpd does not have a --verbose option]</description>
      <pubDate>Wed, 16 Jun 2010 15:27:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Analyzing-mpd-Ring-Failures/m-p/799723#M805</guid>
      <dc:creator>Ansgar_Esztermann</dc:creator>
      <dc:date>2010-06-16T15:27:03Z</dc:date>
    </item>
    <item>
      <title>Analyzing mpd Ring Failures</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Analyzing-mpd-Ring-Failures/m-p/799724#M806</link>
      <description>Thanks, Reuti, that did it!</description>
      <pubDate>Tue, 13 Jul 2010 15:12:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Analyzing-mpd-Ring-Failures/m-p/799724#M806</guid>
      <dc:creator>Ansgar_Esztermann</dc:creator>
      <dc:date>2010-07-13T15:12:40Z</dc:date>
    </item>
  </channel>
</rss>

