Hello,
we're using IntelMPI in an SGE cluster (tight integration). For some nodes, the jobs consistently fail with messages similar to these:
startmpich2.sh: check for mpd daemons (2 of 10)
startmpich2.sh: got all 24 of 24 nodes
node26-05_46554 (handle_rhs_input 2425): connection with the right neighboring mpd daemon was lost; attempting to re-enter the mpd ring
...
node26-22_42619: connection error in connect_lhs call: Connection refused
node26-22_42619 (connect_lhs 777): failed to connect to the left neighboring daemon at node26-23 40826
Is there any promising way to debug this and find out where the actual problem is? There seems to be some communications problem, but I do not know where.
Thanks for any insight,
A.
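One way to narrow this down (a sketch, not specific to this cluster): while a failing job is still running, the mpd console commands shipped with Intel MPI 3.x / MPICH2 can be queried from the master node of the job, using the same console extension the job script sets:

export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"   # same value as in the job script
mpdtrace -l                                     # list the daemons currently in the ring (host_port)
mpdringtest 100                                 # pass a message 100 times around the ring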
1 Solution
At the time of writing the Howto, the main console was always placed in /tmp (this was hardcoded back then). Nowadays it can be directed anywhere by an environment variable or a command-line option to several of the mpd/mpi* commands.
I don't know whether something is removing the socket on the master node of the parallel job and thereby breaking the ring, but we could try relocating the main console into the job-specific directory:
export MPD_TMPDIR=$TMPDIR
This has to be added to startmpich2.sh, stopmpich2.sh, and your job script (just below where MPD_CON_EXT is set there). Maybe it will help; a sketch of the job script change follows below.
-- Reuti
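A minimal sketch of the change in the job script (the same export also goes into startmpich2.sh and stopmpich2.sh); the mpiexec line and application name are placeholders, not part of the suggestion:

#!/bin/sh
# excerpt of the job script, following the Howto's conventions
export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"   # job-specific console name (already present)
export MPD_TMPDIR=$TMPDIR                       # new: place the mpd console in SGE's per-job directory
mpiexec -n $NSLOTS ./your_application           # placeholder for the real program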
8 Replies
Quoting Ansgar Esztermann
Hello,
we're using IntelMPI in an SGE cluster (tight integration). For some nodes, the jobs consistently fail with messages similar to these:
startmpich2.sh: check for mpd daemons (2 of 10)
startmpich2.sh: got all 24 of 24 nodes
node26-05_46554 (handle_rhs_input 2425): connection with the right neighboring mpd daemon was lost; attempting to re-enter the mpd ring
...
node26-22_42619: connection error in connect_lhs call: Connection refused
node26-22_42619 (connect_lhs 777): failed to connect to the left neighboring daemon at node26-23 40826
Is there any promising way to debug this and find out where the actual problem is? There seems to be some communications problem, but I do not know where.
Thanks for any insight,
A.
is this happening immediately, or only after the job has been running for a while? Are you using Ethernet?
-- Reuti
Hi Ansgar,
You are right, this looks like a communication problem: the mpd running on node26-05 lost its connection to the ring.
What MPI version do you use? How often does this happen? Have you seen this problem on 4, 8, or 12 nodes?
You can take a look at the /tmp/mpd2.logfile_username_xxxxx file on each node; you will probably find some information there.
There is also a --verbose option; try it out.
Regards!
Dmitry
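A possible way to collect those log files from all nodes of a job without plain ssh (a sketch, assuming SGE tight integration so that qrsh -inherit works from within the job, and the log file naming mentioned above):

#!/bin/sh
# run from inside the job (e.g. at the end of the job script)
for node in $(cut -d' ' -f1 "$PE_HOSTFILE" | sort -u); do
    echo "=== $node ==="
    qrsh -inherit "$node" sh -c "cat /tmp/mpd2.logfile_${USER}*"
done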
Hi Reuti,
this is happening more or less immediately: it takes no more than, say, 30 seconds from the start of the job to the first error messages.
Yes, we are using Ethernet. For some reason, InfiniBand jobs work fine.
I have learned one more thing since posting the question: something tries to make node-to-node ssh connections shortly after a job is started. These fail (we have disabled ssh for non-admin users); however, nothing of the sort happens for IB jobs.
A.
Okay. Since the ring is initially created fine, the problem seems to occur inside the job script. Do you set:
export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"
in the jobscript, so that the job-specific console can be reached?
I wonder whether there is some "reconnection" facility built into `mpiexec` that I'm not aware of.
Are you testing with a small `mpihello` program or with your final application?
-- Reuti
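For reference, a minimal job script along the lines of the Howto might look like this (a sketch; the parallel environment name "mpich2" and the application name are placeholders):

#!/bin/sh
#$ -pe mpich2 24
#$ -cwd
# job-specific console extension, so mpiexec connects to this job's mpd ring
export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"
mpiexec -n $NSLOTS ./your_application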
Yes, MPD_CON_EXT is set. I am testing both with a real-world application and a simple sleep (no MPI commands or even mpiexec involved).
A.
At the time of writing the Howto, the main console was always placed in /tmp (this was hardcoded back then). Nowadays it can be directed anywhere by an environment variable or a command-line option to several of the mpd/mpi* commands.
I don't know whether something is removing the socket on the master node of the parallel job and thereby breaking the ring, but we could try relocating the main console into the job-specific directory:
export MPD_TMPDIR=$TMPDIR
This has to be added to startmpich2.sh, stopmpich2.sh, and your job script (just below where MPD_CON_EXT is set there). Maybe it will help.
-- Reuti
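A quick check after the change (a sketch; the exact console file name may differ, it is usually something like mpd2.console_<user>_<ext>):

ls -l "$TMPDIR"/mpd2.console_* 2>/dev/null   # the console socket should now show up here
ls -l /tmp/mpd2.console_* 2>/dev/null        # ...while nothing for this job should remain in /tmp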
Hi Dmitry,
this is with Intel MPI 3.2.1.009. It happens almost every time with large jobs (32 nodes with 4 cores each) and sometimes (roughly every other run) with smaller jobs (2x4 cores, 4x4 cores).
Regards,
A.
[Edit: mpd does not have a --verbose option]
Thanks, Reuti, that did it!