
Analyzing mpd Ring Failures

Hello,

We're using Intel MPI in an SGE cluster (tight integration). For some nodes, the jobs consistently fail with messages similar to these:

startmpich2.sh: check for mpd daemons (2 of 10)
startmpich2.sh: got all 24 of 24 nodes
node26-05_46554 (handle_rhs_input 2425): connection with the right neighboring mpd daemon was lost; attempting to re-enter the mpd ring
...
node26-22_42619: connection error in connect_lhs call: Connection refused
node26-22_42619 (connect_lhs 777): failed to connect to the left neighboring daemon at node26-23 40826


Is there any promising way to debug this and find out what the actual problem is? There seems to be some communication problem, but I do not know where.


Thanks for any insight,

A.
1 Solution
At the time the Howto was written, the main console was always placed in /tmp (this was hardcoded back then). Nowadays it can be directed anywhere by an environment variable or a command-line option to several of the mpd/mpi* commands.

I don't know whether anything is removing the socket on the master node of the parallel job, which would break the ring, but we could try to relocate the main console into the job-specific directory:

export MPD_TMPDIR=$TMPDIR

which has to be added to startmpich2.sh, stopmpich2.sh, and your job script (just below where MPD_CON_EXT is set there). Maybe it will help.
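Putting this together with the MPD_CON_EXT setting discussed earlier in the thread, the job script could then look like this sketch (the PE name and the binary are placeholders; adjust them to your setup):

```shell
#!/bin/sh
#$ -pe mpich2 24     # placeholder PE name from the tight-integration setup
#$ -cwd

# Job-specific console extension, so this job reaches its own mpd ring
export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"

# Relocate the mpd console socket from /tmp into the job's scratch directory
export MPD_TMPDIR=$TMPDIR

mpiexec -n $NSLOTS ./your_app   # placeholder application binary
```

The same MPD_TMPDIR export would go into startmpich2.sh and stopmpich2.sh, so that all mpd-related commands look for the console in the same place.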

-- Reuti


8 Replies

Hi,

Is this happening immediately, or only after the job has been running for a while? Are you using Ethernet?

-- Reuti
Dmitry_K_Intel2
Employee
Hi Ansgar,

You are right - this issue looks like a communication problem. The mpd running on node26-05 lost its connection to the ring.

What MPI version do you use? How often does this happen? Have you seen this problem on 4, 8, or 12 nodes?

You can take a look at the /tmp/mpd2.logfile_username_xxxxx file on each node. Probably you'll find some information there.
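To inspect that logfile on every node of a job, one could generate a per-host command list from SGE's $PE_HOSTFILE, for example with a sketch like this (it assumes `qrsh -inherit` is allowed by the tight integration; substitute ssh where that is permitted):

```shell
#!/bin/sh
# Sketch: emit one inspection command per node of the job, based on
# SGE's $PE_HOSTFILE (one "host slots queue processor" line per node).
# Run the printed commands by hand, or pipe the output into sh, to
# read the tail of each node's mpd logfile.
while read host _; do
  echo "qrsh -inherit $host tail -n 20 /tmp/mpd2.logfile_${USER}_"'*'
done < "${PE_HOSTFILE}"
```

The wildcard is left unexpanded on purpose, so it is expanded on the remote node where the logfile actually lives.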

There is also a --verbose option - try it out.


Regards!
Dmitry
Hi Reuti,

This is happening more or less immediately -- it takes no more than, say, 30 seconds from the start of the job to the first error messages.
Yes, we are using Ethernet. For some reason, InfiniBand jobs work fine.

I have learned one more thing since posting the question: something tries to make node-to-node ssh connections shortly after a job is started. These fail (we have disabled ssh for non-admin users); however, no such thing happens for IB jobs.


A.
Okay. As the ring is initially created fine, it seems to happen inside the job script. Do you set:

export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"

in the jobscript, so that the job-specific console can be reached?

I wonder whether there is any "reconnection" facility built into `mpiexec` that I'm not aware of.

Do you test with the small `mpihello` program or your final application?

-- Reuti
Yes, MPD_CON_EXT is set. I am testing both with a real-world application and a simple sleep (no MPI commands or even mpiexec involved).


A.

Hi Dmitry,

This is with Intel MPI 3.2.1.009. It happens almost every time with large jobs (32 nodes with 4 cores each) and sometimes (maybe every other run) with smaller jobs (2x4 cores, 4x4 cores).

Regards,

A.

[Edit: mpd does not have a --verbose option]
58 Views
Thanks, Reuti, that did it!