topic Analyzing mpd Ring Failures in Intel® MPI Library

Analyzing mpd Ring Failures

Ansgar_Esztermann — Mon, 14 Jun 2010 16:01:55 GMT

Hello,

we're using Intel MPI in an SGE cluster (tight integration). On some nodes, jobs consistently fail with messages similar to these:

startmpich2.sh: check for mpd daemons (2 of 10)
startmpich2.sh: got all 24 of 24 nodes
node26-05_46554 (handle_rhs_input 2425): connection with the right neighboring mpd daemon was lost; attempting to re-enter the mpd ring
...
node26-22_42619: connection error in connect_lhs call: Connection refused
node26-22_42619 (connect_lhs 777): failed to connect to the left neighboring daemon at node26-23 40826

Is there a promising way to debug this and find out where the actual problem is? It looks like a communication problem, but I do not know where it occurs.

Thanks for any insight,

A.

reuti_at_intel — Wed, 16 Jun 2010 09:04:12 GMT

Hi,

is this happening immediately, or only after the job has been running for a while? Are you using Ethernet?

-- Reuti

Dmitry_K_Intel2 — Wed, 16 Jun 2010 10:34:34 GMT

Hi Ansgar,

You are right, this issue looks like a communication problem. The mpd running on node26-05 lost its connection to the ring.

Which MPI version do you use? How often does this happen? Have you seen this problem on 4, 8, or 12 nodes?

You can take a look at the /tmp/mpd2.logfile_username_xxxxx file on each node. You will probably find some information there.
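
A minimal sketch of how the log files could be scanned on one node. The function name scan_mpd_logs and the search patterns are assumptions chosen to match the messages quoted in the first post; the file name pattern follows the path above, with a glob in place of the numeric suffix.

```shell
#!/bin/sh
# Sketch: search the local mpd log files of one user for ring-related errors.
# scan_mpd_logs is a hypothetical helper, not part of Intel MPI or mpd.
# Usage: scan_mpd_logs [log_dir] [username]
scan_mpd_logs() {
    log_dir="${1:-/tmp}"
    user="${2:-$(id -un)}"
    for f in "$log_dir"/mpd2.logfile_"$user"_*; do
        # Skip the unexpanded glob when no log file exists
        [ -f "$f" ] || continue
        # -H prints the file name, -n the line number of each match.
        # grep exits with 1 when nothing matches; that is not an error here.
        grep -H -n -i -E 'lost|refused|failed to connect' "$f" || true
    done
}
```

Since node-to-node ssh may be disabled for normal users, this would have to be run on each node by an administrator or from within a job.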

There is also a --verbose option; try it out.

Regards!
Dmitry

Ansgar_Esztermann — Wed, 16 Jun 2010 12:43:37 GMT

Hi Reuti,

this is happening more or less immediately -- it does not take more than, say, 30 seconds from the start of the job to the first error messages.
Yes, we are using Ethernet. For some reason, Infiniband jobs are working fine.

I have learned one more thing since posting the question: something tries to make node-to-node ssh connections shortly after a job is started. These will fail (we have disabled ssh for non-admin users); however, no such thing happens for IB jobs.

A.

reuti_at_intel — Wed, 16 Jun 2010 13:23:18 GMT

Okay. As the ring is initially created fine, it seems to happen inside the jobscript. Do you set:

export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"

in the jobscript, so that the job-specific console can be reached?

I wonder whether there is any "reconnection" facility built into `mpiexec` that I'm not aware of.

Are you testing with the small `mpihello` program or with your final application?

-- Reuti

Ansgar_Esztermann — Wed, 16 Jun 2010 13:54:17 GMT

Yes, MPD_CON_EXT is set. I am testing both with a real-world application and a simple sleep (no MPI commands or even mpiexec involved).

A.

reuti_at_intel — Wed, 16 Jun 2010 14:38:04 GMT

At the time of writing the Howto, the main console was always placed in /tmp (this was hardcoded back then). Nowadays it can be placed anywhere, using an environment variable or a command line option to several of the mpd/mpi* commands.

I don't know whether something is removing the socket on the master node of the parallel job and thereby breaking the ring, but we could try to relocate the main console into the job-specific directory:

export MPD_TMPDIR=$TMPDIR

This line has to be added to startmpich2.sh, stopmpich2.sh, and your job script (just below the line where MPD_CON_EXT is set). Maybe it will help.
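
A minimal sketch of the relevant part of the job script with both settings in place. The wrapper function setup_mpd_env and the fallback values are assumptions, added only so the fragment can be tried outside of SGE; inside a job, SGE sets JOB_ID, SGE_TASK_ID and TMPDIR itself.

```shell
#!/bin/sh
# Job script fragment (sketch). setup_mpd_env is a hypothetical wrapper;
# the two variables it exports are the ones discussed in this thread.
setup_mpd_env() {
    # Job-specific console name, as in the Howto.
    # The fallbacks (0, undefined) are assumptions for runs outside SGE.
    MPD_CON_EXT="sge_${JOB_ID:-0}.${SGE_TASK_ID:-undefined}"
    # Relocate the mpd console into the job's private scratch directory.
    # Falls back to /tmp, the old hardcoded location, when TMPDIR is unset.
    MPD_TMPDIR="${TMPDIR:-/tmp}"
    export MPD_CON_EXT MPD_TMPDIR
}

setup_mpd_env
echo "MPD_CON_EXT=$MPD_CON_EXT MPD_TMPDIR=$MPD_TMPDIR"
```

The same MPD_TMPDIR setting has to appear in startmpich2.sh and stopmpich2.sh, otherwise the start and stop procedures would look for the console in a different place than the job script.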

-- Reuti

Ansgar_Esztermann — Wed, 16 Jun 2010 15:27:03 GMT

Hi Dmitry,

this is with MPI 3.2.1.009. It happens almost every time with large jobs (32 nodes with 4 cores each) and sometimes (roughly every second run) with smaller jobs (2x4 cores, 4x4 cores).

Regards,

A.

[Edit: mpd does not have a --verbose option]

Ansgar_Esztermann — Tue, 13 Jul 2010 15:12:40 GMT

Thanks, Reuti, that did it!