Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

mpdboot - fails

atheodore
Beginner
667 Views

Hello,

I've followed the instructions from the Intel MPI Getting started documentation, but I'm having problems getting mpd running across my system. I've done the following:

1) Verified that no python / mpd processes are running on compute nodes

2) Started mpdboot from head node "mpdboot -d -v -n 20 -r ssh"

mpdboot fails with a connection error. Each time I run it, it errors out on a different node.

debug: mpd on n14 on port 43729
mpdboot_n1 (handle_mpd_output 703): Failed to establish a socket connection with n14:43729 : (111, 'Connection refused')
mpdboot_n1 (handle_mpd_output 720): failed to connect to mpd on n14

When I ssh to n14, I do see mpd running...

n14:~ # ps -ef | grep python
root 7535 1 99 Jun09 ? 18:05:45 python /opt/intel/impi/3.1/bin/mpd.py -h icn4 -p 62021 --ifhn=172.18.1.14 --ncpus=1 --myhost=icn14 --myip=172.18.1.14 -e -d -s 20

After I clean up these mpd processes on the compute nodes and re-run mpdboot, I get a connection error on a different node.

Any ideas? By the way I can connect via SSH to any of the nodes OK.
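One way to narrow this down is to pull the host and port out of the mpdboot error line and test that exact TCP endpoint, since working SSH only proves port 22 is reachable. The following is a minimal sketch, not part of Intel MPI: the function names `parse_mpd_endpoint` and `probe_tcp` are hypothetical, and the probe assumes `bash` and the coreutils `timeout` command are available on the head node.

```shell
# Extract "host port" from an mpdboot socket-failure line.
# Expects the form "... connection with HOST:PORT : (errno, 'message')".
parse_mpd_endpoint() {
    printf '%s\n' "$1" |
        sed -n 's/.*connection with \([^: ]*\):\([0-9][0-9]*\) .*/\1 \2/p'
}

# Attempt a plain TCP connect via bash's /dev/tcp, giving up after 3 seconds.
# Returns 0 if something accepted the connection, non-zero otherwise.
probe_tcp() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

line="mpdboot_n1 (handle_mpd_output 703): Failed to establish a socket connection with n14:43729 : (111, 'Connection refused')"
endpoint=$(parse_mpd_endpoint "$line")
host=${endpoint% *}
port=${endpoint#* }
echo "endpoint: $host $port"

# To test the endpoint from the head node (not run here):
#   probe_tcp "$host" "$port" && echo "open" || echo "closed or filtered"
```

If the probe fails while `ps` on the node shows mpd running, it is worth comparing the port in the error with the port the running mpd actually listens on, since a leftover mpd from an earlier attempt would be on a different port.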

Thanks,

Alex

5 Replies
atheodore
Beginner
I rebooted the compute nodes and retried the mpdboot command, and it works. I'll have to keep an eye out for this. I'm not sure what prevented it from working before.
TimP
Honored Contributor III
Before resorting to a reboot, try mpdallexit. When it works, it's the easy way to clean out all your hung processes.
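As a rough sketch of that cleanup order: try `mpdallexit` first, and only if it fails, kill leftover `mpd.py` processes node by node over ssh. The helper names `list_hosts` and `cleanup_mpd` are hypothetical, and the hosts file is assumed to list one host per line, optionally with a `:ncpus` suffix and `#` comments. Nothing here is run automatically; `cleanup_mpd` is only defined.

```shell
# Print host names from a hosts file: one host per line, ignoring blank
# lines, "#" comments, and any ":ncpus" suffix (assumed file format).
list_hosts() {
    awk '{ sub(/#.*/, "") } NF { split($1, a, ":"); print a[1] }' "$1"
}

# Shut the mpd ring down cleanly if possible; otherwise kill stragglers.
cleanup_mpd() {
    # mpdallexit asks every mpd in the ring to exit.
    mpdallexit 2>/dev/null && return 0
    # Fallback: kill any remaining mpd.py process on each listed node.
    for h in $(list_hosts "$1"); do
        ssh "$h" 'pkill -f mpd.py' </dev/null || true
    done
}

# Usage (assumes a hosts file named mpd.hosts in the current directory):
#   cleanup_mpd mpd.hosts
```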
jbuddie
Beginner
Quoting - tim18
Before resorting to a reboot, try mpdallexit. When it works, it's the easy way to clean out all your hung processes.

My mpdboot fails too. I get similar errors, but I believe the reason is that mpdboot tries to start mpd on a random port number, which always turns out to be a port blocked by the firewall. Cluster production environments are usually firewalled. Here is the error in my case:

mpdboot_blade13 (handle_mpd_output 730): Failed to establish a socket connection with blade11:36126 : (113, 'No route to host')
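The two error numbers seen in this thread point at different causes, which a small helper can make explicit. This is only an illustrative sketch: `errno_from_line` and `explain_errno` are hypothetical names, and the meanings given are the standard Linux errno values (111 is ECONNREFUSED, 113 is EHOSTUNREACH), with the firewall interpretation being a common cause rather than a certainty.

```shell
# Pull the numeric errno out of an mpdboot line ending in "(NNN, 'message')".
errno_from_line() {
    printf '%s\n' "$1" | sed -n 's/.*(\([0-9][0-9]*\),.*/\1/p'
}

# Map the Linux errno values seen in this thread to a likely cause.
explain_errno() {
    case "$1" in
        111) echo "ECONNREFUSED: host reachable, but nothing accepted the connection on that port" ;;
        113) echo "EHOSTUNREACH: no route to host, often a firewall rejecting the port" ;;
        *)   echo "unrecognized errno: $1" ;;
    esac
}

line="mpdboot_blade13 (handle_mpd_output 730): Failed to establish a socket connection with blade11:36126 : (113, 'No route to host')"
explain_errno "$(errno_from_line "$line")"
```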

This kind of error could be prevented by setting a safe port range, as in MPICH2. In mpich2-1.0.7, you specify the port range for mpd by setting the MPICH_PORT_RANGE variable to a range of ports that iptables leaves open, e.g.:

export MPICH_PORT_RANGE=50001:59999

How can one achieve the same thing in the Intel MPI environment?

TimP
Honored Contributor III

The usual way to set up for MPI communication is to ensure that you can ssh (without a password) to each node. This is needed for the initial installation of Intel MPI. Then mpdboot should work with the -r ssh option.

jbuddie
Beginner
Quoting - tim18

The usual way to set up for MPI communication is to ensure that you can ssh (without a password) to each node. This is needed for the initial installation of Intel MPI. Then mpdboot should work with the -r ssh option.

ssh works fine without a password. The question is: how do you specify the port range that mpdboot should use, so that a forked mpd does not pick a port that is firewalled/blocked? mpdboot currently uses a random port number. Are there any environment variables to set to correct this?

Case in point: MPICH2 currently uses MPICH_PORT_RANGE=port_number_0:port_number_N, and you then open these ports (the port range) on your machine via iptables for MPICH2 mpdboot to use.
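For reference, the MPICH2 side of this can be sketched as follows. The helper `port_range_rule` is a hypothetical name; it only prints the iptables command for a LOW:HIGH range rather than running it, and it assumes the mpd traffic arrives on the INPUT chain over TCP. Whether Intel MPI's mpd honors MPICH_PORT_RANGE is exactly the open question here, so this shows the MPICH2 setup only.

```shell
# Print (do not run) an iptables rule that opens a LOW:HIGH TCP port range.
port_range_rule() {
    low=${1%%:*}
    high=${1##*:}
    # Both ends must be non-empty and purely numeric.
    for v in "$low" "$high"; do
        case "$v" in
            ''|*[!0-9]*) echo "invalid range: $1" >&2; return 1 ;;
        esac
    done
    # The range must be ordered low to high.
    [ "$low" -le "$high" ] || { echo "invalid range: $1" >&2; return 1; }
    echo "iptables -A INPUT -p tcp --dport $low:$high -j ACCEPT"
}

# Same value on every node, matching the range opened in the firewall.
export MPICH_PORT_RANGE=50001:59999
port_range_rule "$MPICH_PORT_RANGE"
```

Note that `-A` appends the rule to the end of the chain; if an earlier rule already rejects the traffic, the new rule would need to be inserted ahead of it instead.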
