Hi,
Intel MPI ("intel-tc_impi_em64t-7.2.1p-008") is installed on an HPC cluster with Mellanox InfiniBand (MT47396 InfiniScale-III, Mellanox Technologies).
I'm facing an mpdboot problem here.
Initially I tried to launch mpd on all 100+ nodes and it failed. To debug, I started with only 2 nodes:
# /opt/intel/impi/3.2.1.009/bin64/mpdboot --totalnum=2 --mpd=/opt/intel/impi/3.2.1.009/bin64/mpd -d --file=mpd2hosts --ifhn=10.148.0.68 --verbose
debug: starting
running mpdallexit on NODE1
LAUNCHED mpd on NODE1 via
debug: launch cmd= env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/impi/3.2.1.009/bin64/mpd --ncpus=1 --myhost=NODE1 -e -d -s 2
debug: mpd on NODE1 on port 48174
RUNNING: mpd on NODE1
debug: info for running mpd: {'ip': '172.16.112.61', 'ncpus': 1, 'list_port': 48174, 'entry_port': '', 'host': 'NODE1', 'entry_host': '', 'ifhn': ''}
LAUNCHED mpd on NODE2 via NODE1
debug: launch cmd= rsh -n NODE2 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME HOSTTYPE=$HOSTTYPE MACHTYPE=$MACHTYPE OSTYPE=$OSTYPE /opt/intel/impi/3.2.1.009/bin64/mpd -h NODE1 -p 48174 --ifhn=10.148.0.4 --ncpus=1 --myhost=NODE2 --myip=10.148.0.4 -e -d -s 2
debug: mpd on NODE2 on port
mpdboot_NODE1 (handle_mpd_output 850): from mpd on NODE2, invalid port info:
Permission denied.
Each node has two IB IPs: ib0 for MPI communication and ib1 for data traffic to storage.
# cat mpd2hosts
NODE1
NODE2 ifhn=
#
I checked; there are no hung mpd processes on either node. What could be the issue here?
Hi Sangamesh,
You say you have two IPs on each node: ib0 and ib1. In your /etc/hosts file, are the IPs on the two subnets mapped to the same name, i.e. does NODE1 have two IP addresses associated with it? It would be great if you could post your /etc/hosts file here.
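For example, something like this (the first address is from your mpdboot debug output, the second is the ib0 address you passed to --ifhn, so treat both as placeholders) would map two subnets to the same name:
172.16.112.61   NODE1
10.148.0.68     NODE1
If both entries resolve NODE1, mpd may end up binding to the wrong interface.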
Additionally, there's no need to specify the --mpd option for mpdboot, and the -ifhn option is actually for mpiexec; most likely that's being ignored. Since you have two networks on the cluster, you can explicitly specify the ib0 one by entering IP addresses in your mpd2hosts file.
Finally, are you using rsh or ssh for your remote shell? The default for the Intel MPI Library is rsh, so please keep that in mind when running mpdboot. If you need to use ssh, simply specify "-r ssh" on the command line.
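Putting both suggestions together, the boot step might look something like this (the addresses are the ib0 IPs taken from your mpdboot output; adjust as needed for your cluster):
# cat mpd2hosts
10.148.0.68
10.148.0.4
# /opt/intel/impi/3.2.1.009/bin64/mpdboot --totalnum=2 -r ssh -d --verbose --file=mpd2hosts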
Regards,
~Gergana
Thanks Gergana,
It got resolved after (1) using the ib0 IPs in the mpd2hosts file and (2) using -r ssh as the remote shell. With rsh, however, it kept giving the same "Permission denied" error.
Hi Sangamesh,
Most likely, your other machine was denying access to the Intel MPI Library program since rsh doesn't have the correct ssh authentication keys (or any keys, for that matter).
Either way, I'm glad to hear you have it working. Let us know if you hit any other problems.
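For anyone who hits the same thing: passwordless ssh between the nodes is usually all mpdboot needs. A rough sketch, assuming a shared home directory across the nodes (host names are placeholders):
$ ssh-keygen -t rsa          (accept the defaults, empty passphrase)
$ ssh-copy-id NODE2          (or append ~/.ssh/id_rsa.pub to ~/.ssh/authorized_keys on NODE2)
$ ssh NODE2 hostname         (should print the hostname without asking for a password)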
Regards,
~Gergana
Did you mean "--ifhn" should never be used with the mpdboot command?
And, if I have two networks, (1) Ethernet "eth0" and (2) InfiniBand "ib0", is it possible to select either one of them at runtime using --ifhn=?
Gergana was not precise enough in her answer.
There are two different kinds of communication: the first is between the mpds, and the second is the MPI communication itself. If you want to use eth0 for the mpds, specify that in your mpd2hosts file. If you want to use ib0 for MPI communication, you'd better use the I_MPI_NETMASK environment variable. Something like this:
mpiexec -genv I_MPI_NETMASK ib0 -n 123 ./your_app
Read the Reference Manual for full details.
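Depending on the library version, I_MPI_NETMASK may also accept an interface class or a subnet/netmask instead of an interface name (check the Reference Manual for the exact syntax your version supports); the 10.148.0.0 subnet below is only a guess based on your ib0 addresses:
mpiexec -genv I_MPI_NETMASK ib -n 123 ./your_app
mpiexec -genv I_MPI_NETMASK 10.148.0.0/255.255.0.0 -n 123 ./your_app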
You mentioned that it doesn't work over rsh. First of all, make sure your rsh is configured properly. Check that you are able to log in to the other node, 'rsh node2', and from node2 back to node1, 'rsh node1'.
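If rsh itself answers "Permission denied", the usual place to look is ~/.rhosts (or /etc/hosts.equiv) on each node. A minimal sketch, assuming the same user account on both nodes ("username" is a placeholder):
# cat ~/.rhosts
node1 username
node2 username
The file should be owned by the user and not writable by group or others (chmod 600 ~/.rhosts).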
Regards!
Dmitry
