Intel® oneAPI HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Intel MPI connect problems

paul312
Beginner
1,740 Views

I have a three-node cluster connected by a Myrinet InfiniBand switch. All nodes share a common NFS home directory, so the startup sequence (bashrc, etc.) is identical on each node. Let's call the nodes node1, node2, and node3. I have disabled the firewalls on each server (all running the latest version of CentOS 7).

Each node runs MPI on itself without problem, e.g.

 

mpirun -n 32 -host localhost hostname

 

Now is the confusing part. Running the following works fine:

node1> mpirun -n 32 -host node2 hostname
node1> mpirun -n 32 -host node3 hostname

I can even specify multiple nodes:

node1> mpirun -n 96 -hosts localhost,node2,node3 hostname

This runs fine and I can see the output of node1, node2, and node3.

Running the same commands from node2 works with node1, but hangs when attempted with node3, e.g.

node2> mpirun -n 32 -host node3 hostname

hangs. The opposite also does not work:

node3> mpirun -n 32 -host node2 hostname

However, the following two combinations work fine:

node2> mpirun -n 32 -host node1 hostname
node3> mpirun -n 32 -host node1 hostname

 

I am unsure how to troubleshoot at this point.  Any suggestions would be gratefully received.

The oneAPI version is the latest (as of today):

Intel(R) MPI Library for Linux* OS, Version 2021.4 Build 20210831 (id: 758087adf)

Copyright 2003-2021, Intel Corporation.

11 Replies
SantoshY_Intel
Moderator
1,722 Views

Hi,

 

Thanks for reaching out to us.

 

The Intel MPI Library uses SSH to access remote nodes. If SSH prompts for a password, the MPI launch can hang.

 

So, could you please check whether you can do passwordless ssh from node2 to node3 & from node3 to node2?

 

If passwordless ssh fails, then you need to establish a passwordless SSH connection between node2 and node3 to ensure proper communication of MPI processes.

 

You can try the following:

1. Check the SSH settings.

2. Make sure that passwordless authorization by public keys is enabled and configured.
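If key-based login is not yet set up, a minimal sketch of the usual OpenSSH steps (assuming the shared NFS home directory described in the original post; file names are the OpenSSH defaults):

```shell
# Generate a key pair once, with an empty passphrase (skip if one exists).
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa

# Because the home directory is shared over NFS, authorizing the key
# locally authorizes it on every node at once.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

# Verify: each command should print a hostname without any prompt.
for h in node1 node2 node3; do
  ssh -o BatchMode=yes "$h" hostname
done
```

`-o BatchMode=yes` makes ssh fail immediately rather than prompt, which is effectively how the Hydra launcher behaves.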

 

If the issue still persists, then could you please provide the debug log for the below command on node2?

I_MPI_DEBUG=30 FI_LOG_LEVEL=debug mpirun -v -n 32 -host node3 hostname

 

Also, please provide the debug log for either of the below two commands on node2.

clck -f nodefile -Fhealth_user //for user
clck -f nodefile -Fhealth_admin //for admin

 

Thanks & Regards,

Santosh

paul312
Beginner
1,694 Views

I discovered that the vendor had made an error in the static IPv4 assignments: in the /etc/hosts file on node2, the address of node3 actually pointed to node1. Of course, I had checked that password-less ssh worked on both the 10 Gb Ethernet and the InfiniBand networks, but I hadn't noticed that when I logged into node3 from node2, the node1> prompt came up. In short, all nodes now connect via password-less ssh without error (and to the correct node!). Note that the 10 Gb Ethernet addresses referred to here are node1, node2, and node3; the InfiniBand addresses are node1-ib, node2-ib, and node3-ib.

 

The problem with node2 persists and is, if anything, stranger than before. The details are listed below, but in short: node1 connects to all three nodes without problem; node3 connects to node1 and itself without problem; node2 connects to itself, but not to node1 or node3. Firewalls are disabled (for the moment); a systemctl status firewalld confirms this, and the fact that node1 connects to all three nodes confirms it as well.
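Given the earlier /etc/hosts mix-up, one sanity check is to confirm that every hostname resolves to the same address on each node (a sketch; node names as above):

```shell
# Run on each of node1, node2, node3; the outputs should be identical,
# and each address should actually belong to the node named on that line.
for h in node1 node2 node3 node1-ib node2-ib node3-ib; do
  printf '%-10s %s\n' "$h" "$(getent hosts "$h" | awk '{print $1}')"
done
```

getent hosts goes through the same resolver order (nsswitch.conf, typically /etc/hosts first) that ssh and mpirun use, so a stale entry shows up here directly.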

 

Any idea of what to try next?

 

1. From node1

mpirun -n 96 -host localhost,node2-ib,node3-ib hostname

> works on all three hosts without error

2. from node1

mpirun -n 32 -host node2 hostname

> works without error

 

3. From node2

mpirun -n 32 -host localhost hostname

> works without problem

 

4. from node2

mpirun -n 32 -host node3 hostname

> hangs

5. The command

I_MPI_DEBUG=30 FI_LOG_LEVEL=debug mpirun -v -n 32 -host node3 hostname

> hangs and gives the output (in box) below. The "clck -f nodefile -Fhealth_user" whether run as user or root returns the error message: 

Nodefile could not be accessed: nodefile

 

(base) paulfons@tau:~>I_MPI_DEBUG=30 FI_LOG_LEVEL=debug mpirun -v -n 32 -host node3-ib  hostname
[mpiexec@tau] Launch arguments: /usr/bin/ssh -q -x node3-ib /opt/intel/oneapi/mpi/2021.4.0/bin//hydra_bstrap_proxy --upstream-host tau --upstream-port 35949 --pgid 0 --launcher ssh --launcher-number 0 --base-path /opt/intel/oneapi/mpi/2021.4.0/bin/ --tree-width 16 --tree-level 1 --time-left -1 --launch-type 2 --debug --proxy-id 0 --node-id 0 --subtree-size 1 /opt/intel/oneapi/mpi/2021.4.0/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9 
[mpiexec@tau] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on neutrino-ib (pid 20019, exit code 768)
[mpiexec@node2] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@node2] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@node2] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1062): error waiting for event
[mpiexec@tau] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1015): error setting up the bootstrap proxies
[mpiexec@node2] Possible reasons:
[mpiexec@node2] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@node2] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@node2] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@ node2] 4. Ssh bootstrap cannot launch processes on remote host. Make sure that passwordless ssh connection is established across compute hosts.
[mpiexec@node2]    You may try using -bootstrap option to select alternative launcher.
(base) paulfons@node2:~>clck -f nodefile -Fhealth_admin 
Intel(R) Cluster Checker 2021 Update 4 (build 20210910)

Nodefile could not be accessed: nodefile

 

 

mpirun -n 32 -host node1-ib hostname

> hangs with the same error below:

 

(base) me@node2:/data/temp>mpirun -n 32 -host node1  hostname
[mpiexec@node2] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on muon (pid 24420, exit code 768)
[mpiexec@node2] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@node2] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@node2] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1062): error waiting for event
[mpiexec@node2] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1015): error setting up the bootstrap proxies
[mpiexec@node2] Possible reasons:
[mpiexec@node2] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@node2] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@node2] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@node2] 4. Ssh bootstrap cannot launch processes on remote host. Make sure that passwordless ssh connection is established across compute hosts.
[mpiexec@node2]    You may try using -bootstrap option to select alternative launcher.

 

mpirun -n 32 -host node3-ib hostname

> hangs with the same error message below:

 

(base) me@node2:/data/temp>mpirun -n 32 -host node3 hostname
[mpiexec@node2] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on neutrino (pid 24383, exit code 768)
[mpiexec@node2] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@node2] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@node2] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1062): error waiting for event
[mpiexec@tau] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1015): error setting up the bootstrap proxies
[mpiexec@node2] Possible reasons:
[mpiexec@node2] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@node2] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@node2] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@node2] 4. Ssh bootstrap cannot launch processes on remote host. Make sure that passwordless ssh connection is established across compute hosts.
[mpiexec@node2]    You may try using -bootstrap option to select alternative launcher.

 

6. from node3

mpirun -n 32 -host node1 hostname

> runs without error

mpirun -n 32 -host node1-ib hostname

> runs without error

mpirun -n 32 -host node3 hostname

> runs without error

mpirun -n 32 -host node3-ib hostname

> runs without error

mpirun -n 32 -host node2-ib hostname

> runs without error

mpirun -n 32 -host node2 hostname

> runs without error

7. From node2

 

SantoshY_Intel
Moderator
1,686 Views

Hi,

 

"nodefile" is a file containing the list of available nodes in the cluster.

 

If the nodefile doesn't exist, then it will throw the below error:

nodefile.png

 

So, create a nodefile with the list of available nodes.

For example:

$cat nodefile

node1

node2

node3
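Equivalently, the nodefile can be created in one step (a sketch using the node names above; clck reads it from the current working directory):

```shell
# Write the three node names, one per line, into ./nodefile.
cat > nodefile <<'EOF'
node1
node2
node3
EOF
```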

 

Now run the below command on node2:

clck -f nodefile -Fhealth_user

 

The above command will generate clck_results.log & clck_execution_warnings.log along with some analysis as seen in the attached image.

 

Please provide us the clck_results.log & clck_execution_warnings.log

 

Thanks & Regards,

Santosh

 

paul312
Beginner
1,667 Views

Dear Santosh,

I tried running the clck process, but it appears to hang without making progress (four hours or more). I have attached the output below for reference.

  

(base) paulfons@tau:~>I_MPI_DEBUG=30 FI_LOG_LEVEL=debug mpirun -v -n 32 -host neutrino  hostname
[mpiexec@tau] Launch arguments: /usr/bin/ssh -q -x neutrino /opt/intel/oneapi/mpi/2021.4.0/bin//hydra_bstrap_proxy --upstream-host tau --upstream-port 37470 --pgid 0 --launcher ssh --launcher-number 0 --base-path /opt/intel/oneapi/mpi/2021.4.0/bin/ --tree-width 16 --tree-level 1 --time-left -1 --launch-type 2 --debug --proxy-id 0 --node-id 0 --subtree-size 1 /opt/intel/oneapi/mpi/2021.4.0/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9 
[mpiexec@tau] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on neutrino (pid 43869, exit code 768)
[mpiexec@tau] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@tau] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@tau] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1062): error waiting for event
[mpiexec@tau] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1015): error setting up the bootstrap proxies
[mpiexec@tau] Possible reasons:
[mpiexec@tau] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@tau] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@tau] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@node2] 4. Ssh bootstrap cannot launch processes on remote host. Make sure that passwordless ssh connection is established across compute hosts.
[mpiexec@node2]    You may try using -bootstrap option to select alternative launcher.
(base) paulfons@tau:~>clck -f nodefile -Fhealth_user
Intel(R) Cluster Checker 2021 Update 4 (build 20210910)

Running Collect

...

 

In addition, I immediately tried the referenced ssh command "/usr/bin/ssh -x -q node3" and it worked without error.

 

Do you have any further suggestions?

 

 

SantoshY_Intel
Moderator
1,594 Views

Hi,

 

Please try the below steps on the node2 command prompt for adding the IP addresses of node1 and node3 to the list of known hosts on node2:

 

ssh <ip address of node1>
ssh <ip address of node3>

 

The above commands might give you the below statement:

"Are you sure you want to continue connecting (yes/no/[fingerprint])?"

Answer "yes" to the above prompt.

A warning will be displayed which confirms that the IP address is added to the list of known hosts as below:

 

Warning: Permanently added <ip address of node 1>[<ip address of node3>] to the list of known hosts.

 

After a successful ssh to node1/node3, you will be at node1's/node3's command prompt; run "exit" to return to the node2 terminal.

 

After adding both IP addresses of node1 and node3 to the list of known hosts on node2, try the below command on node2:

 

mpirun -bootstrap ssh -n 6 -ppn 2 -hosts <IP address of node1>,<IP address of node 2>,<IP address of node3> hostname
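The interactive "yes" step can also be done non-interactively by pre-populating known_hosts with ssh-keyscan (a sketch; run on node2, using the host names from this thread):

```shell
# Append the peers' host keys to known_hosts so ssh never stops at the
# authenticity prompt; cover both the Ethernet and InfiniBand names.
ssh-keyscan node1 node3 node1-ib node3-ib >> ~/.ssh/known_hosts
```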

 

We tried the above steps and were able to run successfully as shown in the below screenshot.

node2.png

So, could you please try the above steps and let us know whether it resolves your issue?

 

Also, we have observed an inconsistency in the hostnames (tau & node2) shown in the debug log you provided (highlighted in yellow and green in the below screenshot).

inconsistent.png

This change of hostname while the MPI program runs might also be causing the problem. Could you please let us know whether you have any idea why it changes at your end?

 

Thanks & Regards,

Santosh

 

SantoshY_Intel
Moderator
1,479 Views

Hi,


We haven't heard back from you. Could you please provide an update on your issue? Please get back to us if the issue still persists.


Thanks & Regards,

Santosh


paul312
Beginner
1,473 Views

Dear Santosh,

I am sorry to be so tardy in replying. I did check that ssh works between the nodes. The hostname difference was me trying to be clever: the real node name is tau, but I used an editor to change the node names to node1, node2, and node3 to make the logins easier to follow. Below is what happens when I connect to node2 (tau) and try to connect to node3 or node1. Note that "-ib" denotes the InfiniBand network; a node name without a suffix is the 10 Gb Ethernet port. There are no problems logging in, but a simple "mpirun -n 32 -host node-ib hostname" hangs.

 

(base) user@node2:/data/Vasp/Cu/relax>ssh node3-ib
Last login: Tue Nov 30 12:28:58 2021 from 172.17.69.249
(base) user@node3:~>exit
logout
Connection to node3-ib closed.
(base) user@node2:/data/Vasp/Cu/relax>ssh node1-ib
Last login: Fri Dec 10 16:11:22 2021 from 172.17.69.249
(base) user@node1:~>exit
logout
Connection to node1-ib closed.
(base) user@node2:/data/Vasp/Cu/relax>ssh node3
Last login: Fri Dec 10 17:32:02 2021 from 192.168.1.3
(base) user@node3:~>exit
logout
Connection to node3 closed.
(base) user@node2:/data/Vasp/Cu/relax>ssh node1
Last login: Fri Dec 10 17:30:26 2021 from 192.168.1.3
(base) user@node1:~>exit
logout
Connection to node1 closed.
(base) user@node2:/data/Vasp/Cu/relax>ssh node3
Last login: Fri Dec 10 17:32:27 2021 from 172.17.69.3
(base) user@node3:~>ssh node2-ib
Last login: Fri Dec 10 16:11:38 2021 from 172.17.69.249
(base) user@node2:~>exit
logout
Connection to node2-ib closed.
(base) user@node3:~>ssh node2
Last login: Fri Dec 10 17:31:08 2021 from 192.168.1.4
(base) user@node2:~>exit
logout
Connection to node2 closed.
(base) user@node3:~>exit
logout
Connection to node3 closed.

SantoshY_Intel
Moderator
1,467 Views

Hi,

 

>>"There are no problems logging in, but a simple "mpirun -n 32 -host node-ib hostname" hangs."

Could you please try using the "IP address of node-ib" instead of  "node-ib"?

 

FI_PROVIDER=mlx mpirun -n 32 -host IPaddress-of-node hostname

 

For finding the IP address, use the below command on node-ib:

 

ifconfig
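On CentOS 7 minimal installs, ifconfig may be absent (it belongs to the net-tools package); the iproute2 equivalent shows the same addresses:

```shell
# List the IPv4 address of every interface; the InfiniBand interface is
# typically named ib0.
ip -4 addr show
```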

 

 

Please let us know whether "mpirun -n 32 -host IPaddress-of-node hostname" works or still hangs?

 

Thanks & Regards,

Santosh

SantoshY_Intel
Moderator
1,399 Views

Hi,


We haven't heard back from you. Could you please provide an update on your issue? Please get back to us if the issue still persists.


Thanks & Regards,

Santosh


paul312
Beginner
1,202 Views

Hi Santosh,

I am sorry for the delay in replying. In effect, I have been working around the problem, since I can use mpirun to run jobs on any combination of all three nodes from node1 and node3. The problem is that I cannot submit jobs from node2 (except to node2 itself). The mpirun command hangs from node2 when directed to run a job on node1 or node3, regardless of whether IP addresses (InfiniBand or 10 Gb Ethernet) are used. I am at a loss as to what to try next: the InfiniBand network and MPI run fine when initiated from node1 or node3, which implies that node2 is connected properly to the network. I also installed the latest oneAPI version to make sure that the software was the same on all three nodes (it ostensibly was before, but a version from earlier this year). Any ideas as to how to debug next with this new info?
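One further check that fits this asymmetry: in the logs above, mpiexec passes node2's own hostname as --upstream-host, and the remote hydra_bstrap_proxy connects back to that name. If node1 or node3 resolves it to a wrong address, launches from node2 fail while launches toward node2 still work. A sketch of the check (node names as above):

```shell
# From node2: ask each peer how it resolves node2's hostname.
# Each reported address should actually belong to node2.
for peer in node1 node3; do
  echo "=== $peer resolves $(hostname) as:"
  ssh "$peer" "getent hosts $(hostname)"
done
```

Note that `$(hostname)` expands locally on node2 before ssh runs, which is the point: it is the name mpiexec advertises to the remote proxies.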

 

Best wishes,

                        Paul

SantoshY_Intel
Moderator
1,214 Views

Hi,


We assume that your issue is resolved. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.


Thanks & Regards,

Santosh

