TL;DR:
How do I set up my system so that mpirun distributes processes over several hosts connected by a local network without throwing any errors? I suspect it fails because /opt/intel/oneapi/setvars.sh is only sourced when ssh opens a login shell, but I do not know what the alternative is.
LONGER VERSION:
I have two machines on a LAN, named server-1 and server-2, running Ubuntu 20.04.3 LTS. I have installed intel-basekit and intel-hpckit on both machines, following the guidelines provided by Intel, and have modified /etc/profile so that /opt/intel/oneapi/setvars.sh is sourced at every login. Furthermore, server-1 and server-2 share the same home directory (server-2 auto-mounts server-1:/home as its own home at boot). Finally, passwordless ssh login is enabled between the two machines.
With this setup, the two machines can run MPI code independently, but they cannot distribute the workload over the network. Since the Intel® oneAPI environment is only set up at login, running Intel® oneAPI commands without a login shell (for instance ssh server-2 'mpirun -V') fails. I am not sure, but I suspect this is why I get errors when trying to distribute tasks over the two hosts.
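To make the symptom concrete (a rough illustration, assuming bash as the login shell on both machines):
ssh server-2 'which mpirun'
finds nothing, because setvars.sh was never sourced, while forcing a login shell with
ssh server-2 "bash -l -c 'which mpirun'"
locates mpirun under /opt/intel/oneapi.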
If I execute the command,
mpirun -n 2 -ppn 1 -hosts server-1,server-2 hostname
then I get the following error:
[mpiexec@server-1] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on server-2 (pid 63079, exit code 768)
[mpiexec@server-1] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@server-1] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@server-1] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1062): error waiting for event
[mpiexec@server-1] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1015): error setting up the bootstrap proxies
[mpiexec@server-1] Possible reasons:
[mpiexec@server-1] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@server-1] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@server-1] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@server-1] 4. Ssh bootstrap cannot launch processes on remote host. Make sure that passwordless ssh connection is established across compute hosts.
[mpiexec@server-1] You may try using -bootstrap option to select alternative launcher.
Do you have any ideas on how I might set up the system properly? I have looked for a solution in the documentation provided by Intel, but I haven't found anything that addresses exactly this issue. Your help is much appreciated. Thank you.
P.S. I don't know if it helps, but running the Intel® Cluster Checker 2021 with clck -f hostfile gives a process that runs for hours without finishing: it gets stuck on "Running Collect..." until I run out of patience.
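For reference, the hostfile is nothing special, just the two hostnames, one per line:
server-1
server-2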
Hi,
Thanks for reaching out to us.
In the debug log of your error, we found the following statement:
"[mpiexec@server-1] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable."
We suspect that this is a firewall issue. Could you please check whether either "firewalld" or "ufw" is enabled? If so, disable it using the steps below so that you can run the MPI program between the two machines successfully.
For more information and guidelines, please refer to the link below:
If you are using "ufw", check its status with the following command:
sudo ufw status
If the status is active, disable it with:
sudo ufw disable
To prevent ufw from starting again at boot, use:
sudo systemctl disable ufw
Verify that ufw is disabled:
sudo ufw status
sudo systemctl status ufw
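If you are using "firewalld" instead, the equivalent steps are along these lines:
Check its status:
sudo systemctl status firewalld
Stop it for the current session:
sudo systemctl stop firewalld
Prevent it from starting at boot:
sudo systemctl disable firewalld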
You should now be able to run MPI programs between the two machines successfully.
If this resolves your issue, please accept it as a solution; this would help others with a similar issue.
Thank you!
Best Regards,
Santosh
Thank you for your answer. Flushing all the iptables rules does solve this issue. Is there a firewall-friendly solution as well?
Hi,
Glad to know that your issue is resolved.
Your debug log suggests that the MPI ranks cannot communicate with each other because the firewall blocks the MPI traffic.
The link below describes three methods to solve this problem; methods 2 and 3 are firewall-friendly.
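For example, one firewall-friendly approach, in line with the hint in the error message itself, is to pin the MPI ports to a fixed range and open only that range instead of disabling the firewall entirely (a sketch; the range 50000:50100 is only an example and may need adjusting for your environment):
export I_MPI_PORT_RANGE=50000:50100
sudo ufw allow 50000:50100/tcp
Both settings need to be applied on both hosts.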
Thanks & Regards,
Santosh
Hi,
Thanks for accepting our solution. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.
Thanks & Regards,
Santosh