Intel® MPI Library

oneAPI issue on cluster (Unable to run bstrap_proxy)

Mehul2
Beginner

Hello

I have been trying to run the SU2 CFD solver on my lab's cluster using the oneapi/2022.3 MPI. However, I am facing an error whenever I try to run on more than one node. I found some similar errors on the forum, but none of the solutions mentioned there worked for me. Kindly help me in this regard. I am attaching the error I am encountering.

[mpiexec@cn031] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on cn032 (pid 68544, exit code 256)
[mpiexec@cn031] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@cn031] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@cn031] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1061): error waiting for event
[mpiexec@cn031] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1027): error setting up the bootstrap proxies
[mpiexec@cn031] Possible reasons:
[mpiexec@cn031] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@cn031] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@cn031] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@cn031] 4. pbs bootstrap cannot launch processes on remote host. You may try using -bootstrap option to select alternative launcher.
cp: cannot stat ‘restart_flow.dat’: No such file or directory
Abort(1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

TobiasK
Moderator

@Mehul2 

Did you follow the prerequisites listed here?
https://www.intel.com/content/www/us/en/docs/mpi-library/developer-guide-linux/2021-11/installation-and-prerequisites.html

 

You have to make sure that you can either use password-less ssh between all nodes of the cluster or set up a workload manager such as Slurm, PBS Pro, etc.
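
A quick way to check the password-less ssh part (a rough sketch; the hosts file name is just a placeholder for your own node list):

while read -r host; do
    ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" hostname || echo "ssh to $host failed"
done < hosts.txt

BatchMode=yes makes ssh fail instead of prompting, so any node that still requires interactive authentication shows up immediately.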

Mehul2
Beginner

Thanks for the reply

I will look into the prerequisites and get back to you.

It's password-less ssh between the nodes, and I am submitting my job using PBS (pbs_version = 20.0.1).

TobiasK
Moderator

Please try:
I_MPI_HYDRA_BOOTSTRAP=ssh
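
For example, exported in your job script before the launch (a sketch; the process count and the binary/config names are placeholders):

export I_MPI_HYDRA_BOOTSTRAP=ssh
mpirun -np 256 SU2_CFD your_case.cfg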

 

Mehul2
Beginner

I have tried exporting I_MPI_HYDRA_BOOTSTRAP=ssh.

Now no error is produced, but the simulation job is not producing any results. It shows a running status in qstat. When I logged in to the nodes allotted to the job by PBS and ran the "top" command, it showed no job running on them.
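
To illustrate the check (a rough sketch; it assumes $PBS_NODEFILE lists the allotted nodes and that the solver binary is SU2_CFD):

for host in $(sort -u $PBS_NODEFILE); do
    echo "== $host =="
    ssh "$host" "ps -ef | grep [S]U2_CFD"
done

Nothing related to the solver shows up on any of the nodes.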

I am attaching the Job Script file I am using to run the simulation.

TobiasK
Moderator

Sorry, without any error message there is little I can do.

Please try to run the IMB-MPI1 benchmarks; if those succeed, the problem is somewhere else in your configuration.

mpirun -np 512 IMB-MPI1

 

You can also add I_MPI_DEBUG=10 to get some more debug output.
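
For example, set as environment variables before the launch (a sketch of the relevant job-script lines):

export I_MPI_HYDRA_BOOTSTRAP=ssh
export I_MPI_DEBUG=10
mpirun -np 512 IMB-MPI1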

Mehul2
Beginner

Hello
Sorry for the late reply.
I ran the benchmark you mentioned using the job script:

#!/bin/bash
#PBS -N SU2_ROTOR_n4.128
#PBS -q AMD_Q
#PBS -l select=4:ncpus=128
#PBS -l walltime=96:00:00
#PBS -o /work/home/bakhshi/SU2/Test/Rotor_Scaleup/multi_nodes/n4/cpp128
#PBS -e /work/home/bakhshi/SU2/Test/Rotor_Scaleup/multi_nodes/n4/cpp128

cd $PBS_O_WORKDIR

NODEFILE=$PBS_NODEFILE
PPN=$(cat $NODEFILE | wc -l)   # note: this counts all lines in the nodefile, i.e. total slots, not slots per node

module purge;
module load oneapi/2022.3/mpi/latest;
module load compilers/gcc/13.2.0;
module load anaconda3/2021.11;
#export I_MPI_HYDRA_IFACE="ib0"

echo $PPN
eval "$(conda shell.bash hook)";

export SU2_HOME=/work/home/bakhshi/SU2/SU2-Install

echo $NODEFILE
echo $PPN

export I_MPI_HYDRA_BOOTSTRAP=ssh
export I_MPI_DEBUG=10
mpirun -np 512 IMB-MPI1

However, the job just stays in running status after submission, although it is not actually running on the nodes or producing any output.

TobiasK
Moderator

@Mehul2


Please make sure to use the latest MPI release and a supported OS.

Please also make sure to set a clean environment without conda or anything on top.
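
For example (a sketch; the module names are placeholders for whatever your cluster provides):

module purge
module load mpi/latest      # only the Intel MPI module, no anaconda on top
which mpirun && mpirun -V   # confirm the Intel MPI mpirun is the one being picked up
env | grep '^I_MPI_'        # check for leftover I_MPI_* settings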


Mehul2
Beginner

Hello

I cleaned the environment, reinstalled the solver with the MPI, and used I_MPI_HYDRA_BOOTSTRAP=ssh during job submission, which removed the bstrap_proxy error. However, I am now facing another write error:

[mpiexec@cn031] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydra_sock_intel.c:362): write error (Bad file descriptor)
[mpiexec@cn031] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydra_sock_intel.c:362): write error (Bad file descriptor)
[mpiexec@cn031] wait_proxies_to_terminate (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:554): downstream from host cn032 exited with status 255
[mpiexec@cn031] wait_proxies_to_terminate (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:554): downstream from host cn033 exited with status 255

TobiasK
Moderator

@Mehul2


Try to verify that your PBS system is set up correctly, e.g. by running something like "hostname" through the batch system on all nodes.
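
A minimal check could look like this (a sketch; pbsdsh is PBS's own per-chunk launcher, and the resource request and path to hostname may need adjusting for your system):

#!/bin/bash
#PBS -N pbs_check
#PBS -l select=4:ncpus=1
#PBS -l walltime=00:05:00
pbsdsh /bin/hostname | sort | uniq -c

If this does not print each allocated node, the problem is in the cluster/PBS setup rather than in the MPI library.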


This error is described in our troubleshooting guide, and it usually points to a problem with your cluster setup - something I cannot help you with.

https://www.intel.com/content/www/us/en/docs/mpi-library/developer-guide-linux/2021-11/error-message-bad-file-descriptor.html


As soon as you have fixed your cluster setup, I would advise running the IMB-MPI1 benchmarks first before trying your solver.

