Hello
I have been trying to run the SU2 CFD solver on my lab's cluster using the oneapi/2022.3 MPI. However, I am facing an error whenever I try to run on more than one node. I found some similar errors on the forum, but none of the solutions mentioned there worked for me. Kindly help me in this regard. I am attaching the error I am encountering:
[mpiexec@cn031] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on cn032 (pid 68544, exit code 256)
[mpiexec@cn031] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@cn031] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@cn031] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1061): error waiting for event
[mpiexec@cn031] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1027): error setting up the bootstrap proxies
[mpiexec@cn031] Possible reasons:
[mpiexec@cn031] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@cn031] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@cn031] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@cn031] 4. pbs bootstrap cannot launch processes on remote host. You may try using -bootstrap option to select alternative launcher.
cp: cannot stat ‘restart_flow.dat’: No such file or directory
Abort(1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
Did you follow the prerequisites listed here?
https://www.intel.com/content/www/us/en/docs/mpi-library/developer-guide-linux/2021-11/installation-and-prerequisites.html
You have to make sure that you can either use password-less SSH between all nodes of the cluster or set up a workload manager like Slurm, PBS Pro, etc.
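For example, from the first node of an allocation you might check that every other node is reachable without a password prompt (the host names below are placeholders for your own compute nodes):
# Each command should print the remote hostname without asking for a password;
# BatchMode makes ssh fail immediately instead of prompting.
for host in cn032 cn033; do
    ssh -o BatchMode=yes "$host" hostname
done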
Thanks for the reply
I will look into the prerequisites and get back to you.
It is password-less SSH between the nodes, and I am submitting my job using PBS (pbs_version = 20.0.1).
Please try:
I_MPI_HYDRA_BOOTSTRAP=ssh
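For example, exported in the job script just before the mpirun call (a minimal sketch; the rank count and the SU2_CFD command line are placeholders for your actual run):
# Tell the Hydra process manager to launch its remote proxies over ssh
# instead of the PBS launcher.
export I_MPI_HYDRA_BOOTSTRAP=ssh
mpirun -np 512 SU2_CFD rotor.cfg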
I have tried to export I_MPI_HYDRA_BOOTSTRAP=ssh.
Now no error is produced, but the simulation job is not producing any results. It shows a running status in the qstat command, but when I logged in to the nodes allotted to the job by PBS and ran the "top" command, it showed no job running on those nodes.
I am attaching the job script I am using to run the simulation.
Sorry, without any error message there is little I can do.
Please try to run the IMB-MPI1 benchmarks; if those succeed, then the problem is somewhere else in your configuration.
mpirun -np 512 IMB-MPI1
You can also add I_MPI_DEBUG=10 to get some more debug output.
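For example (a minimal sketch; I_MPI_DEBUG is exported so that the setting reaches all ranks rather than being passed to the benchmark binary):
# Verbose Intel MPI startup information is printed when I_MPI_DEBUG is set.
export I_MPI_DEBUG=10
mpirun -np 512 IMB-MPI1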
Hello
Sorry for the late reply.
I ran the benchmark you mentioned using the job script:
#!/bin/bash
#PBS -N SU2_ROTOR_n4.128
#PBS -q AMD_Q
#PBS -l select=4:ncpus=128
#PBS -l walltime=96:00:00
#PBS -o /work/home/bakhshi/SU2/Test/Rotor_Scaleup/multi_nodes/n4/cpp128
#PBS -e /work/home/bakhshi/SU2/Test/Rotor_Scaleup/multi_nodes/n4/cpp128
cd $PBS_O_WORKDIR
NODEFILE=$PBS_NODEFILE
PPN=$(cat $NODEFILE | wc -l)
module purge;
module load oneapi/2022.3/mpi/latest;
module load compilers/gcc/13.2.0;
module load anaconda3/2021.11;
#export I_MPI_HYDRA_IFACE="ib0"
echo $PPN
eval "$(conda shell.bash hook)";
export SU2_HOME=/work/home/bakhshi/SU2/SU2-Install
echo $NODEFILE
echo $PPN
export I_MPI_HYDRA_BOOTSTRAP=ssh
mpirun -np 512 IMB-MPI1 I_MPI_DEBUG=10
However, the job just gets submitted and shows a running status, although it is not actually running on the nodes or producing any output.
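(Note that in the last line of the script above, I_MPI_DEBUG=10 is passed as an argument to IMB-MPI1 rather than exported as an environment variable. A corrected sketch of the final lines, with the same placeholders as above, would be:)
export I_MPI_HYDRA_BOOTSTRAP=ssh
# Export the debug level instead of passing it to IMB-MPI1, where it would be
# treated as a benchmark selection argument.
export I_MPI_DEBUG=10
mpirun -np 512 IMB-MPI1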
Please make sure to use the latest MPI release and a supported OS.
Please also make sure to set up a clean environment, without conda or anything else on top.
Hello
I cleaned the environment, reinstalled the solver with the MPI, and used I_MPI_HYDRA_BOOTSTRAP=ssh during job submission, which removed the bstrap_proxy error. However, I am now facing another write error:
[mpiexec@cn031] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydra_sock_intel.c:362): write error (Bad file descriptor)
[mpiexec@cn031] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydra_sock_intel.c:362): write error (Bad file descriptor)
[mpiexec@cn031] wait_proxies_to_terminate (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:554): downstream from host cn032 exited with status 255
[mpiexec@cn031] wait_proxies_to_terminate (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:554): downstream from host cn033 exited with status 255
Try to verify your PBS system is set up correctly, e.g. by running something like "hostname" through the batch system on all nodes.
This error is described in our troubleshooting guide, and it usually refers to a problem with your cluster setup, which is something I cannot help you with.
As soon as you have fixed your cluster setup, I would advise running the IMB-MPI1 benchmarks first, before trying your solver.
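A minimal PBS test along those lines might look like this (the queue name and resource request are placeholders copied from the script earlier in the thread):
#!/bin/bash
#PBS -N hostname_test
#PBS -q AMD_Q
#PBS -l select=4:ncpus=128
#PBS -l walltime=00:05:00
cd $PBS_O_WORKDIR
# Run "hostname" on every node assigned to the job; each allocated
# node should appear exactly once in the output.
for host in $(sort -u $PBS_NODEFILE); do
    ssh -o BatchMode=yes "$host" hostname
done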