Intel® HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Unable to run bstrap_proxy error with intel-oneapi-mpi 2021.8

Punit
Beginner
4,676 Views

Hi,

I'm trying to use mpirun from intel-oneapi-mpi 2021.8 (installed via Spack) on a simple test (mpirun -np X ls) across 2 nodes on an HPC system I have access to. It fails with the following error:

[mpiexec@node241] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on node242 (pid 19049, exit code 256)
[mpiexec@node241] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@node241] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@node241] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1061): error waiting for event
[mpiexec@node241] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1027): error setting up the bootstrap proxies
[mpiexec@node241] Possible reasons:
[mpiexec@node241] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@node241] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@node241] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@node241] 4. pbs bootstrap cannot launch processes on remote host. You may try using -bootstrap option to select alternative launcher.

 

In fact, I had this error with an earlier version of intel-oneapi-mpi as well (v2021.5), and while searching the forums I came across a post suggesting an upgrade to v2021.8. But the same error persists.

I've since downgraded to an earlier version of intel-oneapi-mpi (2021.1.1), which works fine, so it seems there is some change (or bug?) in the later versions of oneapi-mpi. Could you advise whether this is a known problem? If so, is it a bug in the latest versions, or is there a way to correct it? Also, what is the latest known version of intel-oneapi-mpi that does not exhibit this problem? I would assume it lies somewhere between 2021.1.1 (the first working version in the 2021 release line) and 2021.5 (the first version known to show the above error). I can obviously test the versions in between, but wanted to get some insight from the intel-oneapi-mpi toolkit developers/experts first.

If there's any additional information you would like please let me know.

Many thanks in advance for your help.

Kind regards,
Punit 

0 Kudos
17 Replies
ShivaniK_Intel
Moderator
4,653 Views

Hi,

 

Thanks for posting in the Intel forums.

 

Could you please provide us with a sample reproducer and the steps to reproduce the issue at our end?

 

Could you also please provide us with the following details?

1. OS

2. Job Scheduler

3. Interconnect hardware

4. FI_PROVIDER

 

Please refer to the link below for details regarding job scheduler support.

https://www.intel.com/content/www/us/en/docs/mpi-library/developer-guide-linux/2021-6/job-schedulers-support.html

 

Thanks & Regards

Shivani

 

ShivaniK_Intel
Moderator
4,600 Views

Hi,


As we did not hear back from you, could you please provide us with the details requested in my previous post?


Thanks & Regards

Shivani


Punit
Beginner
4,584 Views

Hi Shivani,

 

Thanks for your response, and apologies for my late reply. To answer your questions:

1. OS --> CentOS Linux release 7.9.2009 (Core)

2. Scheduler --> PBS (version 19.2.5.20191022141354)

 

Is there a specific command you normally use for checking interconnect hardware? For FI_PROVIDER, I can send you the fi_info output; will that be enough? Otherwise, I don't set it explicitly.

 

The example PBS submit script I used is as follows:

 

#!/bin/bash
#PBS -l walltime=24:00:00
#PBS -l select=2:ncpus=20:mpiprocs=1:mem=100mb:os=rh7
#PBS -l place=free:group=switch
#PBS -o ./run.sortie
#PBS -e ./run.erreur

cd "${PBS_O_WORKDIR}"

module load intel-oneapi-mpi/2021.5.1_ppz
export I_MPI_DEBUG=6
export I_MPI_DEBUG_OUTPUT=${PBS_O_WORKDIR}/mpi.dbg.log

MPICommand='mpirun'
#MPICommand='mpiexec'
export MPIOptions=""
export MPIOptionsMesh=""
MPIOptionsSerial=''
MPISerial='-np 1'
MPIParallel='-np'
#MPISerial='-n 1'
#MPIParallel='-n'

PROCS=40
$MPICommand $MPIOptions $MPIParallel $PROCS ls

ShivaniK_Intel
Moderator
4,535 Views

 

Hi,

 

Thanks for providing the details.

 

From the details above, we see that you are working on CentOS Linux release 7.9.2009 (Core), which is not a supported OS for Intel MPI 2021.8.

 

For more details on the OS versions supported by Intel MPI, please refer to the link below:

 

https://www.intel.com/content/www/us/en/developer/articles/system-requirements/mpi-library-system-requirements.html

 

Let us know if you face a similar issue with the supported version of the OS.

 

Thanks & Regards

Shivani

 

Punit
Beginner
4,508 Views

Hi Shivani,

 

Thanks for getting back to me, and thanks for the URL to the Intel MPI system requirements. May I ask: will the system requirements for oneAPI MPI 2021.1 be similar? I ask because a Spack installation of intel-oneapi-mpi-2021.1.1 works on the CentOS 7.9 OS, but a native intel-oneapi-mpi-2021.8 does not.

Also, I see the same problem with intel-oneapi-mpi-2021.8 on Red Hat 7 and 8 systems, still with PBS, although the PBS version is slightly different.

Kind regards,

Punit

 

ShivaniK_Intel
Moderator
4,470 Views

Hi,


Intel MPI supported the CentOS operating system up to and including version 2021.7. The latest version, 2021.8, no longer supports CentOS; similarly, RHEL 7 is not supported by Intel MPI 2021.8.


If you are encountering the issue on a supported OS, such as RHEL 8, kindly share a sample reproducer with us. This will help us investigate and resolve the issue.


Thanks & Regards

Shivani


ShivaniK_Intel
Moderator
4,413 Views

Hi,


As we did not hear back from you, could you please respond to my previous post?


Thanks & Regards

Shivani


ShivaniK_Intel
Moderator
4,362 Views

Hi,


I have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance please post a new question.


Thanks & Regards

Shivani


Gregg_S_Intel
Employee
4,011 Views

Try mpirun -bootstrap=ssh
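For reference, either the command-line flag or the corresponding environment variable selects ssh as the Hydra bootstrap launcher; a minimal sketch (host names are examples taken from this thread):

```shell
# Select ssh instead of the default (pbs) launcher; host names are examples
mpirun -bootstrap ssh -np 2 -hosts node241,node242 hostname

# Equivalent via the environment
export I_MPI_HYDRA_BOOTSTRAP=ssh
mpirun -np 2 hostname
```

This requires passwordless ssh between all nodes in the job.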

Cloud_3
Beginner
3,604 Views
 

 

Hello,

I have recently run into the same issue. If you have solved it, please share how you did it.



[mpiexec@cpunode03] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on cpunode04 (pid 8545, exit code 256)
[mpiexec@cpunode03] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@cpunode03] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@cpunode03] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1061): error waiting for event
[mpiexec@cpunode03] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1027): error setting up the bootstrap proxies
[mpiexec@cpunode03] Possible reasons:
[mpiexec@cpunode03] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@cpunode03] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@cpunode03] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@cpunode03] 4. pbs bootstrap cannot launch processes on remote host. You may try using -bootstrap option to select alternative launcher.


$which mpiexec
/opt/ohpc/pub/compiler/intel/oneapi/mpi/2021.7.1/bin/mpiexec


I tried '-bootstrap=ssh' to use SSH as the launcher, but I was not successful.




Cloud_3
Beginner
3,604 Views

My PBS script:


#PBS -N McsemInv
#
# Job Queue
#PBS -q cpu
#
# Name of output file
#PBS -o terminal.out
#
# Name of output error file
#PBS -e terminal.error
#
# Total number of nodes as MPI requested
#PBS -l nodes=4:ppn=8

echo "Starting the settings."
# Load in the Intel compiler
#module load intel/compiler
# Access the folder where the files are
cd $PBS_O_WORKDIR
sleep 10
mpiexec -np 32 ./InvMcsem >> saidasterminal.txt


####### Operating System: Linux - openSUSE 15 ############


Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   44 bits physical, 48 bits virtual
CPU(s):                          128
On-line CPU(s) list:             0-127
Thread(s) per core:              2
Core(s) per socket:              32
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD EPYC 7532 32-Core Processor

 

Gregg_S_Intel
Employee
3,601 Views

This script does not specify "mpiexec -bootstrap ssh".

 

Cloud_3
Beginner
3,586 Views
 

 

Like I said, I tried '-bootstrap=ssh' to use SSH as the launcher, but I was not successful, so I removed it.

 

 

Gregg_S_Intel
Employee
3,584 Views

As a fellow MPI user, these are the things I would check:

- Can I ssh from cpunode03 to cpunode04?

- Can I ssh from cpunode04 to cpunode03?

- Can I run an MPI test using only cpunode03?

- Can I run an MPI test using only cpunode04?
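The checks above could be scripted roughly as follows (a sketch; node names are taken from this thread, and `-bootstrap ssh` assumes passwordless ssh is set up):

```shell
# Passwordless ssh in both directions (Hydra needs this with the ssh bootstrap)
ssh cpunode03 'ssh cpunode04 true' && echo "03 -> 04 ok"
ssh cpunode04 'ssh cpunode03 true' && echo "04 -> 03 ok"

# Single-node MPI smoke test on each node in turn
mpirun -bootstrap ssh -np 2 -hosts cpunode03 hostname
mpirun -bootstrap ssh -np 2 -hosts cpunode04 hostname
```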

Cloud_3
Beginner
3,577 Views

@Gregg_S_Intel wrote:

As a fellow MPI user, these are the things I would check:

- Can I ssh from cpunode03 to cpunode04?

- Can I ssh from cpunode04 to cpunode03?

- Can I run an MPI test using only cpunode03?

- Can I run an MPI test using only cpunode04?


Yes, I have already tried all of these, including with just one node. The error is the same.

Gregg_S_Intel
Employee
3,562 Views

Typically I would be referencing PBS_NODEFILE in a PBS script.

NPROC=`wc -l < $PBS_NODEFILE`
mpirun -np $NPROC -machinefile $PBS_NODEFILE ./a.out
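The rank-count logic in the snippet above can be sanity-checked without a cluster; a minimal sketch using a hypothetical machinefile in place of $PBS_NODEFILE:

```shell
# Hypothetical machinefile standing in for $PBS_NODEFILE (one host per line)
cat > /tmp/machinefile <<'EOF'
cpunode03
cpunode04
EOF

# One rank per machinefile line, as in the PBS snippet above
NPROC=$(wc -l < /tmp/machinefile)
echo "$NPROC"
```

With ppn greater than 1, PBS repeats each host name in the nodefile, so the line count still matches the requested rank total.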

 

I would next try a single node run with I_MPI_FABRICS=shm.
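A single-node shm run might look like this (a sketch; the binary name is just a placeholder for whatever you are testing):

```shell
# Restrict Intel MPI to shared memory only, ruling the network stack out
export I_MPI_FABRICS=shm
mpirun -np 4 ./a.out
```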

 

9866666
Beginner
2,419 Views

Check the InfiniBand connection. We had the same issue: the master node did not have an InfiniBand connection, but the HPC nodes did.

Try adding this to the PBS script:

export I_MPI_HYDRA_IFACE="ib0"

 

As described here: https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/Intel-MPI-Unable-to-run-bstrap-proxy-error-setting-up-the/m-p/1379558#M9442
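One way to confirm the interface actually exists on a node before pinning Hydra to it (a sketch; interface names vary between systems):

```shell
# List network interfaces and look for an InfiniBand one (typically ib0)
ip -o link show | awk -F': ' '{print $2}'

# Only set this if the interface exists on every node in the job
export I_MPI_HYDRA_IFACE=ib0
```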

 

 
