Intel® MPI Library

Unable to run application with Intel MPI 2024

xiangzhang

System: CentOS 7.9

l_BaseKit_p_2024.0.1.46_offline.sh and l_HPCKit_p_2024.0.1.38_offline.sh are installed at $HOME/intel on a cluster.

My code is compiled with the Intel C++ compiler and the Intel MPI library.

On the cluster, each node has two sockets with 32 cores per socket.

It works well on 2 nodes (128 cores), but fails on 3 or 4 nodes.

To run my program with 16 processes, each using 16 cores for TBB parallelism, the mpirun command is:

mpirun -np 16 -f $PBS_NODEFILE -map-by numa -genv I_MPI_PIN_DOMAIN=16:numa -genv FI_PROVIDER=mlx -trace-imbalance -print-rank-map  MyProgramExe MyProgramArgs...
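
As a sanity check on the numbers above, here is a minimal sketch of the layout this command implies. The variable names are made up for illustration; the only inputs are the figures already given (two sockets of 32 cores per node, 16 processes, 16 cores per process).

```shell
#!/bin/sh
# Layout arithmetic only; nothing here talks to MPI or PBS.
cores_per_node=$((2 * 32))   # two sockets, 32 cores each
ranks=16                     # value passed to -np
cores_per_rank=16            # cores each process uses for TBB

total_cores=$((ranks * cores_per_rank))
# Round up to whole nodes.
nodes_needed=$(( (total_cores + cores_per_node - 1) / cores_per_node ))
ranks_per_node=$((cores_per_node / cores_per_rank))

echo "total cores: $total_cores, nodes needed: $nodes_needed, ranks per node: $ranks_per_node"
# prints: total cores: 256, nodes needed: 4, ranks per node: 4
```

So, if every process really gets 16 dedicated cores, the full 16-process run needs 4 nodes with 4 processes on each.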

The failure information is:

[mpiexec@node16] Error: Unable to run bstrap_proxy on node16 (pid 37098, exit code 15)
[mpiexec@node16] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:157): check exit codes error
[mpiexec@node16] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:206): poll for event error
[mpiexec@node16] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1063): error waiting for event
[mpiexec@node16] Error setting up the bootstrap proxies
[mpiexec@node16] Possible reasons:
[mpiexec@node16] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@node16] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts.
[mpiexec@node16] Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@node16] 3. Firewall refused connection.
[mpiexec@node16] Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@node16] 4. pbs bootstrap cannot launch processes on remote host.
[mpiexec@node16] You may try using -bootstrap option to select alternative launcher.
pbsdsh(): error from tm_poll() 17002
pbsdsh(): error from tm_poll() 17002
pbsdsh(): error from tm_poll() 17002

TobiasK

@xiangzhang Note:
CentOS 7.9 is no longer supported. Please refer to the system requirements and make sure you are running in a supported environment.

However, those errors suggest a connection issue between your nodes.

There is probably no issue between the first two nodes (a, b), but once you include c or d, there may be an issue between a and c, a and d, and so on. So please make sure you can run something like hostname on all nodes through your job scheduler.

Have you run successfully using the node pairs a,c / a,d / b,c / c,d?
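
One possible way to work through the pairs is sketched below. It is a dry run: it only prints one two-node `mpirun ... hostname` command per pair of hosts found in `$PBS_NODEFILE`, so you can review the commands and run them inside your job. The function names `list_pairs` and `pair_commands` are made up for this sketch, and it assumes the node file lists one host name per line (repeats are fine).

```shell
#!/bin/sh
# Print every unordered pair of the host names given as arguments, as "a,b".
list_pairs() {
    while [ "$#" -gt 1 ]; do
        first=$1
        shift
        for other in "$@"; do
            printf '%s,%s\n' "$first" "$other"
        done
    done
}

# Print one two-process test command per host pair found in the node file $1.
pair_commands() {
    hosts=$(sort -u "$1")          # drop repeated host entries
    for pair in $(list_pairs $hosts); do
        printf 'mpirun -n 2 -ppn 1 -hosts %s hostname\n' "$pair"
    done
}

# Dry run: only prints the commands, and only when a PBS node file is available.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    pair_commands "$PBS_NODEFILE"
fi
```

A pair that fails with the same bstrap_proxy error points at the node to look at. Since the error text itself suggests trying the -bootstrap option, it may also be worth repeating a failing pair with `-bootstrap ssh` appended, assuming passwordless ssh between the nodes is allowed on your cluster.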

xiangzhang

TobiasK,

Thank you for your reply.

I tested my program with both the GCC and Intel compilers, and with both the Open MPI and Intel MPI libraries. It runs fine on 4 nodes with Open MPI. Other users also use this cluster, and each job is assigned whichever nodes are available. No other user seems to have complained about connections between any two nodes. Anyway, I'll check this further.

I found that the 2022 version of the oneAPI toolkit supports CentOS 7, but I can't find a download link. It looks like I need to register the oneAPI product in IRC with an EID, KPID, or SN number, but I couldn't find a way to get one. Could you provide any information about the registration code?

TobiasK

@xiangzhang 
Sorry for the delay. Please see the following page about purchasing priority support:
https://www.intel.com/content/www/us/en/developer/tools/oneapi/commercial-base-hpc.html
