System: CentOS 7.9
l_BaseKit_p_2024.0.1.46_offline.sh and l_HPCKit_p_2024.0.1.38_offline.sh are installed at $HOME/intel on a cluster.
My code is compiled with the Intel C++ compiler and the Intel MPI library.
Each node of the cluster has two sockets with 32 cores per socket.
The program runs fine on 2 nodes (128 cores) but fails on 3 or 4 nodes.
To run my program with 16 processes, each using 16 cores for TBB parallelism, the mpirun command is:
mpirun -np 16 -f $PBS_NODEFILE -map-by numa -genv I_MPI_PIN_DOMAIN=16:numa -genv FI_PROVIDER=mlx -trace-imbalance -print-rank-map MyProgramExe MyProgramArgs...
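As a sanity check on this layout: 16 processes with 16 cores each need 256 cores, which is 4 full nodes at 64 cores per node. A minimal shell sketch of that check follows. The function names nodes_needed and allocated_nodes are hypothetical helpers (not part of Intel MPI or PBS), and the sketch assumes $PBS_NODEFILE lists one host name per line, possibly with repeats.

```shell
#!/usr/bin/env bash
# Sketch: does the requested MPI layout fit the nodes PBS allocated?
# 64 cores per node comes from the description above (2 sockets x 32 cores).

# nodes_needed NP CORES_PER_PROC CORES_PER_NODE
# Prints the minimum number of nodes, rounded up.
nodes_needed() {
    local np=$1 cores_per_proc=$2 cores_per_node=$3
    echo $(( (np * cores_per_proc + cores_per_node - 1) / cores_per_node ))
}

# allocated_nodes NODEFILE
# Prints the number of distinct host names in the file.
allocated_nodes() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Only report inside a PBS job, where PBS_NODEFILE is set and readable.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    need=$(nodes_needed 16 16 64)
    have=$(allocated_nodes "$PBS_NODEFILE")
    echo "layout needs $need node(s), job has $have"
fi
```

If the two numbers differ, the pinning domains requested with I_MPI_PIN_DOMAIN would not line up with the hardware actually assigned to the job.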
The failure information is:
[mpiexec@node16] Error: Unable to run bstrap_proxy on node16 (pid 37098, exit code 15)
[mpiexec@node16] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:157): check exit codes error
[mpiexec@node16] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:206): poll for event error
[mpiexec@node16] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1063): error waiting for event
[mpiexec@node16] Error setting up the bootstrap proxies
[mpiexec@node16] Possible reasons:
[mpiexec@node16] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@node16] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts.
[mpiexec@node16] Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@node16] 3. Firewall refused connection.
[mpiexec@node16] Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@node16] 4. pbs bootstrap cannot launch processes on remote host.
[mpiexec@node16] You may try using -bootstrap option to select alternative launcher.
pbsdsh(): error from tm_poll() 17002
pbsdsh(): error from tm_poll() 17002
pbsdsh(): error from tm_poll() 17002
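Reason 4 in the output above suggests trying the -bootstrap option, and the pbsdsh errors point at the PBS launcher. A minimal sketch of a dry run with the ssh bootstrap is shown here. It assumes passwordless ssh works between the compute nodes, which the thread does not confirm. launch_with_bootstrap is a hypothetical helper that only prints the command line, and the fallback file name nodefile is a placeholder.

```shell
#!/usr/bin/env bash
# Sketch: print (do not run) an mpirun line that uses a different bootstrap
# launcher. Assumption: passwordless ssh is available between nodes.

# launch_with_bootstrap LAUNCHER ARGS...
# Prints "mpirun -bootstrap LAUNCHER ARGS...".
launch_with_bootstrap() {
    local launcher=$1
    shift
    printf 'mpirun -bootstrap %s' "$launcher"
    printf ' %s' "$@"
    printf '\n'
}

# Start with a trivial command (hostname), one process per node, before
# retrying the real program. "nodefile" is a placeholder used outside PBS.
launch_with_bootstrap ssh -np 4 -ppn 1 -f "${PBS_NODEFILE:-nodefile}" hostname
```

The same choice can be made with the I_MPI_HYDRA_BOOTSTRAP environment variable instead of the command-line option.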
@xiangzhang Note:
CentOS 7.9 is no longer supported. Please refer to the system requirements and make sure you are in a supported environment.
However, those errors suggest a connection issue between your nodes.
There is probably no problem between the first two nodes (a, b), but once you include c or d there may be a problem between a and c, a and d, and so on. Please make sure you can run something like hostname on all nodes via your job scheduler.
Did you run successfully using the node pairs a,c / a,d / b,c / c,d?
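The pairwise check can be scripted inside a PBS job. The following is a minimal sketch, not a definitive test: node_pairs is a hypothetical helper, it assumes $PBS_NODEFILE lists one host name per line, and it assumes bash 4 or newer for mapfile.

```shell
#!/usr/bin/env bash
# Sketch: try a trivial 2-process launch on every pair of allocated nodes.

# node_pairs NODEFILE
# Prints one "a,b" line per unordered pair of distinct hosts in the file.
node_pairs() {
    local nodes i j
    mapfile -t nodes < <(sort -u "$1")
    for (( i = 0; i < ${#nodes[@]}; i++ )); do
        for (( j = i + 1; j < ${#nodes[@]}; j++ )); do
            echo "${nodes[i]},${nodes[j]}"
        done
    done
}

# Only launch when inside a PBS job and mpirun is on PATH.
if [ -n "${PBS_NODEFILE:-}" ] && command -v mpirun >/dev/null 2>&1; then
    for pair in $(node_pairs "$PBS_NODEFILE"); do
        # One process on each of the two hosts, running hostname.
        if mpirun -np 2 -ppn 1 -hosts "$pair" hostname >/dev/null 2>&1; then
            echo "OK   $pair"
        else
            echo "FAIL $pair"
        fi
    done
fi
```

A pair reported as FAIL narrows the problem down to those two hosts, whichever launcher is in use.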
TobiasK,
Thank you for the reply.
I tested my program with both the GCC and Intel compilers, and with both the Open MPI and Intel MPI libraries. It runs fine on 4 nodes with Open MPI. Other users also use this cluster, and each job is assigned whatever nodes are available. No other user seems to be complaining about connections between any two nodes. Anyway, I'll check this further.
I found that the 2022 version of the oneAPI toolkit supports CentOS 7, but I can't find a download link. It looks like I need to register a oneAPI product in IRC with an EID, KPID, or SN number, but I haven't found a way to get one. Could you provide any information about the registration code?
@xiangzhang
Sorry for the delay. Please check here for information about purchasing priority support:
https://www.intel.com/content/www/us/en/developer/tools/oneapi/commercial-base-hpc.html
