Hi, I have encountered several errors while trying to launch the precompiled IMB-MPI1 test.
The setup includes three nodes: one entry node (entry-node) with an Intel processor and two compute nodes (node1, node2) with AMD processors.
Intel(R) Xeon(R) Gold 6248R – entry-node
AMD EPYC 9124 16-Core Processor – node1, node2
Debian 12.5 is installed on all nodes.
libfabric provider: mlx
The default ssh bootstrap is used.
The firewall is stopped and disabled.
The InfiniBand connection is established through a dual-port mlx5 card, and every node can be reached successfully with ibping. IPoIB is not configured.
Host-based authentication is configured between every pair of nodes over an additional Ethernet network, although ssh prints "get_socket_address: getnameinfo 8 failed: Name or service not known" before each successful connection.
Tests are launched from the entry node.
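Both the ssh host-based authentication and the Hydra bootstrap rely on hostname resolution, which is what the getnameinfo message points at. A minimal sanity-check sketch, assuming only the hostnames mentioned in this thread (getent and hostname are standard Linux tools; the 192.0.2.10 address is a placeholder, not the real one):
# run on every node; all lookups should succeed with consistent results
hostname -f                            # FQDN the local resolver reports for this host
getent hosts entry-node node1 node2    # forward lookup of every cluster host
getent hosts 192.0.2.10                # reverse lookup (substitute the node's real IP)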
Test results:
1. On any single node curr_node in [entry-node, node1, node2],
mpirun -genv I_MPI_DEBUG=10 -n 2 -hosts curr_node ./IMB-MPI1
runs successfully and produces the full benchmark output.
2. If two nodes are specified, including the entry node,
mpirun -genv I_MPI_DEBUG=10 -n 2 -ppn 1 -hosts entry-node,node1 ./IMB-MPI1
does not reach MPI startup; see log* below.
3. If two nodes are specified, excluding the entry node,
mpirun -genv I_MPI_DEBUG=10 -n 2 -ppn 1 -hosts node2,node1 ./IMB-MPI1
the test runs up to the 16384-byte message size, then hangs; node1 aborts (signal 6) and node2 is killed (signal 9) after the message
[node1:285872:0:285872] ib_mlx5_log.c:179 Transport retry count exceeded on mlx5_1:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[node1:285872:0:285872] ib_mlx5_log.c:179 RC QP 0x7a2e wqe[0]: RDMA_READ s-- [rva 0x1484c40 rkey 0x28c427] [va 0x1cb7fd0 len 16384 lkey 0x240aab] [rqpn 0x5a63 dlid=2 sl=0 port=1 src_path_bits=0]
Additional info is in log** below; a rerun sketch with more verbose transport logging follows this list.
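For case 3, the "Transport retry count exceeded" message comes from the UCX layer underneath the mlx provider, so a rerun with more verbose UCX logging can show which devices and transports are in use before the abort. A minimal sketch under that assumption (UCX_LOG_LEVEL is a standard UCX variable, not taken from this thread; limiting IMB-MPI1 to PingPong only shortens the run):
mpirun -genv I_MPI_DEBUG=10 -genv UCX_LOG_LEVEL=debug \
       -n 2 -ppn 1 -hosts node2,node1 ./IMB-MPI1 PingPong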
log*:
[mpiexec@entry-node] Error: Unable to run bstrap_proxy on node1 (pid 4047388, exit code 768)
[mpiexec@entry-node] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:157): check exit codes error
[mpiexec@entry-node] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:206): poll for event error
[mpiexec@entry-node] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1063): error waiting for event
[mpiexec@entry-node] Error setting up the bootstrap proxies
[mpiexec@entry-node] Possible reasons:
[mpiexec@entry-node] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@entry-node] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts.
[mpiexec@entry-node] Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@entry-node] 3. Firewall refused connection.
[mpiexec@entry-node] Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@entry-node] 4. Ssh bootstrap cannot launch processes on remote host.
[mpiexec@entry-node] Make sure that passwordless ssh connection is established across compute hosts.
[mpiexec@entry-node] You may try using -bootstrap option to select alternative launcher.
get_socket_address: getnameinfo 8 failed: Name or service not known
get_socket_address: getnameinfo 8 failed: Name or service not known
[bstrap:0:1@node1] HYD_sock_connect (../../../../../src/pm/i_hydra/libhydra/sock/hydra_sock_intel.c:209): getaddrinfo returned error -2 (Name or service not known)
[bstrap:0:1@node1] main (../../../../../src/pm/i_hydra/libhydra/bstrap/src/hydra_bstrap_proxy.c:532): unable to connect to server entry-node.test.com at port 41305 (check for firewalls!)
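The getaddrinfo error above means node1 cannot resolve entry-node.test.com, which by itself is enough to break the Hydra bootstrap. A minimal sketch of the kind of /etc/hosts entries that would make all three hosts resolvable on every node (the 192.0.2.x addresses are placeholders, and the .test.com names for node1 and node2 are assumed by analogy with the entry node):
# /etc/hosts on every node (placeholder addresses)
192.0.2.10   entry-node.test.com   entry-node
192.0.2.11   node1.test.com        node1
192.0.2.12   node2.test.com        node2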
log**:
get_socket_address: getnameinfo 8 failed: Name or service not known
get_socket_address: getnameinfo 8 failed: Name or service not known
get_socket_address: getnameinfo 8 failed: Name or service not known
get_socket_address: getnameinfo 8 failed: Name or service not known
[0] MPI startup(): Intel(R) MPI Library, Version 2021.12 Build 20240213 (id: 4f55822)
[0] MPI startup(): Copyright (C) 2003-2024 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric loaded: libfabric.so.1
[0] MPI startup(): libfabric version: 1.18.1-impi
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): Load tuning file: "/opt/intel/oneapi/mpi/2021.12/opt/mpi/etc/tuning_generic_shm-ofi_mlx_hcoll.dat"
[0] MPI startup(): threading: mode: direct
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: app_threads: -1
[0] MPI startup(): threading: runtime: generic
[0] MPI startup(): threading: progress_threads: 0
[0] MPI startup(): threading: async_progress: 0
[0] MPI startup(): threading: lock_level: global
[0] MPI startup(): tag bits available: 20 (TAG_UB value: 1048575)
[0] MPI startup(): source bits available: 21 (Maximal number of rank: 2097151)
[0] MPI startup(): Number of NICs: 1
[0] MPI startup(): ===== NIC pinning on node2 =====
[0] MPI startup(): Rank Pin nic
[0] MPI startup(): 0 mlx
[0] MPI startup(): ===== CPU pinning =====
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 1073943 node2 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
[0] MPI startup(): 1 285872 node1 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
[0] MPI startup(): I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.12
[0] MPI startup(): ONEAPI_ROOT=/opt/intel/oneapi
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_BIND_WIN_ALLOCATE=localalloc
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_RETURN_WIN_MEM_NUMA=0
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=6
#----------------------------------------------------------------
# Intel(R) MPI Benchmarks 2021.7, MPI-1 part
#----------------------------------------------------------------
# Date : Wed Jul 10 15:57:52 2024
# Machine : x86_64
# System : Linux
# Release : 6.1.0-22-amd64
# Version : #1 SMP PREEMPT_DYNAMIC Debian 6.1.94-1 (2024-06-21)
# MPI Version : 3.1
# MPI Thread Environment:
# Calling sequence was:
# ./IMB-MPI1
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# PingPong
# PingPing
# Sendrecv
# Exchange
# Allreduce
# Reduce
# Reduce_local
# Reduce_scatter
# Reduce_scatter_block
# Allgather
# Allgatherv
# Gather
# Gatherv
# Scatter
# Scatterv
# Alltoall
# Alltoallv
# Bcast
# Barrier
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 1.74 0.00
1 1000 1.67 0.60
2 1000 1.66 1.21
4 1000 1.38 2.89
8 1000 1.39 5.77
16 1000 1.40 11.47
32 1000 1.53 20.86
64 1000 1.80 35.50
128 1000 1.78 71.82
256 1000 2.37 108.18
512 1000 2.63 194.34
1024 1000 3.07 333.37
2048 1000 4.05 505.51
4096 1000 5.73 715.18
8192 1000 8.79 931.74
16384 1000 14.25 1149.47
[node1:285872:0:285872] ib_mlx5_log.c:179 Transport retry count exceeded on mlx5_1:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[node1:285872:0:285872] ib_mlx5_log.c:179 RC QP 0x7a2e wqe[0]: RDMA_READ s-- [rva 0x1484c40 rkey 0x28c427] [va 0x1cb7fd0 len 16384 lkey 0x240aab] [rqpn 0x5a63 dlid=2 sl=0 port=1 src_path_bits=0]
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 285872 RUNNING AT node1
= KILLED BY SIGNAL: 6 (Aborted)
===================================================================================
@Kanat
The error indicates some issue with your network:
Transport retry count exceeded on mlx5_1:1
What does the IB network look like? Do you have a switch installed, or just back-to-back connections?
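If the topology is not known offhand, the standard infiniband-diags tools can report it from any node. A minimal sketch, assuming the infiniband-diags package is installed (it is not mentioned in this thread):
# ibnetdiscover prints every switch and HCA it can see in the fabric;
# iblinkinfo shows per-port link state, width and speed.
sudo ibnetdiscover
sudo iblinkinfo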
