I am trying to benchmark a number of clusters that we have in operation. I have no issues with the pure OmniPath cluster, but I do have issues running the MPI Benchmarks on our 10GbE-based cluster and our Mellanox cluster. I have attached the PBS Pro script as well as the output.
The output includes the mpi_info data. It's clear that we are not communicating between ports, and an error is thrown. What are the proper environment variable settings to force MPI over the 10GbE?
UCX ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy, knem/memory - no am bcopy
[proxy:0:0@lssd530-cs09] pmi cmd from fd 6: cmd=abort exitcode=1091215
[0] Abort(1091215) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
[0] MPIR_Init_thread(138)........:
[0] MPID_Init(1141)..............:
[0] MPIDI_OFI_mpi_init_hook(1647): OFI get address vector map failed
- Tags:
- 10GbE
- MPI Benchmarks
Hi John,
I ran the benchmarks on our Mellanox InfiniBand cluster and they completed without any errors.
Thanks for providing the logs. Could you please provide the logs after setting I_MPI_DEBUG=10? That would help us a lot.
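A minimal job-script excerpt with that flag set might look like the following (IMB-MPI1 and the mpirun options are just placeholders for whatever your PBS script already runs):
export I_MPI_DEBUG=10                            # verbose startup output: selected provider, fabric, pinning
mpirun -np 16 -ppn 8 ./IMB-MPI1 > imb_debug.log 2>&1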
Regards
Prasanth
I have attached the run with I_MPI_DEBUG=10 set.
My confusion centers on the proper settings for the MPI variables. The runs I have made in the past only required the fabric and interface to be defined to use Ethernet or OmniPath. With this new cluster, it does not seem to make the peer-to-peer connections.
I_MPI_FABRICS=ofi
I_MPI_NETMASK=eth
I_MPI_STATS=ilm
MPI_USE_IB=False
PBS_MPI_DEBUG=True
I_MPI_HYDRA_IFACE=eno1
I_MPI_DEBUG=10
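For reference, the kind of settings the Intel MPI 2019+/2021 reference guide describes for forcing plain TCP over one interface is sketched below; eno1 is just the interface name from my settings above, and FI_PROVIDER=tcp is the libfabric provider name (this is a sketch, not something I have verified to work on this cluster):
export I_MPI_FABRICS=shm:ofi     # shared memory inside a node, OFI between nodes (the default)
export FI_PROVIDER=tcp           # restrict libfabric to the tcp provider, i.e. the Ethernet path
export I_MPI_HYDRA_IFACE=eno1    # NIC used by the Hydra launcher for bootstrap traffic
export I_MPI_DEBUG=10            # confirms in the output which provider was actually selected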
Hi John,
Some of the environment variables you have set are not needed, as they are the default options. Also, I_MPI_NETMASK is not supported, and MPI_USE_IB is not an Intel MPI variable.
Could you unset all of these environment variables and check whether you still get any errors?
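The quickest way to do that, assuming the variables are exported in the PBS script rather than in a shell profile, is to comment out the exports or unset them just before mpirun, keeping only the debug flag:
unset I_MPI_FABRICS I_MPI_NETMASK I_MPI_STATS MPI_USE_IB PBS_MPI_DEBUG I_MPI_HYDRA_IFACE
export I_MPI_DEBUG=10            # keep this one so the output still shows what Intel MPI selects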
Regards
Prasanth
Removing the variables made no difference.
Hi John,
Could you please provide the following details of your cluster (a combined check is sketched after the list):
i) UCX version:
Command: ucx_info -v
ii) Available transports:
Command: ucx_info -d | grep Transport
iii) InfiniBand version:
Command: lspci | grep -i mellanox
iv) 10GbE interconnect hardware
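All four can be collected in one pass on a compute node, for example:
ucx_info -v                      # i) UCX version
ucx_info -d | grep Transport     # ii) available UCX transports
lspci | grep -i mellanox         # iii) InfiniBand HCA
lspci | grep -i ethernet         # iv) 10GbE NIC (ethtool <interface> also reports the link speed)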
Regards
Prasanth
ucx_info -v
# UCT version=1.9.0 revision 1d0a420
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-java --enable-cma --with-cuda --without-gdrcopy --with-verbs --without-cm --with-knem --with-rdmacm --without-rocm --without-xpmem --without-ugni --with-cuda=/usr/local/cuda-10.2
******
$ucx_info -d | grep Transport
# Transport: posix
# Transport: sysv
# Transport: self
# Transport: tcp
# Transport: cma
# Transport: knem
********
$lspci | grep -i mellanox
06:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
06:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
******
The Ethernet is the built-in 10GbE.
Hi John,
From your reply we can see that your cluster does not have all of the transports required for the mlx provider to work. As per this article (Improve Performance and Stability with Intel® MPI Library on...), mlx requires the dc, rc, and ud transports.
Could you please ask your system administrator to install these transports and check whether the error still persists?
Also, could you try the verbs provider on the InfiniBand cluster and let us know if it works? (FI_PROVIDER=verbs)
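For that test, the relevant lines in the job script would be something like the following (the mpirun/IMB-MPI1 command and process counts are placeholders for whatever your earlier runs used):
export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=verbs         # use the libfabric verbs provider instead of mlx
export I_MPI_DEBUG=10            # the startup output will show which provider was actually chosen
mpirun -np 16 -ppn 8 ./IMB-MPI1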
Regards
Prasanth
We load everything using the following:
. /utils/opt/intel/compilers_and_libraries/linux/mpi/intel64/bin/mpivars.sh
. /utils/opt/intel/mkl/bin/mklvars.sh intel64
source /utils/opt/intel/impi/2021.1.1/setvars.sh
Where are those transports being sourced from?
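A quick way to see where each piece comes from, assuming the distro/OFED ucx package and the libfabric bundled with Intel MPI, is something like:
which ucx_info                   # UCX is normally installed by the OS or Mellanox OFED, not by Intel MPI
rpm -qa | grep -i ucx            # UCX packages present on the node (RHEL/CentOS)
echo $FI_PROVIDER_PATH           # typically set by Intel MPI's vars scripts to its bundled libfabric providers
ucx_info -d | grep Transport     # transports compiled into the UCX that is actually installed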
Hi John,
We haven't heard back from you.
Have you installed the transports mentioned above that are required for mlx to work?
Did it solve the issue?
Please let us know.
Regards
Prasanth
Hi John,
We are closing this thread assuming your issue has been resolved.
We will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.
Regards
Prasanth
