I am trying to benchmark a number of clusters that we have in operation. I have no issues with the pure OmniPath cluster, but I do have issues running the MPI benchmarks on our 10GbE-based cluster and our Mellanox cluster. I have attached the PBS Pro script as well as the output.
The output includes the mpi_info data. It is clear that we are not communicating between ports, and an error is thrown. What are the proper environment variable settings to force MPI over the 10GbE?
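For context, a minimal sketch (not an official recommendation) of the variables that typically steer Intel MPI onto the libfabric TCP provider over a 10GbE link. The interface name `eth0`, the rank count, and the benchmark binary are assumptions; substitute your actual device and executable.

```shell
# Sketch: force Intel MPI onto the OFI TCP provider over a 10GbE NIC.
export I_MPI_FABRICS=shm:ofi   # shared memory within a node, OFI between nodes
export FI_PROVIDER=tcp         # libfabric: select the TCP provider
export FI_TCP_IFACE=eth0       # libfabric: pin the TCP provider to one NIC (assumed name)
export I_MPI_DEBUG=10          # print provider selection during MPI_Init

# Run only if mpirun is on PATH (keeps this sketch runnable anywhere).
command -v mpirun >/dev/null 2>&1 && mpirun -np 4 ./IMB-MPI1 PingPong
true
```

With `I_MPI_DEBUG=10` the startup banner shows which provider was actually selected, which is the quickest way to confirm the traffic is going over TCP rather than an RDMA transport.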
UCX ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy, knem/memory - no am bcopy
[proxy:0:0@lssd530-cs09] pmi cmd from fd 6: cmd=abort exitcode=1091215
 Abort(1091215) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
 MPIDI_OFI_mpi_init_hook(1647): OFI get address vector map failed
I ran the benchmarks on our Mellanox InfiniBand cluster and they completed without any errors.
Thanks for providing the logs. Could you please provide the logs after setting I_MPI_DEBUG=10? That would help us a lot.
I have attached the run with the I_MPI_DEBUG=10 set.
My confusion centers on the proper settings for the MPI variables. The runs I have made in the past only required the fabric and interface to be defined to use Ethernet or OmniPath. With this new cluster, the peer-to-peer connections do not seem to be made.
Some of the environment variables you have set are not needed, as they are the default options. Also, I_MPI_NETMASK is not supported, and MPI_USE_IB is not an Intel MPI variable.
Could you unset all the environment variables and check whether you still get any errors?
Could you please provide the following details of your cluster:
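A sketch of that clean-environment rerun. The variable names listed are the common Intel MPI / libfabric ones (unset any others you exported), and the benchmark binary and log file name are placeholders.

```shell
# Start from a clean environment: drop the Intel MPI / libfabric overrides so
# the library autodetects its defaults, then rerun with debug logging.
unset I_MPI_FABRICS I_MPI_OFI_PROVIDER FI_PROVIDER FI_TCP_IFACE I_MPI_NETMASK MPI_USE_IB
export I_MPI_DEBUG=10          # keep debug output on for the log

# Run only if mpirun is on PATH; capture stdout+stderr for attaching here.
command -v mpirun >/dev/null 2>&1 && mpirun -np 4 ./IMB-MPI1 PingPong 2>&1 | tee imb_clean.log
true
```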
i) UCX version
Command: ucx_info -v
ii) Available transports
Command: ucx_info -d | grep Transport
iii) InfiniBand version
Command: lspci | grep -i mellanox
iv) 10GbE interconnect hardware
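The checks above can be collected in one pass; the output file name `cluster_info.txt` and the `ip link` listing for the 10GbE hardware are assumptions, not part of the original request.

```shell
# Gather the requested cluster details into one file for attaching to the thread.
{
  echo "== UCX version =="
  command -v ucx_info >/dev/null 2>&1 && ucx_info -v
  echo "== Available transports =="
  command -v ucx_info >/dev/null 2>&1 && ucx_info -d | grep Transport
  echo "== Mellanox HCAs =="
  lspci 2>/dev/null | grep -i mellanox
  echo "== Network interfaces (for the 10GbE device) =="
  ip -o link 2>/dev/null
} > cluster_info.txt
echo "wrote cluster_info.txt"
```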
# UCT version=1.9.0 revision 1d0a420
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-java --enable-cma --with-cuda --without-gdrcopy --with-verbs --without-cm --with-knem --with-rdmacm --without-rocm --without-xpmem --without-ugni --with-cuda=/usr/local/cuda-10.2
$ ucx_info -d | grep Transport
# Transport: posix
# Transport: sysv
# Transport: self
# Transport: tcp
# Transport: cma
# Transport: knem
$ lspci | grep -i mellanox
06:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
06:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
The Ethernet is the built-in 10GbE adapter.
From your reply we can see that your cluster does not have all the transports required for the mlx provider to work. As per this article (Improve Performance and Stability with Intel® MPI Library on...), mlx requires the dc, rc, and ud transports.
Could you please ask your system administrator to install these transports and check if the error still persists?
Also, could you try the verbs provider for the InfiniBand cluster and let us know if it works? (FI_PROVIDER=verbs)
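For reference, a sketch combining the transport check with a verbs-provider run; the rank count and benchmark binary are placeholders. Transport names like rc_verbs, ud_mlx5, and dc_mlx5 are the usual ones UCX reports once the InfiniBand packages are installed.

```shell
# 1) Check which of the transports mlx needs (dc, rc, ud) are present.
command -v ucx_info >/dev/null 2>&1 && ucx_info -d | grep Transport | grep -E 'dc|rc|ud'

# 2) Retry on the InfiniBand cluster with the libfabric verbs provider,
#    keeping debug output on to confirm which provider actually loads.
export FI_PROVIDER=verbs
export I_MPI_DEBUG=10
command -v mpirun >/dev/null 2>&1 && mpirun -np 2 ./IMB-MPI1 PingPong
true
```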
We haven't heard back from you.
Have you installed those mentioned transports that are required for mlx to work?
Did it solve the issue?
Please let us know.
We are closing this thread assuming your issue has been resolved.
We will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community support only.