I am trying to benchmark a number of clusters that we have in operation. I have no issues with the pure OmniPath cluster, but I do have issues running the MPI Benchmarks on our 10GbE-based cluster and our Mellanox cluster. I have attached the PBS Pro script as well as the output.
The output includes the mpi_info data. It is clear that we are not communicating between ports, and the run throws an error. What are the proper environment variable settings to force MPI over the 10GbE?
UCX ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy, knem/memory - no am bcopy
[proxy:0:0@lssd530-cs09] pmi cmd from fd 6: cmd=abort exitcode=1091215
[0] Abort(1091215) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
[0] MPIR_Init_thread(138)........:
[0] MPID_Init(1141)..............:
[0] MPIDI_OFI_mpi_init_hook(1647): OFI get address vector map failed
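For example, is something like the following the right approach? (FI_PROVIDER=tcp is a guess on my part; eno1 is our 10GbE interface.)

```shell
# Sketch only: attempt to pin Intel MPI to the libfabric TCP provider
# on the 10GbE interface. FI_PROVIDER=tcp is an assumption, not a
# setting confirmed to work on this cluster.
export I_MPI_FABRICS=ofi        # use libfabric (OFI)
export FI_PROVIDER=tcp          # restrict libfabric to the TCP provider
export I_MPI_HYDRA_IFACE=eno1   # NIC used by the Hydra process manager
export I_MPI_DEBUG=10           # verbose provider/fabric selection output
```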
Tags: 10GbE, MPI Benchmarks
Hi John,
I ran the benchmarks on our Mellanox InfiniBand cluster and they completed without any errors.
Thanks for providing the logs. Could you please provide the logs after setting I_MPI_DEBUG=10? That would help us a lot.
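Something along these lines would capture what we need (the benchmark command line is illustrative, not your actual PBS script):

```shell
# Rerun the failing job with Intel MPI debug tracing enabled and
# capture stdout/stderr to a log we can inspect.
I_MPI_DEBUG=10 mpirun -np 2 -ppn 1 IMB-MPI1 PingPong 2>&1 | tee impi_debug.log
```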
Regards
Prasanth
I have attached the run with I_MPI_DEBUG=10 set.
The focus of my confusion is the proper settings for the MPI variables. In past runs I only needed to define the fabrics and the interface to use Ethernet or OmniPath. With this new cluster it does not seem to make the peer-to-peer connections.
I_MPI_FABRICS=ofi
I_MPI_NETMASK=eth
I_MPI_STATS=ilm
MPI_USE_IB=False
PBS_MPI_DEBUG=True
I_MPI_HYDRA_IFACE=eno1
I_MPI_DEBUG=10
Hi John,
Some of the environment variables you have set are not needed, as they are the default options. Also, I_MPI_NETMASK is not supported, and MPI_USE_IB is not an Intel MPI variable.
Could you unset all of these environment variables and check whether you still get any errors?
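For example, in the job script before mpirun (variable names taken from your list):

```shell
# Clear the previously exported Intel MPI / custom variables before rerunning.
unset I_MPI_FABRICS I_MPI_NETMASK I_MPI_STATS I_MPI_HYDRA_IFACE
unset MPI_USE_IB PBS_MPI_DEBUG
export I_MPI_DEBUG=10   # keep debug output so the logs stay useful
```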
Regards
Prasanth
Removing the variables made no difference.
Hi John,
Could you please provide the following details of your cluster:
i) UCX version (you can get it with the ucx_info -v command)
ii) Available transports (command: ucx_info -d | grep Transport)
iii) InfiniBand version (command: lspci | grep -i mellanox)
iv) 10GbE interconnect hardware
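If convenient, the checks above can be collected in one pass, e.g.:

```shell
# Gather the requested diagnostics into a single log file.
{
  echo "== UCX version ==";    ucx_info -v
  echo "== Transports ==";     ucx_info -d | grep Transport
  echo "== InfiniBand HCA =="; lspci | grep -i mellanox
  echo "== Ethernet NICs =="; lspci | grep -i ethernet
} > cluster_diag.log 2>&1
```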
Regards
Prasanth
ucx_info -v
# UCT version=1.9.0 revision 1d0a420
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-java --enable-cma --with-cuda --without-gdrcopy --with-verbs --without-cm --with-knem --with-rdmacm --without-rocm --without-xpmem --without-ugni --with-cuda=/usr/local/cuda-10.2
******
$ ucx_info -d | grep Transport
# Transport: posix
# Transport: sysv
# Transport: self
# Transport: tcp
# Transport: cma
# Transport: knem
********
$ lspci | grep -i mellanox
06:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
06:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
******
The Ethernet is the built in 10GbE
Hi John,
From your reply we can see that your cluster does not have all the transports required for mlx to work. As per this article (Improve Performance and Stability with Intel® MPI Library on...), mlx requires the dc, rc, and ud transports.
Could you please ask your system administrator to install these transports and check whether the error still persists?
Also, could you try the verbs provider for the InfiniBand cluster and let us know if it works? (FI_PROVIDER=verbs)
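A minimal way to try this (the benchmark command line is illustrative):

```shell
# Try the libfabric verbs provider on the InfiniBand cluster.
export FI_PROVIDER=verbs
export I_MPI_DEBUG=10
mpirun -np 2 -ppn 1 IMB-MPI1 PingPong
```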
Regards
Prasanth
We load everything using the following:
. /utils/opt/intel/compilers_and_libraries/linux/mpi/intel64/bin/mpivars.sh
. /utils/opt/intel/mkl/bin/mklvars.sh intel64
source /utils/opt/intel/impi/2021.1.1/setvars.sh
Where are those transports being sourced from?
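As a check for the dc/rc/ud transports mentioned, I can run the following on the nodes (a sketch; it just filters the ucx_info output shown above):

```shell
# List only the RDMA transports mlx is said to need (dc, rc, ud).
# On this cluster the grep currently matches nothing.
ucx_info -d | grep Transport | grep -E 'dc|rc|ud'
```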
Hi John,
We haven't heard back from you.
Have you installed the transports mentioned above that are required for mlx to work?
Did it solve the issue?
Please let us know.
Regards
Prasanth
Hi John,
We are closing this thread assuming your issue has been resolved.
We will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community-only.
Regards
Prasanth