Intel® MPI Library

MPI Benchmarks over 10GbE, InfiniBand, or OmniPath

SunSDSE
Novice

I am trying to benchmark a number of clusters that we have in operation. I have no issues with the pure OmniPath cluster, but I do have issues running the MPI Benchmarks on our 10GbE-based cluster and our Mellanox cluster. I have attached the PBS Pro script as well as the output.

The output includes the mpi_info data. It is clear that the ranks are not communicating between ports, and the run throws the error below. What are the proper environment variable settings to force MPI over the 10GbE interface?

 

UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy, knem/memory - no am bcopy

[proxy:0:0@lssd530-cs09] pmi cmd from fd 6: cmd=abort exitcode=1091215

[0] Abort(1091215) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:

[0] MPIR_Init_thread(138)........: 

[0] MPID_Init(1141)..............: 

[0] MPIDI_OFI_mpi_init_hook(1647): OFI get address vector map failed
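For reference, here is a minimal sketch of the settings that typically route Intel MPI over a TCP/Ethernet interface, assuming the tcp provider is present in the bundled libfabric. The interface name eno1 and the hostnames node1,node2 are placeholders (eno1 is taken from the variables listed later in this thread), and this is not a confirmed fix for the error above.

# Sketch only: ask libfabric for its tcp provider and pin it to the 10GbE interface.
export I_MPI_FABRICS=shm:ofi        # shared memory within a node, OFI between nodes
export FI_PROVIDER=tcp              # select the libfabric tcp provider instead of mlx/psm2
export FI_TCP_IFACE=eno1            # interface the tcp provider binds to (placeholder name)
export I_MPI_HYDRA_IFACE=eno1       # interface the Hydra process manager uses
export I_MPI_DEBUG=10               # print the selected provider at startup
mpirun -n 2 -ppn 1 -hosts node1,node2 IMB-MPI1 PingPong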

PrasanthD_intel
Moderator

Hi John,


I ran the benchmarks on our Mellanox InfiniBand cluster and they completed without any errors.

Thanks for providing the logs. Could you please provide them again after setting I_MPI_DEBUG=10? That would help us a lot.


Regards

Prasanth


SunSDSE
Novice

I have attached the run with I_MPI_DEBUG=10 set.

The focus of my confusion is the proper settings for the MPI variables. The runs I have made in the past only required the fabric and interface to be defined to use Ethernet or OmniPath. With this new cluster, the peer-to-peer connections do not seem to be made. The variables I have set are:

I_MPI_FABRICS=ofi
I_MPI_NETMASK=eth
I_MPI_STATS=ilm
MPI_USE_IB=False
PBS_MPI_DEBUG=True
I_MPI_HYDRA_IFACE=eno1
I_MPI_DEBUG=10
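For reference, a quick way to confirm which provider Intel MPI actually picked once I_MPI_DEBUG=10 is set; a sketch only, where bench.o12345 is a placeholder for the PBS output file, and the exact banner wording may vary by Intel MPI version.

# Sketch only: the I_MPI_DEBUG startup banner normally reports the chosen libfabric provider.
grep -i "libfabric provider" bench.o12345    # e.g. mlx on InfiniBand, tcp/psm2 elsewhere
grep -i "libfabric version" bench.o12345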

PrasanthD_intel
Moderator

Hi John,


Some of the environment variables you have set are not needed, as they are the default options. Also, I_MPI_NETMASK is not supported and MPI_USE_IB is not an Intel MPI variable.

Could you unset all the environment variables once and check whether you still get the errors?
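A minimal sketch of what that clean rerun could look like, assuming a two-node test with the Intel MPI Benchmarks binary; the hostnames are placeholders:

# Sketch only: drop the custom settings and rerun with just the debug flag.
unset I_MPI_NETMASK MPI_USE_IB PBS_MPI_DEBUG I_MPI_STATS I_MPI_FABRICS I_MPI_HYDRA_IFACE
export I_MPI_DEBUG=10
mpirun -n 2 -ppn 1 -hosts node1,node2 IMB-MPI1 PingPong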


Regards

Prasanth


SunSDSE
Novice

Removing the variables made no difference.

PrasanthD_intel
Moderator

Hi John,


Could you please provide the following details of your cluster:

i) UCX version

Command: ucx_info -v

ii) Available transports

Command: ucx_info -d | grep Transport

iii) InfiniBand adapter model

Command: lspci | grep -i mellanox

iv) 10GbE interconnect hardware
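If it is easier, the checks above can be gathered in one go; a minimal sketch, assuming ucx_info and lspci are available on a compute node (diag.txt is just a placeholder file name):

# Sketch only: collect the requested diagnostics into a single file.
{
  echo "== UCX version ==";        ucx_info -v
  echo "== UCX transports ==";     ucx_info -d | grep Transport
  echo "== Mellanox adapters ==";  lspci | grep -i mellanox
} > diag.txt 2>&1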


Regards

Prasanth




SunSDSE
Novice

ucx_info -v

# UCT version=1.9.0 revision 1d0a420

# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-java --enable-cma --with-cuda --without-gdrcopy --with-verbs --without-cm --with-knem --with-rdmacm --without-rocm --without-xpmem --without-ugni --with-cuda=/usr/local/cuda-10.2

******

$ ucx_info -d | grep Transport

#   Transport: posix

#   Transport: sysv

#   Transport: self

#   Transport: tcp

#   Transport: cma

#   Transport: knem

 

********

$ lspci | grep -i mellanox

06:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]

06:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]

 

******

The Ethernet is the built-in 10GbE interface.

 

PrasanthD_intel
Moderator

Hi John,


From your reply we can see that your cluster does not have all the transports required for the mlx provider to work. As per this article (Improve Performance and Stability with Intel® MPI Library on...), mlx requires the dc, rc, and ud transports.

Could you please ask your system administrator to install these transports and check if the error still persists?


Also, could you try the verbs provider for the InfiniBand cluster and let us know if it works? (FI_PROVIDER=verbs)
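A minimal sketch of both checks, assuming a two-node run of the Intel MPI Benchmarks; the hostnames are placeholders and this is not a confirmed fix:

# Sketch only: check whether the dc/rc/ud transports needed by the mlx provider are present.
ucx_info -d | grep -E "Transport: (dc|rc|ud)"

# Sketch only: try the verbs provider on the InfiniBand cluster as a quick test.
export FI_PROVIDER=verbs
export I_MPI_DEBUG=10
mpirun -n 2 -ppn 1 -hosts node1,node2 IMB-MPI1 PingPong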


Regards

Prasanth


SunSDSE
Novice

We load everything using the following:

. /utils/opt/intel/compilers_and_libraries/linux/mpi/intel64/bin/mpivars.sh

. /utils/opt/intel/mkl/bin/mklvars.sh  intel64

source /utils/opt/intel/impi/2021.1.1/setvars.sh

Where are those transports being sourced from?
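For what it is worth, a quick way to see where the UCX transports come from on a node; a sketch only, assuming a RHEL-style system where UCX and the RDMA stack are installed as system packages rather than coming from the Intel scripts sourced above:

# Sketch only: list the installed UCX/RDMA packages and the transports they expose.
rpm -qa | grep -E -i "^ucx|rdma-core|libibverbs"
ucx_info -d | grep Transport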

 

PrasanthD_intel
Moderator

Hi John,


We haven't heard back from you.

Have you installed the transports mentioned above that are required for mlx to work?

Did it solve the issue?

Please let us know.


Regards

Prasanth


PrasanthD_intel
Moderator

Hi John,


We are closing this thread assuming your issue has been resolved.

We will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.


Regards

Prasanth

