SunSDSE
Novice
257 Views

MPI Benchmarks over 10GbE, InfiniBand, or OmniPath

I am trying to benchmark a number of clusters that we have in operation.  I have no issues with the pure OmniPath cluster, but I do have issues running the MPI Benchmarks on our 10GbE-based cluster and our Mellanox cluster. I have attached the PBS Pro script as well as the output.

The output includes the mpi_info data. It is clear that the ranks are not communicating between ports, and the run throws an error.  What are the proper environment variable settings to force MPI over the 10GbE?

 

UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy, knem/memory - no am bcopy

[proxy:0:0@lssd530-cs09] pmi cmd from fd 6: cmd=abort exitcode=1091215

[0] Abort(1091215) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:

[0] MPIR_Init_thread(138)........: 

[0] MPID_Init(1141)..............: 

[0] MPIDI_OFI_mpi_init_hook(1647): OFI get address vector map failed
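
For reference, this is the general shape of what I have used before to force an Ethernet path. The values are illustrative (eno1 is the interface name on our nodes), and FI_PROVIDER / FI_TCP_IFACE in particular are assumptions on my part, hence the question:

```shell
# Pin Intel MPI to the OFI/TCP path over a specific Ethernet interface.
# Values are illustrative; eno1 is the 10GbE interface on our nodes.
export I_MPI_FABRICS=shm:ofi     # shared memory intra-node, OFI inter-node
export FI_PROVIDER=tcp           # libfabric TCP provider (plain Ethernet)
export I_MPI_HYDRA_IFACE=eno1    # interface Hydra uses for launch/bootstrap
export FI_TCP_IFACE=eno1         # interface the tcp provider binds to
export I_MPI_DEBUG=10            # verbose fabric-selection logging
```

After exporting these, the benchmark is launched as usual, e.g. mpirun -np 4 ./IMB-MPI1 PingPong.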

9 Replies
PrasanthD_intel
Moderator
229 Views

Hi John,


I ran the benchmarks on our Mellanox InfiniBand cluster and they ran fine without any errors.

Thanks for providing the logs. Could you please provide the logs after setting I_MPI_DEBUG=10? That would help us a lot.
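
For example, adding this near the top of the job script before the mpirun line (the variable is standard Intel MPI; level 10 is just a commonly used verbose setting):

```shell
# Enable verbose Intel MPI debug output; level 10 prints the selected
# fabric/provider and the rank-to-node mapping at startup.
export I_MPI_DEBUG=10
```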


Regards

Prasanth


SunSDSE
Novice
221 Views

I have attached the run with the I_MPI_DEBUG=10 set.

My confusion centers on the proper settings for the MPI variables.  The runs I have made in the past only required the fabric and interface to be defined in order to use Ethernet or OmniPath. With this new cluster it does not seem to make the peer-to-peer connections.

I_MPI_FABRICS=ofi
I_MPI_NETMASK=eth
I_MPI_STATS=ilm
MPI_USE_IB=False
PBS_MPI_DEBUG=True
I_MPI_HYDRA_IFACE=eno1
I_MPI_DEBUG=10

PrasanthD_intel
Moderator
201 Views

Hi John,


Some of the environment variables you have set are not needed, as they are the default options. Also, I_MPI_NETMASK is not supported and MPI_USE_IB is not an Intel MPI variable.

Could you unset all of these environment variables and check whether you still get the errors?
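
For example, a minimal sketch using the variable names from your post:

```shell
# Unset the MPI-related variables from the job script so Intel MPI
# falls back to its defaults; keep only the debug flag for logging.
unset I_MPI_FABRICS I_MPI_NETMASK I_MPI_STATS MPI_USE_IB PBS_MPI_DEBUG I_MPI_HYDRA_IFACE
export I_MPI_DEBUG=10
```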


Regards

Prasanth


SunSDSE
Novice
198 Views

No difference.  Removing the variables did not change the result.

PrasanthD_intel
Moderator
152 Views

Hi John,


Could you please provide the following details of your cluster:

i) UCX version

You can get the version with ucx_info -v

ii) Available transports

Command: ucx_info -d | grep Transport

iii) InfiniBand version

Command: lspci | grep -i mellanox

iv) 10GbE interconnect hardware


Regards

Prasanth




SunSDSE
Novice
148 Views

ucx_info -v

# UCT version=1.9.0 revision 1d0a420

# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-java --enable-cma --with-cuda --without-gdrcopy --with-verbs --without-cm --with-knem --with-rdmacm --without-rocm --without-xpmem --without-ugni --with-cuda=/usr/local/cuda-10.2

******

$ ucx_info -d | grep Transport

#   Transport: posix

#   Transport: sysv

#   Transport: self

#   Transport: tcp

#   Transport: cma

#   Transport: knem

 

********

$ lspci | grep -i mellanox

06:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]

06:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]

 

******

The Ethernet is the built-in 10GbE.

 

PrasanthD_intel
Moderator
134 Views

Hi John,


From your reply we can see that your cluster does not have all the transports required for mlx to work. As per this article (Improve Performance and Stability with Intel® MPI Library on...), mlx requires the dc, rc, and ud transports.

Could you please ask your system administrator to install these transports and check whether the error still persists?


Also, could you try the verbs provider for the InfiniBand cluster and let us know if it works? (FI_PROVIDER=verbs)
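
A minimal sketch of the run setup (the fabric setting shown is the usual default and the launch line is up to your job script):

```shell
# Select the libfabric verbs provider for the InfiniBand run; the
# I_MPI_DEBUG output will confirm which provider was actually picked.
export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=verbs
export I_MPI_DEBUG=10
```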


Regards

Prasanth


PrasanthD_intel
Moderator
93 Views

Hi John,


We haven't heard back from you.

Have you installed the transports mentioned above that are required for mlx to work?

Did it solve the issue?

Please let us know.


Regards

Prasanth


PrasanthD_intel
Moderator
28 Views

Hi John,


We are closing this thread assuming your issue has been resolved.

We will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community support only.


Regards

Prasanth