Hi
We have two clusters that are almost identical, except that one is now running Mellanox OFED 4.6 and the other 4.5.
With Intel MPI 2019 Update 6 from the Studio 2020 distribution, the OFED 4.5 cluster works OK, but the OFED 4.6 cluster does not and throws some UCX errors:
]$ cat slurm-151351.out
I_MPI_F77=ifort
I_MPI_PORT_RANGE=60001:61000
I_MPI_F90=ifort
I_MPI_CC=icc
I_MPI_CXX=icpc
I_MPI_DEBUG=999
I_MPI_FC=ifort
I_MPI_HYDRA_BOOTSTRAP=slurm
I_MPI_ROOT=/apps/compilers/intel/2020.0/compilers_and_libraries_2020.0.166/linux/mpi
MPI startup(): Imported environment partly inaccesible. Map=0 Info=0
[0] MPI startup(): libfabric version: 1.9.0a1-impi
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): detected mlx provider, set device name to "mlx"
[0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 1, enable_sep 0, enable_shared_ctxs 0, do_av_insert 1
[0] MPI startup(): addrname_len: 512, addrname_firstlen: 512
[0] MPI startup(): val_max: 4096, part_len: 4095, bc_len: 1030, num_parts: 1
[1578327353.181131] [scs0027:247642:0] select.c:410 UCX ERROR no active messages transport to <no debug data>: mm/posix - Destination is unreachable, mm/sysv - Destination is unreachable, self/self - Destination is unreachable
[1578327353.180508] [scs0088:378614:0] select.c:410 UCX ERROR no active messages transport to <no debug data>: mm/posix - Destination is unreachable, mm/sysv - Destination is unreachable, self/self - Destination is unreachable
Abort(1091471) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........:
MPID_Init(958)...............:
MPIDI_OFI_mpi_init_hook(1382): OFI get address vector map failed
Abort(1091471) on node 2 (rank 2 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........:
MPID_Init(958)...............:
MPIDI_OFI_mpi_init_hook(1382): OFI get address vector map failed
Is this possibly an Intel MPI issue or something at our end (where 2018 and early 2019 versions worked OK)?
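If it helps with diagnosis, we can rerun with extra libfabric/UCX logging turned on (standard knobs, nothing Intel-specific assumed):
export FI_LOG_LEVEL=debug     # libfabric debug logging
export UCX_LOG_LEVEL=debug    # UCX debug logging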
Thanks
A
Hi Ade,
Thanks for reaching out to us. We are working on your issue and will get back to you soon.
-Shubham
Are you encountering this error with every program you are running, or only with certain programs?
Also, if you have installed Intel® Cluster Checker, please run
clck -f ./<nodefile> -F mpi_prereq_user
This will run diagnostic checks related to Intel® MPI Library functionality and help verify that the cluster is configured as expected.
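For example, with a nodefile containing one hostname per line (the hostnames below are placeholders):
printf "node01\nnode02\n" > nodefile    # placeholder node names; use your own
clck -f ./nodefile -F mpi_prereq_user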
It seems to happen with every program, although admittedly I'm only trying noddy examples: 'hello world' and a prime-counting example.
All work on the OFED 4.5 cluster but fail on the OFED 4.6 cluster when Studio 2020 is used.
Cluster Checker is happy except for the logical processor count, as we have the extra logical cores enabled in the BIOS but taken offline at boot on all our systems:
SUMMARY
Command-line: clck -F mpi_prereq_user
Tests Run: mpi_prereq_user
ERROR: 2 tests encountered errors. Information may be incomplete. See
clck_results.log and search for "ERROR" for more information.
Overall Result: 1 issue found - FUNCTIONALITY (1)
--------------------------------------------------------------------------------
2 nodes tested: cdcs[0003-0004]
0 nodes with no issues:
2 nodes with issues: cdcs[0003-0004]
--------------------------------------------------------------------------------
FUNCTIONALITY
The following functionality issues were detected:
1. There is a mismatch between number of available logical cores and maximum
logical cores. Cores '40-79' are offline.
2 nodes: cdcs[0003-0004]
HARDWARE UNIFORMITY
No issues detected.
PERFORMANCE
No issues detected.
SOFTWARE UNIFORMITY
No issues detected.
See clck_results.log for more information.
Hello Ade,
Have you tried to measure the performance of the "mlx" provider with MOFED 4.5? Can you run the standard IMB or OSU benchmarks?
Have you tried any other MPI stacks? OpenMPI is shipped with the MOFED distributions, and you can quickly try any of these benchmarks, which come prebuilt.
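For example, something along these lines (exact paths and process placement are assumptions; adjust to your install):
# IMB ships with Intel MPI; IMB-MPI1 is on PATH once the MPI environment is sourced
mpirun -np 2 -ppn 1 IMB-MPI1 PingPong
# the OSU benchmarks come prebuilt with the MOFED/HPC-X OpenMPI packages
mpirun -np 2 --map-by node osu_bw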
regards
Michael
Hi Michael et al.
We only have this problem with 2020; 2019, 2018, OpenMPI, MPICH, and Mellanox's HPC-X OpenMPI are all OK.
I have now, I think, isolated it to something between the mlx FI_PROVIDER and the MLNX_OFED 4.6 we have. Setting the provider to verbs (sketched below) appears to cure the problem, although that is perhaps less than ideal. Equally, the mlx provider has no issue on the MLNX_OFED 4.5 deployments we have.
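For reference, the workaround is just the following (assuming the Intel MPI 2019 U6 environment is already loaded):
export FI_PROVIDER=verbs          # force the OFI verbs provider instead of mlx
# Intel MPI also accepts the equivalent I_MPI_OFI_PROVIDER=verbs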
Michael - if you are interested in performance separately - rather than just making it work - I can provide some IMB output.
Cheers
Ade
Ade,
In my tests, the verbs provider offers 2-3 GB/s at best, which is really not good (about 6x below line speed for EDR).
Is your CPU Zen 2 or Intel based?
Sure, I'd like to see some numbers :)
regards
Michael
I have the same problem. My architecture is AMD EPYC 7002 series (same behavior with the EPYC 7000 series too when using more than 45 PPN), running CentOS 7.6. The mlx provider doesn't work with 2019 U6. With 2019 U5 and the default provider, which I believe is RxM, jobs crash when using more than 80 PPN; if I use 80 or fewer PPN on 9 nodes, it works without errors. Not sure what is going on.
Error with 2019 U5 when using more than 80 PPN on 7002 series or 45 PPN on 7000 series:
MPIDI_OFI_send_lightweight_request:
(unknown)(): Other MPI error
Error with 2019 U6 on 7002 series with MLX FI_PROVIDER:
MPIDI_OFI_send_lightweight_request:
(unknown)(): Other MPI error
and an ADDR_INFO error
Furthermore, when using the mlx provider, fi_info returns error -61.
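In case it is useful to anyone else, fi_info ships with libfabric and shows what each provider actually exposes (standard utility, nothing site-specific):
fi_info -l        # list the providers libfabric can see
fi_info -p mlx    # query the mlx provider specifically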
@Dmitry_S_Intel That works for me, thanks.
For me the problem only occurred when I launched on more than 10 nodes.
But what does your suggestion mean? The last thing I want is my nodes communicating over the Ethernet connection. Can you please explain whether that is the case?
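For what it's worth, I can list the transports UCX itself sees on a node with ucx_info, which ships with UCX:
ucx_info -d | grep -i transport    # shows the transports UCX can use on this node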
Hello, how was this variable implemented in the script? As shown below? I am also receiving the "OFI get address vector map failed" error.
export UCX_TLS=ud,sm,self
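i.e. roughly like this in the batch script (a minimal sketch; node/task counts and the binary name are placeholders):
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40
export UCX_TLS=ud,sm,self    # set before the MPI launch
mpirun ./my_mpi_app          # placeholder application binary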