Hi
We have two clusters that are almost identical, except that one is now running Mellanox OFED 4.6 and the other 4.5.
With Intel MPI 2019 Update 6 from the Studio 2020 distribution, the OFED 4.5 cluster works OK, but the OFED 4.6 cluster does not and throws some UCX errors:
]$ cat slurm-151351.out
I_MPI_F77=ifort
I_MPI_PORT_RANGE=60001:61000
I_MPI_F90=ifort
I_MPI_CC=icc
I_MPI_CXX=icpc
I_MPI_DEBUG=999
I_MPI_FC=ifort
I_MPI_HYDRA_BOOTSTRAP=slurm
I_MPI_ROOT=/apps/compilers/intel/2020.0/compilers_and_libraries_2020.0.166/linux/mpi
MPI startup(): Imported environment partly inaccesible. Map=0 Info=0
[0] MPI startup(): libfabric version: 1.9.0a1-impi
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): detected mlx provider, set device name to "mlx"
[0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 1, enable_sep 0, enable_shared_ctxs 0, do_av_insert 1
[0] MPI startup(): addrname_len: 512, addrname_firstlen: 512
[0] MPI startup(): val_max: 4096, part_len: 4095, bc_len: 1030, num_parts: 1
[1578327353.181131] [scs0027:247642:0] select.c:410 UCX ERROR no active messages transport to <no debug data>: mm/posix - Destination is unreachable, mm/sysv - Destination is unreachable, self/self - Destination is unreachable
[1578327353.180508] [scs0088:378614:0] select.c:410 UCX ERROR no active messages transport to <no debug data>: mm/posix - Destination is unreachable, mm/sysv - Destination is unreachable, self/self - Destination is unreachable
Abort(1091471) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........:
MPID_Init(958)...............:
MPIDI_OFI_mpi_init_hook(1382): OFI get address vector map failed
Abort(1091471) on node 2 (rank 2 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........:
MPID_Init(958)...............:
MPIDI_OFI_mpi_init_hook(1382): OFI get address vector map failed
Is this possibly an Intel MPI issue or something at our end (where 2018 and early 2019 versions worked OK)?
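If it helps with diagnosis, we can rerun with extra libfabric/UCX logging turned on (standard knobs, nothing Intel-specific assumed):
export FI_LOG_LEVEL=debug     # libfabric debug logging
export UCX_LOG_LEVEL=debug    # UCX debug logging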
Thanks
A
Hi Ade,
Thanks for reaching out to us. We are working on your issue and will get back to you soon.
-Shubham
Are you encountering this error with every program you are running, or only with certain programs?
Also, if you have installed Intel® Cluster Checker, please run
clck -f ./<nodefile> -F mpi_prereq_user
This will run diagnostic checks related to Intel® MPI Library functionality and help verify that the cluster is configured as expected.
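For example, with a nodefile containing one hostname per line (the hostnames below are placeholders):
printf "node01\nnode02\n" > nodefile    # placeholder node names; use your own
clck -f ./nodefile -F mpi_prereq_user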
It seems to happen with every program, although admittedly I'm only trying noddy examples: 'hello world' and a prime-counting example.
All work on the OFED 4.5 cluster but fail on the OFED 4.6 cluster when Studio 2020 is used.
Cluster Checker is happy except for the logical processor count, as we have the extra logical cores enabled in the BIOS but taken offline at boot on all our systems:
SUMMARY
Command-line: clck -F mpi_prereq_user
Tests Run: mpi_prereq_user
ERROR: 2 tests encountered errors. Information may be incomplete. See
clck_results.log and search for "ERROR" for more information.
Overall Result: 1 issue found - FUNCTIONALITY (1)
--------------------------------------------------------------------------------
2 nodes tested: cdcs[0003-0004]
0 nodes with no issues:
2 nodes with issues: cdcs[0003-0004]
--------------------------------------------------------------------------------
FUNCTIONALITY
The following functionality issues were detected:
1. There is a mismatch between number of available logical cores and maximum
logical cores. Cores '40-79' are offline.
2 nodes: cdcs[0003-0004]
HARDWARE UNIFORMITY
No issues detected.
PERFORMANCE
No issues detected.
SOFTWARE UNIFORMITY
No issues detected.
See clck_results.log for more information.
Hello Ade,
Have you tried to measure the performance of the "mlx" provider with MOFED 4.5? Can you run the standard IMB or OSU benchmarks?
Have you tried any other MPI stacks? OpenMPI is shipped with the MOFED distributions, and you can quickly try any of these benchmarks, which come prebuilt.
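For example, something along these lines (exact paths and process placement are assumptions; adjust to your install):
# IMB ships with Intel MPI; IMB-MPI1 is on PATH once the MPI environment is sourced
mpirun -np 2 -ppn 1 IMB-MPI1 PingPong
# the OSU benchmarks come prebuilt with the MOFED/HPC-X OpenMPI packages
mpirun -np 2 --map-by node osu_bw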
regards
Michael
Hi Michael et al.
We only have this problem with 2020; 2019, 2018, OpenMPI, MPICH, and Mellanox's HPC-X OpenMPI are all OK.
I have now, I think, isolated it to something between the mlx FI_PROVIDER and the MLNX_OFED 4.6 we have. Setting the provider to verbs (sketched below) appears to cure the problem, although that is perhaps less than ideal. Equally, the mlx provider has no issue on the MLNX_OFED 4.5 deployments we have.
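For reference, the workaround is just the following (assuming the Intel MPI 2019 U6 environment is already loaded):
export FI_PROVIDER=verbs          # force the OFI verbs provider instead of mlx
# Intel MPI also accepts the equivalent I_MPI_OFI_PROVIDER=verbs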
Michael - if you are interested in performance separately - rather than just making it work - I can provide some IMB output.
Cheers
Ade
Ade,
In my tests, the verbs provider offers 2-3 GB/s at best, which is really not good (about 6x below line speed for EDR).
Is your CPU Zen 2 or Intel based?
Sure, I'd like to see some numbers :)
regards
Michael
I have the same problem. My architecture is AMD EPYC 7002 series (same behavior with the EPYC 7000 series too when using more than 45 PPN), running CentOS 7.6. The mlx provider doesn't work with 2019 U6. With 2019 U5 and the default provider, which I believe is RxM, jobs crash when using more than 80 PPN; if I use 80 or fewer PPN on 9 nodes, it works without errors. Not sure what is going on.
Error with 2019 U5 when using more than 80 PPN on 7002 series or 45 PPN on 7000 series:
MPIDI_OFI_send_lightweight_request:
(unknown)(): Other MPI error
Error with 2019 U6 on 7002 series with MLX FI_PROVIDER:
MPIDI_OFI_send_lightweight_request:
(unknown)(): Other MPI error
and an ADDR_INFO error
Furthermore, when using the mlx provider, fi_info returns error -61.
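In case it is useful to anyone else, fi_info ships with libfabric and shows what each provider actually exposes (standard utility, nothing site-specific):
fi_info -l        # list the providers libfabric can see
fi_info -p mlx    # query the mlx provider specifically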
@Dmitry_S_Intel That works for me, thanks.
For me the problem only occurred when I launched on more than 10 nodes.
But what does your suggestion mean? The last thing I want is my nodes communicating over the Ethernet connection. Can you please explain whether that is the case?
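For what it's worth, I can list the transports UCX itself sees on a node with ucx_info, which ships with UCX:
ucx_info -d | grep -i transport    # shows the transports UCX can use on this node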
Hello, how was this variable implemented in the script? As shown below? I am also receiving the "OFI get address vector map failed" error.
export UCX_TLS=ud,sm,self
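i.e. roughly like this in the batch script (a minimal sketch; node/task counts and the binary name are placeholders):
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40
export UCX_TLS=ud,sm,self    # set before the MPI launch
mpirun ./my_mpi_app          # placeholder application binary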