
Intel MPI mlx provider issue

Antonio_D

I have a program compiled with oneAPI 2025.1 that runs just fine with I_MPI_OFI_PROVIDER=verbs (or any other provider, really), but it will not run with I_MPI_OFI_PROVIDER=mlx.
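
For reference, the job is launched under Slurm roughly like this (the node count, task count, and binary name below are illustrative placeholders, not my exact command line):

export I_MPI_OFI_PROVIDER=mlx
export I_MPI_DEBUG=30
srun -N 1 -n 12 ./my_app   # ./my_app stands in for the actual executable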

I_MPI_DEBUG=30 output:

[0] MPI startup(): PMI API: pmix
[0] MPI startup(): PMIx version: OpenPMIx 5.0.7 (PMIx Standard: 5.1, Stable ABI: 5.0, Provisional ABI: 5.0)
[0] MPI startup(): Intel(R) MPI Library, Version 2021.15  Build 20250213 (id: d233448)
[0] MPI startup(): Copyright (C) 2003-2025 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric loaded: libfabric.so.1 
[0] MPI startup(): libfabric version: 1.21.0-impi
libfabric:1780409:1744907435::core:core:ofi_register_provider():530<info> registering provider: verbs (121.0)
libfabric:1780409:1744907435::core:core:ofi_register_provider():557<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:1780409:1744907435::core:core:ofi_register_provider():530<info> registering provider: verbs (121.0)
libfabric:1780409:1744907435::core:core:ofi_register_provider():557<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:1780409:1744907435::core:core:ofi_register_provider():530<info> registering provider: tcp (121.0)
libfabric:1780409:1744907435::core:core:ofi_register_provider():557<info> "tcp" filtered by provider include/exclude list, skipping
libfabric:1780409:1744907435::core:core:ofi_register_provider():530<info> registering provider: shm (200.0)
libfabric:1780409:1744907435::core:core:ofi_register_provider():557<info> "shm" filtered by provider include/exclude list, skipping
libfabric:1780409:1744907435::core:core:ofi_register_provider():530<info> registering provider: ofi_rxm (121.0)
libfabric:1780409:1744907435::core:core:ofi_register_provider():530<info> registering provider: psm2 (121.0)
libfabric:1780409:1744907435::core:core:ofi_register_provider():557<info> "psm2" filtered by provider include/exclude list, skipping
libfabric:1780409:1744907435::core:core:ofi_register_provider():530<info> registering provider: psm3 (707.0)
libfabric:1780409:1744907435::core:core:ofi_register_provider():557<info> "psm3" filtered by provider include/exclude list, skipping
libfabric:1780409:1744907435::core:core:ofi_register_provider():530<info> registering provider: mlx (1.4)
libfabric:1780409:1744907435::core:core:ofi_reg_dl_prov():675<warn> dlopen(/projects/site/gred/smpg/software/oneAPI/2025.1/mpi/2021.15/opt/mpi/libfabric/lib/prov/libefa-fi.so): libefa.so.1: cannot open shared object file: No such file or directory
libfabric:1780409:1744907435::core:core:ofi_register_provider():530<info> registering provider: ofi_hook_noop (121.0)
libfabric:1780409:1744907435::core:core:ofi_register_provider():530<info> registering provider: off_coll (121.0)
libfabric:1780409:1744907435::core:core:fi_getinfo_():1449<info> Found provider with the highest priority mlx, must_use_util_prov = 0
[0] MPI startup(): max_ch4_vnis: 1, max_reg_eps 64, enable_sep 0, enable_shared_ctxs 0, do_av_insert 0
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
libfabric:1780409:1744907435::core:core:fi_getinfo_():1449<info> Found provider with the highest priority mlx, must_use_util_prov = 0
[0] MPI startup(): libfabric provider: mlx
libfabric:1780409:1744907435::core:core:fi_fabric_():1745<info> Opened fabric: mlx
libfabric:1780409:1744907435::core:core:fi_fabric_():1756<info> Using mlx provider 1.21, path:/projects/site/gred/smpg/software/oneAPI/2025.1/mpi/2021.15/opt/mpi/libfabric/lib/prov/libmlx-fi.so
[0] MPI startup(): addrnamelen: 1024
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Unknown error class, error stack:
MPIR_Init_thread(196)........: 
MPID_Init(1719)..............: 
MPIDI_OFI_mpi_init_hook(1741): 
MPIDU_bc_table_create(340)...: Missing hostname or invalid host/port description in business card
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Unknown error class, error stack:
MPIR_Init_thread(196)........: 
MPID_Init(1719)..............: 
MPIDI_OFI_mpi_init_hook(1741): 
MPIDU_bc_table_create(340)...: Missing hostname or invalid host/port description in business card
slurmstepd: error: *** STEP 16814.0 ON sc1nc124 CANCELLED AT 2025-04-17T09:30:36 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: sc1nc124: tasks 0-11: Killed

ucx_info -v:

# Library version: 1.16.0
# Library path: /usr/lib64/libucs.so.0
# API headers version: 1.16.0
# Git branch '', revision 02432d3
# Configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-go --without-java --enable-cma --with-cuda --with-gdrcopy --with-verbs --with-knem --with-rdmacm --without-rocm --with-xpmem --without-fuse3 --without-ugni --with-cuda=/usr/local/cuda-12.2

ucx_info -d | grep Transport:

#      Transport: self
#      Transport: tcp
#      Transport: tcp
#      Transport: tcp
#      Transport: sysv
#      Transport: posix
#      Transport: dc_mlx5
#      Transport: rc_verbs
#      Transport: rc_mlx5
#      Transport: ud_verbs
#      Transport: ud_mlx5
#      Transport: cma
#      Transport: xpmem

From all of the output above, mlx looks like it should be available. This looks like a cluster configuration issue, but I don't know where to start troubleshooting. The Slurm job scheduler is in use.
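
If it helps narrow things down, I can also query libfabric outside of MPI; a quick check along these lines (assuming the fi_info utility bundled with Intel MPI is on the PATH after sourcing the oneAPI environment) should show whether the mlx provider enumerates at all:

# Ask libfabric to enumerate only the mlx provider, with verbose logging
FI_LOG_LEVEL=debug fi_info -p mlx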
