I have provisioned an InfiniBand (IB) node on Azure that uses the MLNX OFED driver:
$ ofed_info -s
MLNX_OFED_LINUX-23.10-3.2.2.0:
$ uname -r
4.18.0-553.16.1.el8_10.x86_64
$ mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2021.6 Build 20220227 (id: 28877f3f32)
Copyright 2003-2022, Intel Corporation.
$ mpirun -r ssh --host c1pib000001459,c1pib000001460 -np 8 -genv I_MPI_LARGE_SCALE_THRESHOLD 0 -genv MPI_HEALTHCHECK_TIMEOUT 120 -genv I_MPI_SPAWN 1 -genv I_MPI_PLATFORM auto -genv I_MPI_LIBRARY_KIND release -genv I_MPI_DEBUG 100 -genv I_MPI_ADJUST_ALLTOALLV 1 IMB-MPI1 -iter 10 -npmin 8 -msglen ./messages Allreduce
[0] MPI startup(): Run 'pmi_process_mapping' nodemap algorithm
[0] MPI startup(): Intel(R) MPI Library, Version 2021.6 Build 20220227 (id: 28877f3f32)
[0] MPI startup(): Copyright (C) 2003-2022 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
libfabric:9780:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:9780:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:9780:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:9780:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:9780:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:9780:core:core:ofi_register_provider():502<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:9780:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:9780:core:core:ofi_register_provider():502<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:9780:core:core:ofi_register_provider():474<info> registering provider: tcp (113.20)
libfabric:9780:core:core:ofi_register_provider():502<info> "tcp" filtered by provider include/exclude list, skipping
libfabric:9780:core:core:ofi_register_provider():474<info> registering provider: sockets (113.20)
libfabric:9780:core:core:ofi_register_provider():502<info> "sockets" filtered by provider include/exclude list, skipping
libfabric:9780:core:core:ofi_hmem_init():222<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:9780:core:core:ofi_hmem_init():222<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:9780:core:core:ofi_hmem_init():222<info> Hmem iface FI_HMEM_ZE not supported
libfabric:9780:core:core:ofi_register_provider():474<info> registering provider: shm (114.0)
libfabric:9780:core:core:ofi_register_provider():502<info> "shm" filtered by provider include/exclude list, skipping
libfabric:9780:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:9780:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:9780:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:9780:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:9780:core:core:ofi_register_provider():474<info> registering provider: ofi_rxm (113.20)
libfabric:9780:psm3:core:fi_prov_ini():752<info> build options: VERSION=1102.0=11.2.0.0, HAVE_PSM3_src=1, PSM3_CUDA=0
libfabric:9780:core:core:ofi_register_provider():474<info> registering provider: psm3 (1102.0)
libfabric:9780:core:core:ofi_register_provider():502<info> "psm3" filtered by provider include/exclude list, skipping
libfabric:9780:core:core:ofi_register_provider():474<info> registering provider: ofi_hook_noop (113.20)
libfabric:9780:core:core:fi_getinfo_():1123<warn> Can't find provider with the highest priority
Abort(2139535) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(178)........:
MPID_Init(1532)..............:
MPIDI_OFI_mpi_init_hook(1512):
open_fabric(2566)............:
find_provider(2684)..........: OFI fi_getinfo() failed (ofi_init.c:2684:find_provider:No data available)
Abort(2139535) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(178)........:
MPID_Init(1532)..............:
MPIDI_OFI_mpi_init_hook(1512):
open_fabric(2566)............:
find_provider(2684)..........: OFI fi_getinfo() failed (ofi_init.c:2684:find_provider:No data available)
$ fi_info -l
psm3:
version: 1102.0
ofi_rxm:
version: 113.20
verbs:
version: 113.20
verbs:
version: 113.20
tcp:
version: 113.20
sockets:
version: 113.20
shm:
version: 114.0
ofi_hook_noop:
version: 113.20
The environment variables are:
FI_PROVIDER_PATH=[ArchBase]/intel-mpi/intel64/libfabric/lib/prov
I_MPI_ADJUST_ALLTOALLV=1
I_MPI_DEBUG=100
I_MPI_FABRICS=shm:ofi
I_MPI_LIBRARY_KIND=release
I_MPI_PLATFORM=auto
I_MPI_ROOT=[ArchBase]/intel-mpi
I_MPI_SPAWN=1
LIBRARY_PATH=[ArchBase]/intel-mpi/intel64
MPI_HEALTHCHECK_TIMEOUT=120
ZTOMO_FORCE_MPI_2016=no
ZTOMO_I_MPI_LARGE_SCALE_THRESHOLD=0
ZTOMO_I_MPI_ROOT=[ArchBase]/intel-mpi-2016
Why is no mlx provider shown here? What can I do to troubleshoot this further and get it working with mlx? Note that the run works fine with the "tcp" provider, but I want to leverage IB.
1 Reply
@HimeshKothari
You are setting I_MPI_ROOT to a different location than FI_PROVIDER_PATH:
FI_PROVIDER_PATH=[ArchBase]/intel-mpi/intel64/libfabric/lib/prov
ZTOMO_I_MPI_ROOT=[ArchBase]/intel-mpi-2016
That might be your problem. Please upgrade to the latest Intel MPI release, 2021.14.1, as I cannot provide any further help with 2021.6.
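For anyone who wants to check the path mismatch mentioned above before upgrading, here is a minimal sketch. The helper names `path_is_under` and `has_provider_lib` are made up for this example and are not part of Intel MPI. It also assumes that Intel MPI ships each libfabric provider as a `lib<name>-fi.so` file inside the provider directory, so that a missing `libmlx-fi.so` would explain why `fi_info -l` lists no mlx entry; please verify that against your own installation.

```shell
#!/bin/sh
# Hypothetical helpers for a quick sanity check; not part of Intel MPI.

# Succeed if directory $1 is the same as, or located below, directory $2.
path_is_under() {
    case "$1/" in
        "$2"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeed if a provider plugin file lib<name>-fi.so exists in directory $1.
# Assumption: providers are shipped as lib<name>-fi.so files.
has_provider_lib() {
    [ -e "$1/lib$2-fi.so" ]
}

# Only report when both variables are set in the current shell.
if [ -n "${I_MPI_ROOT:-}" ] && [ -n "${FI_PROVIDER_PATH:-}" ]; then
    if path_is_under "$FI_PROVIDER_PATH" "$I_MPI_ROOT"; then
        echo "FI_PROVIDER_PATH is inside I_MPI_ROOT"
    else
        echo "FI_PROVIDER_PATH is outside I_MPI_ROOT"
    fi
    if has_provider_lib "$FI_PROVIDER_PATH" mlx; then
        echo "mlx provider library found"
    else
        echo "mlx provider library not found in FI_PROVIDER_PATH"
    fi
fi
```

Run it in the same shell (or job script) that launches `mpirun`, so it sees the same environment the MPI ranks will inherit. If the mlx library is reported missing, the installation or the provider path is the first thing to look at.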
