
Problem starting Intel oneAPI MPI 2021.10 library on more than one node on Linux

Frank_R_1
Beginner

Hi,

We have encountered problems with the Intel oneAPI MPI 2021.10 library on both of these systems
Intel Xeon, Red Hat Enterprise Linux Workstation release 7.9 (Maipo)
AMD Epyc, Red Hat Enterprise Linux release 8.8 (Ootpa)
whenever we try to start processes on more than one compute node.

The problem arises on the Intel Xeon system as well as on the AMD Epyc system; see the attached cpuinfo outputs.
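
For context, the failure already occurs in MPI_Init; the test program itself only prints its rank and does a trivial send/receive. The jobs are launched essentially like this (the binary name is only a placeholder; host names and I_MPI_* settings are the ones reported in the debug output below):

# Two ranks, one per node, bootstrapped over ssh as in the Hydra launch arguments below.
export I_MPI_DEBUG=500
export I_MPI_HYDRA_DEBUG=500
export I_MPI_HYDRA_TOPOLIB=hwloc
export I_MPI_HYDRA_BSTRAP_KEEP_ALIVE=1
export I_MPI_PIN_DOMAIN=numa
export I_MPI_FABRICS=shm:ofi
export I_MPI_OFI_PROVIDER=mlx
export I_MPI_CBWR=2
# "mpi_test" stands in for our actual test binary.
mpiexec -launcher ssh -hosts asrv0de102,asrv0de103 -n 2 -ppn 1 ./mpi_test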

Essentially the problem is:
[mpiexec@asrv0de102.corpdir.zz] Launch arguments: /clusterhead/projects/magma/v6.0.0.2-25420/v6.0.0/LINUX64/impi/bin//hydra_bstrap_proxy --upstream-host asrv0de102.corpdir.zz --upstream-port 44722 --pgid 0 --launcher ssh --launcher-number 0 --base-path /clusterhead/projects/magma/v6.0.0.2-25420/v6.0.0/LINUX64/impi/bin/ --tree-width 16 --tree-level 1 --time-left -1 --launch-type 2 --debug --proxy-id 0 --node-id 0 --subtree-size 1 --upstream-fd 7 /clusterhead/projects/magma/v6.0.0.2-25420/v6.0.0/LINUX64/impi/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
[mpiexec@asrv0de102.corpdir.zz] Launch arguments: /usr/bin/ssh -q -x asrv0de103 /clusterhead/projects/magma/v6.0.0.2-25420/v6.0.0/LINUX64/impi/bin//hydra_bstrap_proxy --upstream-host asrv0de102.corpdir.zz --upstream-port 44722 --pgid 0 --launcher ssh --launcher-number 0 --base-path /clusterhead/projects/magma/v6.0.0.2-25420/v6.0.0/LINUX64/impi/bin/ --tree-width 16 --tree-level 1 --time-left -1 --launch-type 2 --debug --proxy-id 1 --node-id 1 --subtree-size 1 /clusterhead/projects/magma/v6.0.0.2-25420/v6.0.0/LINUX64/impi/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
[proxy:0:0@asrv0de102.corpdir.zz] pmi cmd from fd 6: cmd=init pmi_version=1 pmi_subversion=1
[proxy:0:0@asrv0de102.corpdir.zz] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0@asrv0de102.corpdir.zz] pmi cmd from fd 6: cmd=get_maxes
[proxy:0:0@asrv0de102.corpdir.zz] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=4096
[proxy:0:0@asrv0de102.corpdir.zz] pmi cmd from fd 6: cmd=get_appnum
[proxy:0:0@asrv0de102.corpdir.zz] PMI response: cmd=appnum appnum=0
[proxy:0:0@asrv0de102.corpdir.zz] pmi cmd from fd 6: cmd=get_my_kvsname
[proxy:0:0@asrv0de102.corpdir.zz] PMI response: cmd=my_kvsname kvsname=kvs_15259_0
MPI startup(): Run 'pmi_process_mapping' nodemap algorithm
[proxy:0:0@asrv0de102.corpdir.zz] pmi cmd from fd 6: cmd=get kvsname=kvs_15259_0 key=PMI_process_mapping
[proxy:0:0@asrv0de102.corpdir.zz] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,2,1))
[0] MPI startup(): Intel(R) MPI Library, Version 2021.10 Build 20230619 (id: c2e19c2f3e)
[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[proxy:0:0@asrv0de102.corpdir.zz] pmi cmd from fd 6: cmd=barrier_in
[proxy:0:1@asrv0de103.corpdir.zz] pmi cmd from fd 4: cmd=init pmi_version=1 pmi_subversion=1
[proxy:0:1@asrv0de103.corpdir.zz] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:1@asrv0de103.corpdir.zz] pmi cmd from fd 4: cmd=get_maxes
[proxy:0:1@asrv0de103.corpdir.zz] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=4096
[proxy:0:1@asrv0de103.corpdir.zz] pmi cmd from fd 4: cmd=get_appnum
[proxy:0:1@asrv0de103.corpdir.zz] PMI response: cmd=appnum appnum=0
[proxy:0:1@asrv0de103.corpdir.zz] pmi cmd from fd 4: cmd=get_my_kvsname
[proxy:0:1@asrv0de103.corpdir.zz] PMI response: cmd=my_kvsname kvsname=kvs_15259_0
[proxy:0:1@asrv0de103.corpdir.zz] pmi cmd from fd 4: cmd=get kvsname=kvs_15259_0 key=PMI_process_mapping
[proxy:0:1@asrv0de103.corpdir.zz] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,2,1))
[proxy:0:0@asrv0de102.corpdir.zz] PMI response: cmd=barrier_out
[proxy:0:1@asrv0de103.corpdir.zz] pmi cmd from fd 4: cmd=barrier_in
[proxy:0:1@asrv0de103.corpdir.zz] PMI response: cmd=barrier_out
[0] MPI startup(): libfabric loaded: libfabric.so.1
[0] MPI startup(): libfabric version: 1.18.0-impi
libfabric:15264:1697807008::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:15264:1697807008::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:27535:1697807008::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:27535:1697807008::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:15264:1697807008::core:core:ze_hmem_dl_init():497<warn> Failed to dlopen libze_loader.so
libfabric:27535:1697807008::core:core:ze_hmem_dl_init():497<warn> Failed to dlopen libze_loader.so
libfabric:27535:1697807008::core:core:ofi_hmem_init():421<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:15264:1697807008::core:core:ofi_hmem_init():421<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:27535:1697807008::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:27535:1697807008::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:15264:1697807008::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:15264:1697807008::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:15264:1697807008::core:mr:ofi_default_cache_size():79<info> default cache size=7507620522
libfabric:27535:1697807008::core:mr:ofi_default_cache_size():79<info> default cache size=7507620522
libfabric:27535:1697807008::core:core:ofi_register_provider():476<info> registering provider: ofi_hook_noop (118.0)
libfabric:27535:1697807008::core:core:ofi_register_provider():476<info> registering provider: off_coll (118.0)
libfabric:27535:1697807008::core:core:fi_getinfo_():1338<warn> Can't find provider with the highest priority
libfabric:15264:1697807008::core:core:ofi_register_provider():476<info> registering provider: ofi_hook_noop (118.0)
libfabric:15264:1697807008::core:core:ofi_register_provider():476<info> registering provider: off_coll (118.0)
libfabric:15264:1697807008::core:core:fi_getinfo_():1338<warn> Can't find provider with the highest priority
Abort(2139535) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(176)........:
MPID_Init(1548)..............:
MPIDI_OFI_mpi_init_hook(1592):
open_fabric(2650)............:
find_provider(2794)..........: OFI fi_getinfo() failed (ofi_init.c:2794:find_provider:No data available)
[proxy:0:0@asrv0de102.corpdir.zz] pmi cmd from fd 6: cmd=abort exitcode=2139535
[proxy:0:1@asrv0de103.corpdir.zz] pmi cmd from fd 4: cmd=abort exitcode=2139535
Abort(2139535) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(176)........:
MPID_Init(1548)..............:
MPIDI_OFI_mpi_init_hook(1592):
open_fabric(2650)............:
find_provider(2794)..........: OFI fi_getinfo() failed (ofi_init.c:2794:find_provider:No data available)
[mpiexec@asrv0de102.corpdir.zz] Exit codes: [asrv0de102:0] 547720960
[asrv0de103:0] 547720960

With Intel oneAPI MPI 2021.7/2021.9, in contrast, we get the following on the same systems:
[mpiexec@asrv0de102.corpdir.zz] Launch arguments: /clusterhead/software/MAGMA/MAGMA60/v6.0.0.3-26210/v6.0.0/LINUX64/impi/bin//hydra_bstrap_proxy --upstream-host asrv0de102.corpdir.zz --upstream-port 38748 --pgid 0 --launcher ssh --launcher-number 0 --base-path /clusterhead/software/MAGMA/MAGMA60/v6.0.0.3-26210/v6.0.0/LINUX64/impi/bin/ --tree-width 16 --tree-level 1 --time-left -1 --launch-type 2 --debug --proxy-id 0 --node-id 0 --subtree-size 1 --upstream-fd 7 /clusterhead/software/MAGMA/MAGMA60/v6.0.0.3-26210/v6.0.0/LINUX64/impi/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
[mpiexec@asrv0de102.corpdir.zz] Launch arguments: /usr/bin/ssh -q -x asrv0de103 /clusterhead/software/MAGMA/MAGMA60/v6.0.0.3-26210/v6.0.0/LINUX64/impi/bin//hydra_bstrap_proxy --upstream-host asrv0de102.corpdir.zz --upstream-port 38748 --pgid 0 --launcher ssh --launcher-number 0 --base-path /clusterhead/software/MAGMA/MAGMA60/v6.0.0.3-26210/v6.0.0/LINUX64/impi/bin/ --tree-width 16 --tree-level 1 --time-left -1 --launch-type 2 --debug --proxy-id 1 --node-id 1 --subtree-size 1 /clusterhead/software/MAGMA/MAGMA60/v6.0.0.3-26210/v6.0.0/LINUX64/impi/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
[proxy:0:1@asrv0de103.corpdir.zz] pmi cmd from fd 4: cmd=init pmi_version=1 pmi_subversion=1
[proxy:0:1@asrv0de103.corpdir.zz] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:1@asrv0de103.corpdir.zz] pmi cmd from fd 4: cmd=get_maxes
[proxy:0:1@asrv0de103.corpdir.zz] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=4096
[proxy:0:1@asrv0de103.corpdir.zz] pmi cmd from fd 4: cmd=get_appnum
[proxy:0:1@asrv0de103.corpdir.zz] PMI response: cmd=appnum appnum=0
[proxy:0:1@asrv0de103.corpdir.zz] pmi cmd from fd 4: cmd=get_my_kvsname
[proxy:0:1@asrv0de103.corpdir.zz] PMI response: cmd=my_kvsname kvsname=kvs_15325_0
[proxy:0:1@asrv0de103.corpdir.zz] pmi cmd from fd 4: cmd=get kvsname=kvs_15325_0 key=PMI_process_mapping
[proxy:0:1@asrv0de103.corpdir.zz] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,2,1))
[proxy:0:1@asrv0de103.corpdir.zz] pmi cmd from fd 4: cmd=barrier_in
[proxy:0:0@asrv0de102.corpdir.zz] pmi cmd from fd 6: cmd=init pmi_version=1 pmi_subversion=1
[proxy:0:0@asrv0de102.corpdir.zz] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0@asrv0de102.corpdir.zz] pmi cmd from fd 6: cmd=get_maxes
[proxy:0:0@asrv0de102.corpdir.zz] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=4096
[proxy:0:0@asrv0de102.corpdir.zz] pmi cmd from fd 6: cmd=get_appnum
[proxy:0:0@asrv0de102.corpdir.zz] PMI response: cmd=appnum appnum=0
[proxy:0:0@asrv0de102.corpdir.zz] pmi cmd from fd 6: cmd=get_my_kvsname
[proxy:0:0@asrv0de102.corpdir.zz] PMI response: cmd=my_kvsname kvsname=kvs_15325_0
[proxy:0:0@asrv0de102.corpdir.zz] pmi cmd from fd 6: cmd=get kvsname=kvs_15325_0 key=PMI_process_mapping
[proxy:0:0@asrv0de102.corpdir.zz] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,2,1))
[0] MPI startup(): Run 'pmi_process_mapping' nodemap algorithm
[0] MPI startup(): Intel(R) MPI Library, Version 2021.7 Build 20220909 (id: 6b6f6425df)
[0] MPI startup(): Copyright (C) 2003-2022 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[proxy:0:0@asrv0de102.corpdir.zz] pmi cmd from fd 6: cmd=barrier_in
[proxy:0:0@asrv0de102.corpdir.zz] PMI response: cmd=barrier_out
[proxy:0:1@asrv0de103.corpdir.zz] PMI response: cmd=barrier_out
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
libfabric:27577:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:27577:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:15330:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:15330:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:27577:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:27577:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:15330:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:15330:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:27577:core:mr:ofi_default_cache_size():78<info> default cache size=7507620522
libfabric:15330:core:mr:ofi_default_cache_size():78<info> default cache size=7507620522
libfabric:15330:core:core:ofi_register_provider():474<info> registering provider: psm2 (113.20)
libfabric:15330:core:core:ofi_register_provider():502<info> "psm2" filtered by provider include/exclude list, skipping
libfabric:27577:core:core:ofi_register_provider():474<info> registering provider: psm2 (113.20)
libfabric:27577:core:core:ofi_register_provider():502<info> "psm2" filtered by provider include/exclude list, skipping
libfabric:15330:core:core:ofi_register_provider():474<info> registering provider: mlx (1.4)
libfabric:27577:core:core:ofi_register_provider():474<info> registering provider: mlx (1.4)
libfabric:15330:psm3:core:fi_prov_ini():752<info> build options: VERSION=1102.0=11.2.0.0, HAVE_PSM3_src=1, PSM3_CUDA=0
libfabric:27577:psm3:core:fi_prov_ini():752<info> build options: VERSION=1102.0=11.2.0.0, HAVE_PSM3_src=1, PSM3_CUDA=0
libfabric:15330:core:core:ofi_register_provider():474<info> registering provider: psm3 (1102.0)
libfabric:15330:core:core:ofi_register_provider():502<info> "psm3" filtered by provider include/exclude list, skipping
libfabric:27577:core:core:ofi_register_provider():474<info> registering provider: psm3 (1102.0)
libfabric:27577:core:core:ofi_register_provider():502<info> "psm3" filtered by provider include/exclude list, skipping
libfabric:27577:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:27577:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:27577:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:27577:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:27577:core:core:ofi_register_provider():474<info> registering provider: ofi_rxm (113.20)
libfabric:15330:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:15330:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:15330:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:15330:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:15330:core:core:ofi_register_provider():474<info> registering provider: ofi_rxm (113.20)
libfabric:15330:core:core:ofi_register_provider():474<info> registering provider: tcp (113.20)
libfabric:15330:core:core:ofi_register_provider():502<info> "tcp" filtered by provider include/exclude list, skipping
libfabric:27577:core:core:ofi_register_provider():474<info> registering provider: tcp (113.20)
libfabric:27577:core:core:ofi_register_provider():502<info> "tcp" filtered by provider include/exclude list, skipping
libfabric:15330:core:core:ofi_register_provider():474<info> registering provider: sockets (113.20)
libfabric:15330:core:core:ofi_register_provider():502<info> "sockets" filtered by provider include/exclude list, skipping
libfabric:27577:core:core:ofi_register_provider():474<info> registering provider: sockets (113.20)
libfabric:27577:core:core:ofi_register_provider():502<info> "sockets" filtered by provider include/exclude list, skipping
libfabric:15330:core:core:ofi_hmem_init():222<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:15330:core:core:ofi_hmem_init():222<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:15330:core:core:ofi_hmem_init():222<info> Hmem iface FI_HMEM_ZE not supported
libfabric:15330:core:core:ofi_register_provider():474<info> registering provider: shm (114.0)
libfabric:15330:core:core:ofi_register_provider():502<info> "shm" filtered by provider include/exclude list, skipping
libfabric:27577:core:core:ofi_hmem_init():222<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:27577:core:core:ofi_hmem_init():222<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:27577:core:core:ofi_hmem_init():222<info> Hmem iface FI_HMEM_ZE not supported
libfabric:27577:core:core:ofi_register_provider():474<info> registering provider: shm (114.0)
libfabric:27577:core:core:ofi_register_provider():502<info> "shm" filtered by provider include/exclude list, skipping
libfabric:15330:core:core:ofi_register_provider():474<info> registering provider: ofi_hook_noop (113.20)
libfabric:15330:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:15330:mlx:core:mlx_getinfo():211<info> primary detected device: mlx5_0
libfabric:27577:core:core:ofi_register_provider():474<info> registering provider: ofi_hook_noop (113.20)
libfabric:27577:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:27577:mlx:core:mlx_getinfo():211<info> primary detected device: mlx5_0
libfabric:15330:mlx:core:mlx_getinfo():254<info> used inject size = 1024
libfabric:15330:mlx:core:mlx_getinfo():301<info> Loaded MLX version 1.5.2
libfabric:15330:mlx:core:mlx_getinfo():348<warn> MLX: spawn support 0
[0] MPI startup(): max_ch4_vnis: 1, max_reg_eps 64, enable_sep 0, enable_shared_ctxs 0, do_av_insert 0
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
libfabric:15330:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:15330:mlx:core:mlx_getinfo():211<info> primary detected device: mlx5_0
libfabric:15330:mlx:core:mlx_getinfo():254<info> used inject size = 1024
libfabric:15330:mlx:core:mlx_getinfo():301<info> Loaded MLX version 1.5.2
libfabric:15330:mlx:core:mlx_getinfo():348<warn> MLX: spawn support 0
libfabric:27577:mlx:core:mlx_getinfo():254<info> used inject size = 1024
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): detected mlx provider, set device name to "mlx"
libfabric:27577:mlx:core:mlx_getinfo():301<info> Loaded MLX version 1.5.2
libfabric:27577:mlx:core:mlx_getinfo():348<warn> MLX: spawn support 0
libfabric:15330:mlx:core:mlx_fabric_open():172<info>
libfabric:27577:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:27577:mlx:core:mlx_getinfo():211<info> primary detected device: mlx5_0
libfabric:15330:core:core:fi_fabric_():1423<info> Opened fabric: mlx
libfabric:15330:mlx:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:15330:mlx:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:27577:mlx:core:mlx_getinfo():254<info> used inject size = 1024
libfabric:27577:mlx:core:mlx_getinfo():301<info> Loaded MLX version 1.5.2
libfabric:27577:mlx:core:mlx_getinfo():348<warn> MLX: spawn support 0
libfabric:27577:mlx:core:mlx_fabric_open():172<info>
libfabric:27577:core:core:fi_fabric_():1423<info> Opened fabric: mlx
libfabric:27577:mlx:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:27577:mlx:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:15330:mlx:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:15330:mlx:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:27577:mlx:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:27577:mlx:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:15330:mlx:core:mlx_cm_getname_mlx_format():73<info> Loaded UCP address: [208]...
[0] MPI startup(): addrnamelen: 1024
[proxy:0:0@asrv0de102.corpdir.zz] pmi cmd from fd 6: cmd=put kvsname=kvs_15325_0 key=bc-0 value=mpi#BFEC32133B30061C002000B43977CC2B3200F8D74F00000000004F030088CA228E3D79F3E65421032119004D48B00FA133EF483850B00F213526DF0A0001008C1B5C7CE133EF483850E1BE233526030F008395870043083F5BC8CB39821E147DB477CC2B3200F8D74F77CCAB33004F13009005001200000000000060E7430A7F0000440830ED151800BA8C827DB477CC2B3200F8D74F77CCAB33004F130090D6000000E23B00000030E7430A7F00002588EFDE9CF7D27548EBA4A995BFD63400242E5077CCAB330092010084E23B0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000$
[proxy:0:0@asrv0de102.corpdir.zz] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:0@asrv0de102.corpdir.zz] pmi cmd from fd 6: cmd=barrier_in
libfabric:27577:mlx:core:mlx_cm_getname_mlx_format():73<info> Loaded UCP address: [208]...
[proxy:0:1@asrv0de103.corpdir.zz] pmi cmd from fd 4: cmd=put kvsname=kvs_15325_0 key=bc-1 value=mpi#1ED28FAB24B31075002000B43977CC2B3200F8D74F00000000004F0300887F6FB14FB4ECBC2821032117004D48B00FA133EF483850B00F213526DF0A0001008C1B5C7CE133EF483850E1BE233526030F0083635E0043083F64339D6D42E7C27DB477CC2B3200F8D74F77CCAB33004F13009035000C000000000000A06928657F00004408303D73427918E1797DB477CC2B3200F8D74F77CCAB33004F130090A6000000B96B000000706928657F00002588EFB77A4203815034A4A995BFD63400242E5077CCAB330092010084B96B0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000$
[proxy:0:1@asrv0de103.corpdir.zz] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:1@asrv0de103.corpdir.zz] pmi cmd from fd 4: cmd=barrier_in
[proxy:0:0@asrv0de102.corpdir.zz] PMI response: cmd=barrier_out
[proxy:0:0@asrv0de102.corpdir.zz] pmi cmd from fd 6: cmd=get kvsname=kvs_15325_0 key=bc-0
[proxy:0:0@asrv0de102.corpdir.zz] PMI response: cmd=get_result rc=0 msg=success value=mpi#BFEC32133B30061C002000B43977CC2B3200F8D74F00000000004F030088CA228E3D79F3E65421032119004D48B00FA133EF483850B00F213526DF0A0001008C1B5C7CE133EF483850E1BE233526030F008395870043083F5BC8CB39821E147DB477CC2B3200F8D74F77CCAB33004F13009005001200000000000060E7430A7F0000440830ED151800BA8C827DB477CC2B3200F8D74F77CCAB33004F130090D6000000E23B00000030E7430A7F00002588EFDE9CF7D27548EBA4A995BFD63400242E5077CCAB330092010084E23B0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000$
[proxy:0:1@asrv0de103.corpdir.zz] PMI response: cmd=barrier_out
[proxy:0:0@asrv0de102.corpdir.zz] pmi cmd from fd 6: cmd=get kvsname=kvs_15325_0 key=bc-1
[proxy:0:0@asrv0de102.corpdir.zz] PMI response: cmd=get_result rc=0 msg=success value=mpi#1ED28FAB24B31075002000B43977CC2B3200F8D74F00000000004F0300887F6FB14FB4ECBC2821032117004D48B00FA133EF483850B00F213526DF0A0001008C1B5C7CE133EF483850E1BE233526030F0083635E0043083F64339D6D42E7C27DB477CC2B3200F8D74F77CCAB33004F13009035000C000000000000A06928657F00004408303D73427918E1797DB477CC2B3200F8D74F77CCAB33004F130090A6000000B96B000000706928657F00002588EFB77A4203815034A4A995BFD63400242E5077CCAB330092010084B96B0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000$
[proxy:0:1@asrv0de103.corpdir.zz] pmi cmd from fd 4: cmd=get kvsname=kvs_15325_0 key=bc-0
[proxy:0:1@asrv0de103.corpdir.zz] PMI response: cmd=get_result rc=0 msg=success value=mpi#BFEC32133B30061C002000B43977CC2B3200F8D74F00000000004F030088CA228E3D79F3E65421032119004D48B00FA133EF483850B00F213526DF0A0001008C1B5C7CE133EF483850E1BE233526030F008395870043083F5BC8CB39821E147DB477CC2B3200F8D74F77CCAB33004F13009005001200000000000060E7430A7F0000440830ED151800BA8C827DB477CC2B3200F8D74F77CCAB33004F130090D6000000E23B00000030E7430A7F00002588EFDE9CF7D27548EBA4A995BFD63400242E5077CCAB330092010084E23B0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000$
libfabric:15330:mlx:core:mlx_av_insert():179<warn> Try to insert address #0, offset=0 (size=2) fi_addr=0x7f2000105f00
[proxy:0:1@asrv0de103.corpdir.zz] pmi cmd from fd 4: cmd=get kvsname=kvs_15325_0 key=bc-1
[proxy:0:1@asrv0de103.corpdir.zz] PMI response: cmd=get_result rc=0 msg=success value=mpi#1ED28FAB24B31075002000B43977CC2B3200F8D74F00000000004F0300887F6FB14FB4ECBC2821032117004D48B00FA133EF483850B00F213526DF0A0001008C1B5C7CE133EF483850E1BE233526030F0083635E0043083F64339D6D42E7C27DB477CC2B3200F8D74F77CCAB33004F13009035000C000000000000A06928657F00004408303D73427918E1797DB477CC2B3200F8D74F77CCAB33004F130090A6000000B96B000000706928657F00002588EFB77A4203815034A4A995BFD63400242E5077CCAB330092010084B96B0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000$
libfabric:27577:mlx:core:mlx_av_insert():179<warn> Try to insert address #0, offset=0 (size=2) fi_addr=0x7f20001e46c0
libfabric:15330:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:15330:mlx:core:mlx_av_insert():179<warn> Try to insert address #1, offset=1024 (size=2) fi_addr=0x7f2000105f00
libfabric:27577:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:27577:mlx:core:mlx_av_insert():179<warn> Try to insert address #1, offset=1024 (size=2) fi_addr=0x7f20001e46c0
libfabric:15330:mlx:core:mlx_av_insert():189<warn> address inserted
[0] MPI startup(): selected platform: icx
libfabric:27577:mlx:core:mlx_av_insert():189<warn> address inserted
[1] MPI startup(): selected platform: icx
[0] MPI startup(): File "/tuning_skx_ofi_mlx_100.dat" not found
[0] MPI startup(): Load tuning file: "/tuning_skx_ofi.dat"
[0] MPI startup(): File "/tuning_skx_ofi.dat" not found
[0] MPI startup(): Looking for tuning file: "/tuning_clx-ap_ofi_mlx.dat"
[0] MPI startup(): Looking for tuning file: "/tuning_skx_ofi_mlx.dat"
[0] MPI startup(): Looking for tuning file: "/tuning_generic_ofi_mlx.dat"
[0] MPI startup(): Looking for tuning file: "/tuning_clx-ap_ofi.dat"
[0] MPI startup(): Looking for tuning file: "/tuning_skx_ofi.dat"
[0] MPI startup(): Looking for tuning file: "/tuning_generic_ofi.dat"
[0] MPI startup(): File "/tuning_skx_ofi.dat" not found
[0] MPI startup(): File "" not found
[0] MPI startup(): Unable to read tuning file for ch4 level
[0] MPI startup(): File "" not found
[0] MPI startup(): Unable to read tuning file for net level
[0] MPI startup(): File "" not found
[0] MPI startup(): Unable to read tuning file for shm level
[0] MPI startup(): threading: mode: direct
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: app_threads: -1
[0] MPI startup(): threading: runtime: generic
[0] MPI startup(): threading: progress_threads: 0
[0] MPI startup(): threading: async_progress: 0
[0] MPI startup(): threading: lock_level: global
[0] MPI startup(): threading: num_pools: 1
[0] MPI startup(): threading: enable_sep: 0
[0] MPI startup(): threading: direct_recv: 1
[0] MPI startup(): threading: zero_op_flags: 1
[0] MPI startup(): threading: num_am_buffers: 1
[0] MPI startup(): tag bits available: 20 (TAG_UB value: 1048575)
[0] MPI startup(): source bits available: 21 (Maximal number of rank: 2097151)
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 15330 asrv0de102.corpdir.zz {0,1,2,3,4,5,6,7,8}
[0] MPI startup(): 1 27577 asrv0de103.corpdir.zz {0,1,2,3,4,5,6,7,8}
[0] MPI startup(): I_MPI_HYDRA_DEBUG=500
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_HYDRA_BSTRAP_KEEP_ALIVE=1
[0] MPI startup(): I_MPI_PIN_DOMAIN=numa
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_FABRICS=shm:ofi
[0] MPI startup(): I_MPI_OFI_PROVIDER=mlx
I'm number 1 of 2.
[0] MPI startup(): I_MPI_CBWR=2
[0] MPI startup(): I_MPI_DEBUG=500
I'm number 0 of 2.
Proc 0: Message sent to 1 processes.
Proc 1: Received data 1 from 0 with tag 0.
[proxy:0:0@asrv0de102.corpdir.zz] pmi cmd from fd 6: cmd=finalize
[proxy:0:0@asrv0de102.corpdir.zz] PMI response: cmd=finalize_ack
[proxy:0:1@asrv0de103.corpdir.zz] pmi cmd from fd 4: cmd=finalize
[proxy:0:1@asrv0de103.corpdir.zz] PMI response: cmd=finalize_ack
libfabric:27577:psm2:core:psmx2_fini():656<info>
libfabric:15330:psm2:core:psmx2_fini():656<info>
libfabric:27577:psm3:core:psmx3_fini():715<info>
libfabric:15330:psm3:core:psmx3_fini():715<info>
[mpiexec@asrv0de102.corpdir.zz] Exit codes: [asrv0de102:0] 0
[asrv0de103:0] 0

See also the attached fi_info outputs for 2021.7/2021.9/2021.10 on both systems.

With 2021.10 the problem boils down to this libfabric warning:
Can't find provider with the highest priority

What has changed here between 2021.7/2021.9 and 2021.10?
With the older libraries everything works fine on both systems.
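
For reference, the provider lists can also be checked directly with the fi_info tool that ships with each release; the attached outputs were collected along these lines (the environment setup and paths are specific to our installation, so treat them as placeholders):

# Run with the environment of the respective Intel MPI release loaded,
# so that the bundled libfabric and its fi_info are found first.
which fi_info    # should come from the libfabric bundled with that release
fi_info -l       # list the providers this libfabric build can open
fi_info -p mlx   # check whether the mlx provider requested via I_MPI_OFI_PROVIDER is available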

Best regards
Frank

3 Replies
RabiyaSK_Intel
Moderator

Hi,


Thanks for posting in Intel Communities.


We regret to inform you that RHEL 7.9 is not supported by the Intel MPI Library, and we also do not provide support for AMD processors. Please refer to the link below for the operating systems and processors supported by the Intel MPI Library:

https://www.intel.com/content/www/us/en/developer/articles/system-requirements/mpi-library-system-requirements.html


Could you please try on a supported system and reach out to us if you still face any problems?


Thanks & Regards,

Shaik Rabiya


RabiyaSK_Intel
Moderator

Hi,


We haven't heard back from you. Could you please confirm whether you are facing the same problem on a supported operating system and hardware?


Thanks & Regards,

Shaik Rabiya


RabiyaSK_Intel
Moderator

Hi,


We have not heard back from you. If you need any additional information, please post a new question in the communities, as this thread will no longer be monitored by Intel.


Thanks & Regards,

Shaik Rabiya

