Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

rank18.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory

paul312
Beginner

I had been running CentOS 7.9 and, with its end of life approaching, I have now changed the OS to Rocky Linux 9.2. I installed the Intel oneAPI environment with dnf, following the instructions on the Intel website: https://www.intel.com/content/www/us/en/docs/oneapi/installation-guide-linux/2023-2/yum-dnf-zypper.html#GUID-BC7ED324-35F5-4EB0-8180-21991E14C07B

 

I have used the commands 

dnf install intel-basekit

dnf install intel-hpckit

to install oneAPI.

 

I have copied the test file "test.f90" from /opt/intel/oneapi/mpi/latest/test/test.f90 to a new directory and have set up the environment variables using source /opt/intel/oneapi/setvars.sh.
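
For completeness, the setup amounts to the following (assuming the default /opt/intel/oneapi install location used by the dnf packages; "/data/test_mpi" is simply the working directory that appears in the runs below):

cp /opt/intel/oneapi/mpi/latest/test/test.f90 /data/test_mpi/
cd /data/test_mpi
source /opt/intel/oneapi/setvars.sh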

Compiling test.f90 with "mpiifort -o testf90 test.f90" works without error, but when I attempt to run the binary, I receive the memory allocation errors shown below:

mpiexec.hydra -n 32 ./testf90 

muon:rank18.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory

muon:rank18.testf90: Unable to allocate UD send buffer pool

Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:

MPIR_Init_thread(176)........: 

MPID_Init(1548)..............: 

MPIDI_OFI_mpi_init_hook(1632): 

create_vni_context(2208).....: OFI endpoint open failed (ofi_init.c:2208:create_vni_context:Invalid argument)

 

There is plenty of memory in the system, so it must be a software problem.

Strangely enough, if I run the same command as root, the program works normally:

 

root@muon:/data/test_mpi>mpiexec.hydra -n 32 ./testf90 

 Hello world: rank            0  of           32  running on muon
 Hello world: rank            1  of           32  running on muon
 Hello world: rank            2  of           32  running on muon
 Hello world: rank            3  of           32  running on muon
 Hello world: rank            4  of           32  running on muon
 Hello world: rank            5  of           32  running on muon
 Hello world: rank            6  of           32  running on muon
 Hello world: rank            7  of           32  running on muon
 Hello world: rank            8  of           32  running on muon
 Hello world: rank            9  of           32  running on muon
 Hello world: rank           10  of           32  running on muon
 Hello world: rank           11  of           32  running on muon
 Hello world: rank           12  of           32  running on muon
 Hello world: rank           13  of           32  running on muon
 Hello world: rank           14  of           32  running on muon
 Hello world: rank           15  of           32  running on muon
 Hello world: rank           16  of           32  running on muon
 Hello world: rank           17  of           32  running on muon
 Hello world: rank           18  of           32  running on muon
 Hello world: rank           19  of           32  running on muon
 Hello world: rank           20  of           32  running on muon
 Hello world: rank           21  of           32  running on muon
 Hello world: rank           22  of           32  running on muon
 Hello world: rank           23  of           32  running on muon
 Hello world: rank           24  of           32  running on muon
 Hello world: rank           25  of           32  running on muon
 Hello world: rank           26  of           32  running on muon
 Hello world: rank           27  of           32  running on muon
 Hello world: rank           28  of           32  running on muon
 Hello world: rank           29  of           32  running on muon
 Hello world: rank           30  of           32  running on muon
 Hello world: rank           31  of           32  running on muon

 

Naturally, the test program is not what I ultimately want to run: I am getting similar messages from VASP, a large MPI-based ab initio code that I need for my research. Any idea what may be happening?
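
(A general note, not confirmed as the cause here: RDMA memory-registration failures that affect ordinary users but not root are often tied to the locked-memory limit, RLIMIT_MEMLOCK. A quick check looks like this:)

# locked-memory limit for the current shell; root is typically unlimited
ulimit -l
# per-user limits are commonly configured in these files
grep -r memlock /etc/security/limits.conf /etc/security/limits.d/ 2>/dev/null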

paul312
Beginner

Here is the same run with a more detailed debug report:

 

mpirun -genv I_MPI_DEBUG=15  -n 32  ./testf90 

[0] MPI startup(): Intel(R) MPI Library, Version 2021.10  Build 20230619 (id: c2e19c2f3e)

[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation.  All rights reserved.

[0] MPI startup(): library kind: release

[0] MPI startup(): shm segment size (128 MB per rank) * (32 local ranks) = 4125 MB total

[0] MPI startup(): libfabric loaded: libfabric.so.1 

[0] MPI startup(): libfabric version: 1.18.0-impi

libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_CUDA not supported

libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_ROCR not supported

libfabric:164812:1697191819::core:core:ze_hmem_dl_init():497<warn> Failed to dlopen libze_loader.so

libfabric:164812:1697191819::core:core:ofi_hmem_init():421<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available

libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_NEURON not supported

libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_SYNAPSEAI not supported

libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_CUDA not supported

libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_ROCR not supported

libfabric:164812:1697191819::core:core:ze_hmem_dl_init():497<warn> Failed to dlopen libze_loader.so

libfabric:164812:1697191819::core:core:ofi_hmem_init():421<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available

libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_NEURON not supported

libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_SYNAPSEAI not supported

libfabric:164812:1697191819::core:core:ofi_register_provider():476<info> registering provider: verbs (118.0)

libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_CUDA not supported

libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_ROCR not supported

libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_ZE not supported

libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_NEURON not supported

libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_SYNAPSEAI not supported

libfabric:164812:1697191819::core:core:ofi_register_provider():476<info> registering provider: verbs (118.0)

libfabric:164812:1697191819::core:core:ofi_register_provider():476<info> registering provider: tcp (118.0)

libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_CUDA not supported

libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_ROCR not supported

libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_ZE not supported

libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_NEURON not supported

libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_SYNAPSEAI not supported

libfabric:164812:1697191819::core:core:ofi_register_provider():476<info> registering provider: shm (118.0)

libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_CUDA not supported

libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_ROCR not supported

libfabric:164812:1697191819::core:core:ze_hmem_dl_init():497<warn> Failed to dlopen libze_loader.so

libfabric:164812:1697191819::core:core:ofi_hmem_init():421<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available

libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_NEURON not supported

libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_SYNAPSEAI not supported

libfabric:164812:1697191819::core:core:ofi_register_provider():476<info> registering provider: ofi_rxm (118.0)

libfabric:164812:1697191819::psm3:core:fi_prov_ini():921<info> muon:rank0: build options: VERSION=705.0=7.5.0.0, HAVE_PSM3_src=1, PSM3_CUDA=0

libfabric:164812:1697191819::psm3:core:psmx3_param_get_bool():94<info> muon:rank0: variable FI_PSM3_NAME_SERVER=<not set>

libfabric:164812:1697191819::psm3:core:psmx3_param_get_bool():94<info> muon:rank0: variable FI_PSM3_TAGGED_RMA=<not set>

libfabric:164812:1697191819::psm3:core:psmx3_param_get_str():128<info> muon:rank0: read string var FI_PSM3_UUID=c8830200-67d4-7343-9607-0600ca844f35

libfabric:164812:1697191819::psm3:core:psmx3_param_get_int():113<info> muon:rank0: read int var FI_PSM3_DELAY=0

libfabric:164812:1697191819::psm3:core:psmx3_param_get_int():109<info> muon:rank0: variable FI_PSM3_TIMEOUT=<not set>

libfabric:164812:1697191819::psm3:core:psmx3_param_get_int():109<info> muon:rank0: variable FI_PSM3_PROG_INTERVAL=<not set>

libfabric:164812:1697191819::psm3:core:psmx3_param_get_str():124<info> muon:rank0: variable FI_PSM3_PROG_AFFINITY=<not set>

libfabric:164812:1697191819::psm3:core:psmx3_param_get_int():113<info> muon:rank0: read int var FI_PSM3_INJECT_SIZE=32768

libfabric:164812:1697191819::psm3:core:psmx3_param_get_int():113<info> muon:rank0: read int var FI_PSM3_LOCK_LEVEL=0

libfabric:164812:1697191819::psm3:core:psmx3_param_get_bool():94<info> muon:rank0: variable FI_PSM3_LAZY_CONN=<not set>

libfabric:164812:1697191819::psm3:core:psmx3_param_get_int():109<info> muon:rank0: variable FI_PSM3_CONN_TIMEOUT=<not set>

libfabric:164812:1697191819::psm3:core:psmx3_param_get_bool():94<info> muon:rank0: variable FI_PSM3_DISCONNECT=<not set>

libfabric:164812:1697191819::psm3:core:psmx3_param_get_str():124<info> muon:rank0: variable FI_PSM3_TAG_LAYOUT=<not set>

libfabric:164812:1697191819::psm3:core:psmx3_param_get_bool():94<info> muon:rank0: variable FI_PSM3_YIELD_MODE=<not set>

libfabric:164812:1697191819::core:core:ofi_register_provider():476<info> registering provider: psm3 (705.0)

libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_CUDA not supported

libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_ROCR not supported

libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_ZE not supported

libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_NEURON not supported

libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_SYNAPSEAI not supported

libfabric:164812:1697191819::core:core:ofi_register_provider():476<info> registering provider: ofi_hook_noop (118.0)

libfabric:164812:1697191819::core:core:ofi_register_provider():476<info> registering provider: off_coll (118.0)

libfabric:164812:1697191819::core:core:fi_getinfo_():1352<info> Found provider with the highest priority psm3, must_use_util_prov = 0

libfabric:164812:1697191819::core:core:fi_getinfo_():1379<info> Since psm3 can be used, verbs has been skipped. To use verbs, please, set FI_PROVIDER=verbs

libfabric:164812:1697191819::core:core:fi_getinfo_():1379<info> Since psm3 can be used, tcp has been skipped. To use tcp, please, set FI_PROVIDER=tcp

libfabric:164812:1697191819::core:core:fi_getinfo_():1379<info> Since psm3 can be used, shm has been skipped. To use shm, please, set FI_PROVIDER=shm

[0] MPI startup(): max_ch4_vnis: 1, max_reg_eps 64, enable_sep 0, enable_shared_ctxs 0, do_av_insert 0

[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)

libfabric:164812:1697191819::core:core:fi_getinfo_():1352<info> Found provider with the highest priority psm3, must_use_util_prov = 0

libfabric:164812:1697191819::core:core:fi_getinfo_():1379<info> Since psm3 can be used, verbs has been skipped. To use verbs, please, set FI_PROVIDER=verbs

libfabric:164812:1697191819::core:core:fi_getinfo_():1379<info> Since psm3 can be used, tcp has been skipped. To use tcp, please, set FI_PROVIDER=tcp

libfabric:164812:1697191819::core:core:fi_getinfo_():1379<info> Since psm3 can be used, shm has been skipped. To use shm, please, set FI_PROVIDER=shm

[0] MPI startup(): libfabric provider: psm3

libfabric:164812:1697191819::core:core:fi_fabric_():1645<info> Opened fabric: IB/OPA-0xfe80000000000000

libfabric:164812:1697191819::core:core:ofi_shm_map():173<warn> shm_open failed

muon:rank21.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory

muon:rank21.testf90: Unable to allocate UD send buffer pool

muon:rank30.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory

muon:rank30.testf90: Unable to allocate UD send buffer pool

muon:rank23.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory

muon:rank23.testf90: Unable to allocate UD send buffer pool

muon:rank29.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory

muon:rank29.testf90: Unable to allocate UD send buffer pool

muon:rank6.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory

muon:rank6.testf90: Unable to allocate UD send buffer pool

muon:rank18.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory

muon:rank18.testf90: Unable to allocate UD send buffer pool

muon:rank16.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory

muon:rank16.testf90: Unable to allocate UD send buffer pool

muon:rank14.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory

muon:rank14.testf90: Unable to allocate UD send buffer pool

muon:rank1.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory

muon:rank1.testf90: Unable to allocate UD send buffer pool

Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:

MPIR_Init_thread(176)........: 

MPID_Init(1548)..............: 

MPIDI_OFI_mpi_init_hook(1632): 

create_vni_context(2208).....: OFI endpoint open failed (ofi_init.c:2208:create_vni_context:Invalid argument)
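
(As the libfabric messages above point out, a specific provider can be forced for testing, which helps to tell whether the failure is tied to the psm3/verbs path; FI_PROVIDER is the libfabric variable named in the log, and I_MPI_OFI_PROVIDER is the equivalent Intel MPI setting:)

# try the tcp provider instead of psm3 (slow, but bypasses the InfiniBand path)
FI_PROVIDER=tcp mpirun -genv I_MPI_DEBUG=15 -n 32 ./testf90
# or select the provider through Intel MPI
mpirun -genv I_MPI_OFI_PROVIDER=tcp -genv I_MPI_DEBUG=15 -n 32 ./testf90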

paul312
Beginner

Problem solved. The cause was an older version of the Mellanox driver; updating to version 5.8-3.7.0 fixed it. The error messages did not obviously point to a driver issue, but a similar post on this board gave the necessary hints.
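
(For anyone hitting the same thing: if the Mellanox OFED stack (MLNX_OFED) is installed, the installed driver version and adapter state can be checked with, for example:)

# prints the MLNX_OFED version string
ofed_info -s
# shows port and firmware state for the mlx5_0 adapter
ibstat mlx5_0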

VeenaJ_Intel
Moderator

Hi,

 

Thanks for posting in Intel communities!

 

Glad to know that your issue is resolved. If you need any additional information, please post a new question, as this thread will no longer be monitored by Intel.

 

Regards,

Veena

 
