I had been running CentOS 7.9 and, with its end of life approaching, I have now changed the OS to Rocky Linux 9.2. I installed the Intel oneAPI environment with dnf, following the instructions on the Intel website: https://www.intel.com/content/www/us/en/docs/oneapi/installation-guide-linux/2023-2/yum-dnf-zypper.html#GUID-BC7ED324-35F5-4EB0-8180-21991E14C07B
To install oneAPI, I used the commands
dnf install intel-basekit
dnf install intel-hpckit
I copied the test file "test.f90" from /opt/intel/oneapi/mpi/latest/test/test.f90 to a new directory and set up the environment variables with source /opt/intel/oneapi/setvars.sh.
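For clarity, the setup steps were roughly the following (default install prefix /opt/intel/oneapi; the working-directory name is just an example):
# copy the bundled MPI test program to a fresh working directory
mkdir -p /data/test_mpi && cd /data/test_mpi
cp /opt/intel/oneapi/mpi/latest/test/test.f90 .
# load the oneAPI environment (compilers, Intel MPI, libfabric)
source /opt/intel/oneapi/setvars.sh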
Compiling test.f90 with "mpiifort -o testf90 test.f90" works without error, but when I attempt to run the binary, I receive the out-of-memory error messages shown below:
mpiexec.hydra -n 32 ./testf90
muon:rank18.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
muon:rank18.testf90: Unable to allocate UD send buffer pool
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(176)........:
MPID_Init(1548)..............:
MPIDI_OFI_mpi_init_hook(1632):
create_vni_context(2208).....: OFI endpoint open failed (ofi_init.c:2208:create_vni_context:Invalid argument)
There is plenty of memory in the system, so it must be a software problem.
Strangely enough, if I run the same command as root, the program works normally:
root@muon:/data/test_mpi>mpiexec.hydra -n 32 ./testf90
Hello world: rank 0 of 32 running on muon
Hello world: rank 1 of 32 running on muon
Hello world: rank 2 of 32 running on muon
Hello world: rank 3 of 32 running on muon
Hello world: rank 4 of 32 running on muon
Hello world: rank 5 of 32 running on muon
Hello world: rank 6 of 32 running on muon
Hello world: rank 7 of 32 running on muon
Hello world: rank 8 of 32 running on muon
Hello world: rank 9 of 32 running on muon
Hello world: rank 10 of 32 running on muon
Hello world: rank 11 of 32 running on muon
Hello world: rank 12 of 32 running on muon
Hello world: rank 13 of 32 running on muon
Hello world: rank 14 of 32 running on muon
Hello world: rank 15 of 32 running on muon
Hello world: rank 16 of 32 running on muon
Hello world: rank 17 of 32 running on muon
Hello world: rank 18 of 32 running on muon
Hello world: rank 19 of 32 running on muon
Hello world: rank 20 of 32 running on muon
Hello world: rank 21 of 32 running on muon
Hello world: rank 22 of 32 running on muon
Hello world: rank 23 of 32 running on muon
Hello world: rank 24 of 32 running on muon
Hello world: rank 25 of 32 running on muon
Hello world: rank 26 of 32 running on muon
Hello world: rank 27 of 32 running on muon
Hello world: rank 28 of 32 running on muon
Hello world: rank 29 of 32 running on muon
Hello world: rank 30 of 32 running on muon
Hello world: rank 31 of 32 running on muon
Naturally, I do not want to run just the test program; I am getting similar messages from VASP, a large MPI-based ab-initio code that I need for my research. Any idea as to what may be happening?
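Since the failure is a memory-registration error on mlx5_0 that only appears for a non-root user, one thing that may be worth ruling out is the locked-memory limit that RDMA drivers register buffers against. A quick comparison along these lines (standard pam_limits setup assumed) would show whether RLIMIT_MEMLOCK differs between root and a normal user:
# locked-memory limit for the current shell; RDMA memory registration counts against this
ulimit -l
# persistent per-user limits on RHEL-like systems usually live here
grep -r memlock /etc/security/limits.conf /etc/security/limits.d/ 2>/dev/null
# a typical setting on InfiniBand nodes is
#   *  soft  memlock  unlimited
#   *  hard  memlock  unlimited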
Here is the same run with a more detailed debug report:
mpirun -genv I_MPI_DEBUG=15 -n 32 ./testf90
[0] MPI startup(): Intel(R) MPI Library, Version 2021.10 Build 20230619 (id: c2e19c2f3e)
[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): shm segment size (128 MB per rank) * (32 local ranks) = 4125 MB total
[0] MPI startup(): libfabric loaded: libfabric.so.1
[0] MPI startup(): libfabric version: 1.18.0-impi
libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:164812:1697191819::core:core:ze_hmem_dl_init():497<warn> Failed to dlopen libze_loader.so
libfabric:164812:1697191819::core:core:ofi_hmem_init():421<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:164812:1697191819::core:core:ze_hmem_dl_init():497<warn> Failed to dlopen libze_loader.so
libfabric:164812:1697191819::core:core:ofi_hmem_init():421<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:164812:1697191819::core:core:ofi_register_provider():476<info> registering provider: verbs (118.0)
libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_ZE not supported
libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:164812:1697191819::core:core:ofi_register_provider():476<info> registering provider: verbs (118.0)
libfabric:164812:1697191819::core:core:ofi_register_provider():476<info> registering provider: tcp (118.0)
libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_ZE not supported
libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:164812:1697191819::core:core:ofi_register_provider():476<info> registering provider: shm (118.0)
libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:164812:1697191819::core:core:ze_hmem_dl_init():497<warn> Failed to dlopen libze_loader.so
libfabric:164812:1697191819::core:core:ofi_hmem_init():421<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:164812:1697191819::core:core:ofi_register_provider():476<info> registering provider: ofi_rxm (118.0)
libfabric:164812:1697191819::psm3:core:fi_prov_ini():921<info> muon:rank0: build options: VERSION=705.0=7.5.0.0, HAVE_PSM3_src=1, PSM3_CUDA=0
libfabric:164812:1697191819::psm3:core:psmx3_param_get_bool():94<info> muon:rank0: variable FI_PSM3_NAME_SERVER=<not set>
libfabric:164812:1697191819::psm3:core:psmx3_param_get_bool():94<info> muon:rank0: variable FI_PSM3_TAGGED_RMA=<not set>
libfabric:164812:1697191819::psm3:core:psmx3_param_get_str():128<info> muon:rank0: read string var FI_PSM3_UUID=c8830200-67d4-7343-9607-0600ca844f35
libfabric:164812:1697191819::psm3:core:psmx3_param_get_int():113<info> muon:rank0: read int var FI_PSM3_DELAY=0
libfabric:164812:1697191819::psm3:core:psmx3_param_get_int():109<info> muon:rank0: variable FI_PSM3_TIMEOUT=<not set>
libfabric:164812:1697191819::psm3:core:psmx3_param_get_int():109<info> muon:rank0: variable FI_PSM3_PROG_INTERVAL=<not set>
libfabric:164812:1697191819::psm3:core:psmx3_param_get_str():124<info> muon:rank0: variable FI_PSM3_PROG_AFFINITY=<not set>
libfabric:164812:1697191819::psm3:core:psmx3_param_get_int():113<info> muon:rank0: read int var FI_PSM3_INJECT_SIZE=32768
libfabric:164812:1697191819::psm3:core:psmx3_param_get_int():113<info> muon:rank0: read int var FI_PSM3_LOCK_LEVEL=0
libfabric:164812:1697191819::psm3:core:psmx3_param_get_bool():94<info> muon:rank0: variable FI_PSM3_LAZY_CONN=<not set>
libfabric:164812:1697191819::psm3:core:psmx3_param_get_int():109<info> muon:rank0: variable FI_PSM3_CONN_TIMEOUT=<not set>
libfabric:164812:1697191819::psm3:core:psmx3_param_get_bool():94<info> muon:rank0: variable FI_PSM3_DISCONNECT=<not set>
libfabric:164812:1697191819::psm3:core:psmx3_param_get_str():124<info> muon:rank0: variable FI_PSM3_TAG_LAYOUT=<not set>
libfabric:164812:1697191819::psm3:core:psmx3_param_get_bool():94<info> muon:rank0: variable FI_PSM3_YIELD_MODE=<not set>
libfabric:164812:1697191819::core:core:ofi_register_provider():476<info> registering provider: psm3 (705.0)
libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_ZE not supported
libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:164812:1697191819::core:core:ofi_hmem_init():416<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:164812:1697191819::core:core:ofi_register_provider():476<info> registering provider: ofi_hook_noop (118.0)
libfabric:164812:1697191819::core:core:ofi_register_provider():476<info> registering provider: off_coll (118.0)
libfabric:164812:1697191819::core:core:fi_getinfo_():1352<info> Found provider with the highest priority psm3, must_use_util_prov = 0
libfabric:164812:1697191819::core:core:fi_getinfo_():1379<info> Since psm3 can be used, verbs has been skipped. To use verbs, please, set FI_PROVIDER=verbs
libfabric:164812:1697191819::core:core:fi_getinfo_():1379<info> Since psm3 can be used, tcp has been skipped. To use tcp, please, set FI_PROVIDER=tcp
libfabric:164812:1697191819::core:core:fi_getinfo_():1379<info> Since psm3 can be used, shm has been skipped. To use shm, please, set FI_PROVIDER=shm
[0] MPI startup(): max_ch4_vnis: 1, max_reg_eps 64, enable_sep 0, enable_shared_ctxs 0, do_av_insert 0
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
libfabric:164812:1697191819::core:core:fi_getinfo_():1352<info> Found provider with the highest priority psm3, must_use_util_prov = 0
libfabric:164812:1697191819::core:core:fi_getinfo_():1379<info> Since psm3 can be used, verbs has been skipped. To use verbs, please, set FI_PROVIDER=verbs
libfabric:164812:1697191819::core:core:fi_getinfo_():1379<info> Since psm3 can be used, tcp has been skipped. To use tcp, please, set FI_PROVIDER=tcp
libfabric:164812:1697191819::core:core:fi_getinfo_():1379<info> Since psm3 can be used, shm has been skipped. To use shm, please, set FI_PROVIDER=shm
[0] MPI startup(): libfabric provider: psm3
libfabric:164812:1697191819::core:core:fi_fabric_():1645<info> Opened fabric: IB/OPA-0xfe80000000000000
libfabric:164812:1697191819::core:core:ofi_shm_map():173<warn> shm_open failed
muon:rank21.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
muon:rank21.testf90: Unable to allocate UD send buffer pool
muon:rank30.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
muon:rank30.testf90: Unable to allocate UD send buffer pool
muon:rank23.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
muon:rank23.testf90: Unable to allocate UD send buffer pool
muon:rank29.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
muon:rank29.testf90: Unable to allocate UD send buffer pool
muon:rank6.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
muon:rank6.testf90: Unable to allocate UD send buffer pool
muon:rank18.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
muon:rank18.testf90: Unable to allocate UD send buffer pool
muon:rank16.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
muon:rank16.testf90: Unable to allocate UD send buffer pool
muon:rank14.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
muon:rank14.testf90: Unable to allocate UD send buffer pool
muon:rank1.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
muon:rank1.testf90: Unable to allocate UD send buffer pool
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(176)........:
MPID_Init(1548)..............:
MPIDI_OFI_mpi_init_hook(1632):
create_vni_context(2208).....: OFI endpoint open failed (ofi_init.c:2208:create_vni_context:Invalid argument)
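As a further diagnostic, the log itself suggests how to force a different libfabric provider. On a single node, bypassing the psm3/mlx5 path entirely should narrow down whether provider selection is the culprit (FI_PROVIDER and I_MPI_FABRICS as documented for libfabric and Intel MPI 2021.x; this is a workaround check, not a fix):
# force the TCP provider instead of psm3, as suggested in the log above
FI_PROVIDER=tcp mpirun -genv I_MPI_DEBUG=15 -n 32 ./testf90
# or restrict Intel MPI to shared memory only, which is enough for a single-node run
I_MPI_FABRICS=shm mpirun -n 32 ./testf90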
Problem solved. The cause was an older version of the Mellanox driver; updating to version 5.8-3.7.0 fixed it. The error messages didn't really point to a driver issue, but a similar post on this board gave the necessary hints.
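For anyone else hitting the same messages, the installed Mellanox driver version can be checked with something like the following (assuming the MLNX_OFED stack is in use rather than the inbox rdma-core packages):
# MLNX_OFED prints its release string directly
ofed_info -s
# the mlx5 kernel module also carries a version field when it comes from MLNX_OFED
modinfo mlx5_core | grep -i '^version'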
Hi,
Thanks for posting in Intel communities!
Glad to know that your issue is resolved. If you need any additional information, please post a new question, as this thread will no longer be monitored by Intel.
Regards,
Veena