I am trying to get the Intel MPI installed as part of OneAPI 2024.2.1 to work. I have a Red Hat Enterprise Linux 9.4 installation, kernel version 5.14.0-427.42.1.el9_4.x86_64, with Mellanox OFED version MLNX_OFED_LINUX-24.10-2.1.8.0-rhel9.4-ext. The ibstat utility shows that mlx5_0 is active and using the InfiniBand link layer, and I can successfully run ibping and communicate with another node on the IB fabric.
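These are roughly the checks I ran (output omitted; ibstat takes the CA name as an argument, and ibping needs a server started with -S on the remote side):
$ ibstat mlx5_0
$ ibping -S                        # on the other node
$ ibping <LID of the other node>   # from this node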
These are my installation paths for OneAPI:
ONEAPI_ROOT=/gpfs1/sw/rh9/pkgs/oneapi/2024.2.1
I_MPI_ROOT=/gpfs1/sw/rh9/pkgs/oneapi/2024.2.1/mpi/2021.13
DPL_ROOT=/gpfs1/sw/rh9/pkgs/oneapi/2024.2.1/dpl/2022.6
CMPLR_ROOT=/gpfs1/sw/rh9/pkgs/oneapi/2024.2.1/compiler/2024.2
I set up the OneAPI environment with:
$ source /gpfs1/sw/rh9/pkgs/oneapi/2024.2.1/setvars.sh
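A quick sanity check after sourcing, for completeness (output omitted):
$ which mpiicx mpirun
$ echo $I_MPI_ROOT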
I am using two very simple MPI programs to test: one is hello.c and the other does naive numerical integration. Both work. To stick to the simplest example, I compile and run hello.c with:
$ mpiicx -o hello ../hello.c
$ mpirun -np 1 ./hello
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 55105 RUNNING AT node304.cluster
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
If I do not set the fabric, or set I_MPI_FABRICS to any value other than shm, I get that seg fault. The following works fine:
$ export I_MPI_FABRICS=shm
$ mpirun -np 1 ./hello
Hello world from processor node304.cluster, rank 0 out of 1 processors
$ mpirun -np 2 ./hello
Hello world from processor node304.cluster, rank 1 out of 2 processors
Hello world from processor node304.cluster, rank 0 out of 2 processors
etc.
I have seen many different suggestions for how to get more information and how to get things working. Setting I_MPI_DEBUG adds some detail:
$ I_MPI_DEBUG=15 mpirun -np 1 ./hello 2>&1 | grep -v 'not supported'
[0] MPI startup(): Intel(R) MPI Library, Version 2021.13 Build 20240701 (id: 179630a)
[0] MPI startup(): Copyright (C) 2003-2024 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric loaded: libfabric.so.1
[0] MPI startup(): libfabric version: 1.20.1-impi
libfabric:57838:1758824622::core:core:ze_hmem_dl_init():524<warn> Failed to dlopen libze_loader.so
libfabric:57838:1758824622::core:core:ofi_hmem_init():612<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:57838:1758824622::core:core:ze_hmem_dl_init():524<warn> Failed to dlopen libze_loader.so
libfabric:57838:1758824622::core:core:ofi_hmem_init():612<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:57838:1758824622::core:core:ofi_register_provider():513<info> registering provider: verbs (120.10)
libfabric:57838:1758824622::core:core:ofi_register_provider():513<info> registering provider: verbs (120.10)
libfabric:57838:1758824622::core:core:ofi_register_provider():513<info> registering provider: tcp (120.10)
libfabric:57838:1758824622::core:core:ofi_register_provider():513<info> registering provider: shm (120.10)
libfabric:57838:1758824622::core:core:ze_hmem_dl_init():524<warn> Failed to dlopen libze_loader.so
libfabric:57838:1758824622::core:core:ofi_hmem_init():612<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:57838:1758824622::core:core:ofi_register_provider():513<info> registering provider: ofi_rxm (120.10)
libfabric:57838:1758824622::core:core:ofi_register_provider():513<info> registering provider: psm2 (120.10)
libfabric:57838:1758824622::psm3:core:fi_prov_ini():939<info> node304.cluster:rank0: build options: VERSION=706.0=7.6.0.0, HAVE_PSM3_src=1, PSM3_CUDA=0
libfabric:57838:1758824622::psm3:core:psmx3_param_get_bool():94<info> node304.cluster:rank0: variable FI_PSM3_NAME_SERVER=<not set>
libfabric:57838:1758824622::psm3:core:psmx3_param_get_bool():94<info> node304.cluster:rank0: variable FI_PSM3_TAGGED_RMA=<not set>
libfabric:57838:1758824622::psm3:core:psmx3_param_get_str():128<info> node304.cluster:rank0: read string var FI_PSM3_UUID=eae10000-0d4a-d644-a43f-0600f4993154
libfabric:57838:1758824622::psm3:core:psmx3_param_get_int():113<info> node304.cluster:rank0: read int var FI_PSM3_DELAY=0
libfabric:57838:1758824622::psm3:core:psmx3_param_get_int():109<info> node304.cluster:rank0: variable FI_PSM3_TIMEOUT=<not set>
libfabric:57838:1758824622::psm3:core:psmx3_param_get_int():109<info> node304.cluster:rank0: variable FI_PSM3_PROG_INTERVAL=<not set>
libfabric:57838:1758824622::psm3:core:psmx3_param_get_str():124<info> node304.cluster:rank0: variable FI_PSM3_PROG_AFFINITY=<not set>
libfabric:57838:1758824622::psm3:core:psmx3_param_get_int():113<info> node304.cluster:rank0: read int var FI_PSM3_INJECT_SIZE=32768
libfabric:57838:1758824622::psm3:core:psmx3_param_get_int():113<info> node304.cluster:rank0: read int var FI_PSM3_LOCK_LEVEL=0
libfabric:57838:1758824622::psm3:core:psmx3_param_get_bool():94<info> node304.cluster:rank0: variable FI_PSM3_LAZY_CONN=<not set>
libfabric:57838:1758824622::psm3:core:psmx3_param_get_int():109<info> node304.cluster:rank0: variable FI_PSM3_CONN_TIMEOUT=<not set>
libfabric:57838:1758824622::psm3:core:psmx3_param_get_bool():94<info> node304.cluster:rank0: variable FI_PSM3_DISCONNECT=<not set>
libfabric:57838:1758824622::psm3:core:psmx3_param_get_str():124<info> node304.cluster:rank0: variable FI_PSM3_TAG_LAYOUT=<not set>
libfabric:57838:1758824622::psm3:core:psmx3_param_get_bool():94<info> node304.cluster:rank0: variable FI_PSM3_YIELD_MODE=<not set>
libfabric:57838:1758824622::core:core:ofi_register_provider():513<info> registering provider: psm3 (706.0)
libfabric:57838:1758824622::core:core:ofi_register_provider():513<info> registering provider: mlx (1.4)
libfabric:57838:1758824622::core:core:ofi_register_provider():513<info> registering provider: ofi_hook_noop (120.10)
libfabric:57838:1758824622::core:core:ofi_register_provider():513<info> registering provider: off_coll (120.10)
libfabric:57838:1758824622::core:core:fi_getinfo_():1368<info> Found provider with the highest priority psm2, must_use_util_prov = 0
libfabric:57838:1758824622::core:core:fi_getinfo_():1437<info> Start regular provider search because provider with the highest priority psm2 can not be initialized
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 57838 RUNNING AT node304.cluster
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
So, judging from the "registering provider" lines above, providers are being detected. I tried changing the provider by running:
$ I_MPI_DEBUG=15 FI_PROVIDER=mlx mpirun -np 1 ./hello
[ . . . . ]
libfabric:58832:1758825483::core:core:ofi_register_provider():513<info> registering provider: mlx (1.4)
libfabric:58832:1758825483::core:core:ofi_hmem_init():607<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:58832:1758825483::core:core:ofi_hmem_init():607<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:58832:1758825483::core:core:ofi_hmem_init():607<info> Hmem iface FI_HMEM_ZE not supported
libfabric:58832:1758825483::core:core:ofi_hmem_init():607<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:58832:1758825483::core:core:ofi_hmem_init():607<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:58832:1758825483::core:core:ofi_register_provider():513<info> registering provider: ofi_hook_noop (120.10)
libfabric:58832:1758825483::core:core:ofi_register_provider():513<info> registering provider: off_coll (120.10)
libfabric:58832:1758825483::core:core:fi_getinfo_():1368<info> Found provider with the highest priority mlx, must_use_util_prov = 0
[0] MPI startup(): max_ch4_vnis: 1, max_reg_eps 64, enable_sep 0, enable_shared_ctxs 0, do_av_insert 0
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
libfabric:58832:1758825483::core:core:fi_getinfo_():1368<info> Found provider with the highest priority mlx, must_use_util_prov = 0
[0] MPI startup(): libfabric provider: mlx
libfabric:58832:1758825483::core:core:fi_fabric_():1665<info> Opened fabric: mlx
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 58832 RUNNING AT node304.cluster
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
So it sees that I have asked for a different provider, now mlx, and it even says "Opened fabric: mlx", but it once again segfaults.
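For what it is worth, the other I_MPI_FABRICS settings I know of (ofi and shm:ofi) die the same way:
$ I_MPI_FABRICS=ofi mpirun -np 1 ./hello
$ I_MPI_FABRICS=shm:ofi mpirun -np 1 ./hello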
As mentioned above, this fails with any fabric specified other than shm. I will also note that the same binary produced by the mpiicx command above runs fine on nodes without an InfiniBand card:
$ mpirun ./hello
Hello world from processor node253.cluster, rank 1 out of 4 processors
Hello world from processor node253.cluster, rank 0 out of 4 processors
Hello world from processor node254.cluster, rank 2 out of 4 processors
Hello world from processor node254.cluster, rank 3 out of 4 processors
This suggests to me that the source of the problem is something specific to the IB fabric providers, and/or to how the software I have installed interacts (or fails to interact) with Intel MPI.
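In case it is useful, I can also post the provider list that libfabric itself reports on this node, e.g. from the fi_info utility (it ships somewhere under the OneAPI tree, I believe):
$ fi_info -l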
Can someone help me both understand what the problem is, and figure out how to resolve it?
Thanks in advance, -- bennet
Hi,
Thanks for the detailed issue report!
Can you please try reproducing the issue with a newer Intel MPI version, e.g. 2021.16 or 2021.16.1?
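For example, if a newer toolkit is installed alongside the current one (the path below is only an illustration; adjust it to your site):
$ source /path/to/newer/oneapi/setvars.sh
$ echo $I_MPI_ROOT                  # should now point at an mpi/2021.16 directory
$ mpiicx -o hello ../hello.c
$ mpirun -np 1 ./hello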
Also, other than I_MPI_DEBUG, do you have any other I_MPI_* environment variables defined?
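For example, something like
$ env | grep -E '^(I_MPI|FI)_'
would show anything that might be steering the fabric or provider selection.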
Finally, it would be useful to get a backtrace of the process that is segfaulting. For example, you could enable core file generation, run gdb </path/to/executable_that_crashes> </path/to/core_file>, and type the 'bt' command to generate the backtrace.
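For example, assuming bash and that the core file lands in the working directory (the exact file name depends on your kernel.core_pattern; on RHEL 9 the core may instead be captured by systemd-coredump, in which case 'coredumpctl gdb' gets you to the same prompt):
$ ulimit -c unlimited
$ mpirun -np 1 ./hello
$ gdb ./hello core.<pid>
(gdb) bt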
Best regards,
Sergey
