<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: MPI segfaults with fabric other than shm in Intel® MPI Library</title>
    <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-segfaults-with-fabric-other-than-shm/m-p/1719127#M12225</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks for the detailed issue report!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Can you please try reproducing the issue with a newer Intel MPI version, e.g. 2021.16 or 2021.16.1?&lt;/P&gt;&lt;P&gt;Also, other than&amp;nbsp;&lt;SPAN&gt;I_MPI_DEBUG, do you have any other I_MPI_* environment variables defined?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Finally, it would be useful to get a backtrace of the process that is segfaulting: e.g., you could enable core file generation, run gdb &amp;lt;/path/to/executable_that_crashes&amp;gt; &amp;lt;/path/to/core_file&amp;gt;, and type the 'bt' command to generate the backtrace.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Best regards,&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Sergey&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Thu, 25 Sep 2025 20:15:05 GMT</pubDate>
    <dc:creator>Sergey_K_Intel3</dc:creator>
    <dc:date>2025-09-25T20:15:05Z</dc:date>
    <item>
      <title>MPI segfaults with fabric other than shm</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-segfaults-with-fabric-other-than-shm/m-p/1719120#M12224</link>
      <description>&lt;P&gt;I am trying to get the Intel MPI Library installed as part of oneAPI 2024.2.1 to work. I have a Red Hat Enterprise Linux 9.4 installation, kernel version 5.14.0-427.42.1.el9_4.x86_64, with Mellanox OFED version MLNX_OFED_LINUX-24.10-2.1.8.0-rhel9.4-ext. The ibstat utility shows that mlx5_0 is active and using the InfiniBand link layer. I can successfully run ibping and communicate with another node on the IB fabric.&lt;BR /&gt;&lt;BR /&gt;These are my installation paths for oneAPI:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;ONEAPI_ROOT=/gpfs1/sw/rh9/pkgs/oneapi/2024.2.1
I_MPI_ROOT=/gpfs1/sw/rh9/pkgs/oneapi/2024.2.1/mpi/2021.13
DPL_ROOT=/gpfs1/sw/rh9/pkgs/oneapi/2024.2.1/dpl/2022.6
CMPLR_ROOT=/gpfs1/sw/rh9/pkgs/oneapi/2024.2.1/compiler/2024.2&lt;/LI-CODE&gt;&lt;P&gt;I set up oneAPI with 'source /gpfs1/sw/rh9/pkgs/oneapi/2024.2.1/setvars.sh'.&lt;/P&gt;&lt;P&gt;I am using two very simple MPI programs to test: one is hello.c (see the P.S. at the end of this post for the source) and the other does naive numerical integration. Both work. To stick to the simplest example, I compile and run hello.c with:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;$ mpiicx -o hello ../hello.c
$ mpirun -np 1 ./hello

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 55105 RUNNING AT node304.cluster
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================&lt;/LI-CODE&gt;&lt;P&gt;If I do not set the fabric provider, or set it to any value other than shm, I get that segfault. Running this works fine:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;$ export I_MPI_FABRICS=shm
$ mpirun -np 1 ./hello
Hello world from processor node304.cluster, rank 0 out of 1 processors

$ mpirun -np 2 ./hello
Hello world from processor node304.cluster, rank 1 out of 2 processors
Hello world from processor node304.cluster, rank 0 out of 2 processors
etc.&lt;/LI-CODE&gt;&lt;P&gt;I have seen many different suggestions for how to get more information and to get things working. Setting I_MPI_DEBUG shows some more detail:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;$ I_MPI_DEBUG=15 mpirun -np 1 ./hello 2&amp;gt;&amp;amp;1 | grep -v 'not supported'
[0] MPI startup(): Intel(R) MPI Library, Version 2021.13  Build 20240701 (id: 179630a)
[0] MPI startup(): Copyright (C) 2003-2024 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric loaded: libfabric.so.1 
[0] MPI startup(): libfabric version: 1.20.1-impi
libfabric:57838:1758824622::core:core:ze_hmem_dl_init():524&amp;lt;warn&amp;gt; Failed to dlopen libze_loader.so
libfabric:57838:1758824622::core:core:ofi_hmem_init():612&amp;lt;warn&amp;gt; Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:57838:1758824622::core:core:ze_hmem_dl_init():524&amp;lt;warn&amp;gt; Failed to dlopen libze_loader.so
libfabric:57838:1758824622::core:core:ofi_hmem_init():612&amp;lt;warn&amp;gt; Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:57838:1758824622::core:core:ofi_register_provider():513&amp;lt;info&amp;gt; registering provider: verbs (120.10)
libfabric:57838:1758824622::core:core:ofi_register_provider():513&amp;lt;info&amp;gt; registering provider: verbs (120.10)
libfabric:57838:1758824622::core:core:ofi_register_provider():513&amp;lt;info&amp;gt; registering provider: tcp (120.10)
libfabric:57838:1758824622::core:core:ofi_register_provider():513&amp;lt;info&amp;gt; registering provider: shm (120.10)
libfabric:57838:1758824622::core:core:ze_hmem_dl_init():524&amp;lt;warn&amp;gt; Failed to dlopen libze_loader.so
libfabric:57838:1758824622::core:core:ofi_hmem_init():612&amp;lt;warn&amp;gt; Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:57838:1758824622::core:core:ofi_register_provider():513&amp;lt;info&amp;gt; registering provider: ofi_rxm (120.10)
libfabric:57838:1758824622::core:core:ofi_register_provider():513&amp;lt;info&amp;gt; registering provider: psm2 (120.10)
libfabric:57838:1758824622::psm3:core:fi_prov_ini():939&amp;lt;info&amp;gt; node304.cluster:rank0: build options: VERSION=706.0=7.6.0.0, HAVE_PSM3_src=1, PSM3_CUDA=0
libfabric:57838:1758824622::psm3:core:psmx3_param_get_bool():94&amp;lt;info&amp;gt; node304.cluster:rank0: variable FI_PSM3_NAME_SERVER=&amp;lt;not set&amp;gt;
libfabric:57838:1758824622::psm3:core:psmx3_param_get_bool():94&amp;lt;info&amp;gt; node304.cluster:rank0: variable FI_PSM3_TAGGED_RMA=&amp;lt;not set&amp;gt;
libfabric:57838:1758824622::psm3:core:psmx3_param_get_str():128&amp;lt;info&amp;gt; node304.cluster:rank0: read string var FI_PSM3_UUID=eae10000-0d4a-d644-a43f-0600f4993154
libfabric:57838:1758824622::psm3:core:psmx3_param_get_int():113&amp;lt;info&amp;gt; node304.cluster:rank0: read int var FI_PSM3_DELAY=0
libfabric:57838:1758824622::psm3:core:psmx3_param_get_int():109&amp;lt;info&amp;gt; node304.cluster:rank0: variable FI_PSM3_TIMEOUT=&amp;lt;not set&amp;gt;
libfabric:57838:1758824622::psm3:core:psmx3_param_get_int():109&amp;lt;info&amp;gt; node304.cluster:rank0: variable FI_PSM3_PROG_INTERVAL=&amp;lt;not set&amp;gt;
libfabric:57838:1758824622::psm3:core:psmx3_param_get_str():124&amp;lt;info&amp;gt; node304.cluster:rank0: variable FI_PSM3_PROG_AFFINITY=&amp;lt;not set&amp;gt;
libfabric:57838:1758824622::psm3:core:psmx3_param_get_int():113&amp;lt;info&amp;gt; node304.cluster:rank0: read int var FI_PSM3_INJECT_SIZE=32768
libfabric:57838:1758824622::psm3:core:psmx3_param_get_int():113&amp;lt;info&amp;gt; node304.cluster:rank0: read int var FI_PSM3_LOCK_LEVEL=0
libfabric:57838:1758824622::psm3:core:psmx3_param_get_bool():94&amp;lt;info&amp;gt; node304.cluster:rank0: variable FI_PSM3_LAZY_CONN=&amp;lt;not set&amp;gt;
libfabric:57838:1758824622::psm3:core:psmx3_param_get_int():109&amp;lt;info&amp;gt; node304.cluster:rank0: variable FI_PSM3_CONN_TIMEOUT=&amp;lt;not set&amp;gt;
libfabric:57838:1758824622::psm3:core:psmx3_param_get_bool():94&amp;lt;info&amp;gt; node304.cluster:rank0: variable FI_PSM3_DISCONNECT=&amp;lt;not set&amp;gt;
libfabric:57838:1758824622::psm3:core:psmx3_param_get_str():124&amp;lt;info&amp;gt; node304.cluster:rank0: variable FI_PSM3_TAG_LAYOUT=&amp;lt;not set&amp;gt;
libfabric:57838:1758824622::psm3:core:psmx3_param_get_bool():94&amp;lt;info&amp;gt; node304.cluster:rank0: variable FI_PSM3_YIELD_MODE=&amp;lt;not set&amp;gt;
libfabric:57838:1758824622::core:core:ofi_register_provider():513&amp;lt;info&amp;gt; registering provider: psm3 (706.0)
libfabric:57838:1758824622::core:core:ofi_register_provider():513&amp;lt;info&amp;gt; registering provider: mlx (1.4)
libfabric:57838:1758824622::core:core:ofi_register_provider():513&amp;lt;info&amp;gt; registering provider: ofi_hook_noop (120.10)
libfabric:57838:1758824622::core:core:ofi_register_provider():513&amp;lt;info&amp;gt; registering provider: off_coll (120.10)
libfabric:57838:1758824622::core:core:fi_getinfo_():1368&amp;lt;info&amp;gt; Found provider with the highest priority psm2, must_use_util_prov = 0
libfabric:57838:1758824622::core:core:fi_getinfo_():1437&amp;lt;info&amp;gt; Start regular provider search because provider with the highest priority psm2 can not be initialized

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 57838 RUNNING AT node304.cluster
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================&lt;/LI-CODE&gt;&lt;P&gt;So, judging from the 'registering provider' lines above, providers are being detected. I tried changing the provider by running:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;$ I_MPI_DEBUG=15 FI_PROVIDER=mlx mpirun -np 1 ./hello
    [ . . . . ]
libfabric:58832:1758825483::core:core:ofi_register_provider():513&amp;lt;info&amp;gt; registering provider: mlx (1.4)
libfabric:58832:1758825483::core:core:ofi_hmem_init():607&amp;lt;info&amp;gt; Hmem iface FI_HMEM_CUDA not supported
libfabric:58832:1758825483::core:core:ofi_hmem_init():607&amp;lt;info&amp;gt; Hmem iface FI_HMEM_ROCR not supported
libfabric:58832:1758825483::core:core:ofi_hmem_init():607&amp;lt;info&amp;gt; Hmem iface FI_HMEM_ZE not supported
libfabric:58832:1758825483::core:core:ofi_hmem_init():607&amp;lt;info&amp;gt; Hmem iface FI_HMEM_NEURON not supported
libfabric:58832:1758825483::core:core:ofi_hmem_init():607&amp;lt;info&amp;gt; Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:58832:1758825483::core:core:ofi_register_provider():513&amp;lt;info&amp;gt; registering provider: ofi_hook_noop (120.10)
libfabric:58832:1758825483::core:core:ofi_register_provider():513&amp;lt;info&amp;gt; registering provider: off_coll (120.10)
libfabric:58832:1758825483::core:core:fi_getinfo_():1368&amp;lt;info&amp;gt; Found provider with the highest priority mlx, must_use_util_prov = 0
[0] MPI startup(): max_ch4_vnis: 1, max_reg_eps 64, enable_sep 0, enable_shared_ctxs 0, do_av_insert 0
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
libfabric:58832:1758825483::core:core:fi_getinfo_():1368&amp;lt;info&amp;gt; Found provider with the highest priority mlx, must_use_util_prov = 0
[0] MPI startup(): libfabric provider: mlx
libfabric:58832:1758825483::core:core:fi_fabric_():1665&amp;lt;info&amp;gt; Opened fabric: mlx

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 58832 RUNNING AT node304.cluster
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================&lt;/LI-CODE&gt;&lt;P&gt;So, it sees that I've asked for a different provider, now using mlx, and it says 'Opened fabric: mlx', but it once again segfaults.&lt;/P&gt;&lt;P&gt;As mentioned above, this fails with any fabric specified other than shm. I will also observe that the same binary produced by the above mpiicx command runs fine on a node &lt;STRONG&gt;without&lt;/STRONG&gt; an InfiniBand card:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;$ mpirun ./hello
Hello world from processor node253.cluster, rank 1 out of 4 processors
Hello world from processor node253.cluster, rank 0 out of 4 processors
Hello world from processor node254.cluster, rank 2 out of 4 processors
Hello world from processor node254.cluster, rank 3 out of 4 processors&lt;/LI-CODE&gt;&lt;P&gt;This suggests to me that the source of the problem is specifically how the IB fabric providers and/or the software I have installed interact (or fail to interact) with Intel MPI.&lt;/P&gt;&lt;P&gt;Can someone help me both understand what the problem is and figure out how to resolve it?&lt;/P&gt;&lt;P&gt;Thanks in advance, -- bennet&lt;/P&gt;
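&lt;P&gt;P.S. Since I did not paste the source: hello.c is essentially the textbook MPI hello world. A minimal version consistent with the output above (a reconstruction, not the exact file) is:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;/* hello.c -- minimal MPI hello world (reconstructed sketch) */
#include &amp;lt;mpi.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;

int main(int argc, char **argv)
{
    int world_size, world_rank, name_len;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&amp;amp;argc, &amp;amp;argv);                           /* start the MPI runtime */
    MPI_Comm_size(MPI_COMM_WORLD, &amp;amp;world_size);         /* total number of ranks */
    MPI_Comm_rank(MPI_COMM_WORLD, &amp;amp;world_rank);         /* rank of this process */
    MPI_Get_processor_name(processor_name, &amp;amp;name_len);  /* host name of this rank */

    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    MPI_Finalize();
    return 0;
}&lt;/LI-CODE&gt;</description>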
      <pubDate>Thu, 25 Sep 2025 18:41:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-segfaults-with-fabric-other-than-shm/m-p/1719120#M12224</guid>
      <dc:creator>Bennet</dc:creator>
      <dc:date>2025-09-25T18:41:46Z</dc:date>
    </item>
    <item>
      <title>Re: MPI segfaults with fabric other than shm</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-segfaults-with-fabric-other-than-shm/m-p/1719127#M12225</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks for the detailed issue report!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Can you please try reproducing the issue with a newer Intel MPI version, e.g. 2021.16 or 2021.16.1?&lt;/P&gt;&lt;P&gt;Also, other than&amp;nbsp;&lt;SPAN&gt;I_MPI_DEBUG, do you have any other I_MPI_* environment variables defined?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Finally, it would be useful to get a backtrace of the process that is segfaulting: e.g., you could enable core file generation, run gdb &amp;lt;/path/to/executable_that_crashes&amp;gt; &amp;lt;/path/to/core_file&amp;gt;, and type the 'bt' command to generate the backtrace.&lt;/SPAN&gt;&lt;/P&gt;
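&lt;P&gt;&lt;SPAN&gt;For example (a rough sketch; the core file name and location depend on /proc/sys/kernel/core_pattern, and on RHEL 9 cores may be captured by systemd-coredump, in which case 'coredumpctl gdb' retrieves the most recent one):&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;$ ulimit -c unlimited          # allow core dumps in this shell
$ mpirun -np 1 ./hello         # reproduce the segfault
$ gdb ./hello ./core.&amp;lt;pid&amp;gt;     # load the binary together with its core file
(gdb) bt                       # print the backtrace of the crashing thread&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Best regards,&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Sergey&lt;/SPAN&gt;&lt;/P&gt;</description>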
      <pubDate>Thu, 25 Sep 2025 20:15:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-segfaults-with-fabric-other-than-shm/m-p/1719127#M12225</guid>
      <dc:creator>Sergey_K_Intel3</dc:creator>
      <dc:date>2025-09-25T20:15:05Z</dc:date>
    </item>
    <item>
      <title>Re: MPI segfaults with fabric other than shm</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-segfaults-with-fabric-other-than-shm/m-p/1719650#M12227</link>
      <description>&lt;P&gt;Hi, Sergey,&lt;/P&gt;&lt;P&gt;I installed MPI 2021.16.1, and that did not change the situation. The only environment variables set were those set by setvars.sh.&lt;BR /&gt;&lt;BR /&gt;I went back and started over with two machines freshly installed from scratch with Red Hat 9.4 and&amp;nbsp;&lt;SPAN class=""&gt;MLNX_OFED_LINUX-24.10-2.1.8.0-rhel9.4-ext&lt;/SPAN&gt;, installed with the installer's --base and --hpc options.&lt;/P&gt;&lt;P&gt;This time around, I noticed that the Mellanox installer suggests that running 'dracut -f' might be necessary. I *think* it may not have been run in the first round of installation, but I definitely did run it on both nodes this time.&lt;/P&gt;&lt;P&gt;That seems to have been my problem: after running 'dracut -f' on all nodes, MPI is now running across multiple nodes.&amp;nbsp;&lt;LI-EMOJI id="lia_slightly-smiling-face" title=":slightly_smiling_face:"&gt;&lt;/LI-EMOJI&gt;&lt;/P&gt;
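&lt;P&gt;For anyone who hits the same thing, this is roughly the sequence that worked (a sketch; the installer path is illustrative, and the reboot simply ensures the rebuilt initramfs is actually loaded):&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;# on each node, from the unpacked MLNX_OFED_LINUX-24.10-2.1.8.0-rhel9.4-ext directory
$ sudo ./mlnxofedinstall --base --hpc   # install the base and HPC package sets
$ sudo dracut -f                        # rebuild the initramfs so it includes the new OFED drivers
$ sudo reboot                           # boot with the rebuilt initramfs&lt;/LI-CODE&gt;&lt;P&gt;Thanks for your assistance!&lt;/P&gt;&lt;P&gt;-- bennet&lt;/P&gt;</description>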
      <pubDate>Mon, 29 Sep 2025 12:37:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-segfaults-with-fabric-other-than-shm/m-p/1719650#M12227</guid>
      <dc:creator>Bennet</dc:creator>
      <dc:date>2025-09-29T12:37:46Z</dc:date>
    </item>
  </channel>
</rss>

