<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Intel MPI mlx provider issue in Intel® MPI Library</title>
    <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-mlx-provider-issue/m-p/1689284#M12142</link>
    <description>&lt;P&gt;&lt;a href="https://community.intel.com/t5/user/viewprofilepage/user-id/253219"&gt;@Antonio_D&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks. Can you please also attach the output of the failing srun command, e.g.:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;I_MPI_DEBUG=30 I_MPI_HYDRA_DEBUG=1 srun IMB-MPI1 allreduce
&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Mon, 12 May 2025 15:47:25 GMT</pubDate>
    <dc:creator>TobiasK</dc:creator>
    <dc:date>2025-05-12T15:47:25Z</dc:date>
    <item>
      <title>Intel MPI mlx provider issue</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-mlx-provider-issue/m-p/1683764#M12127</link>
      <description>&lt;P&gt;I have a program, compiled with oneAPI 2025.1, that runs just fine with&amp;nbsp;I_MPI_OFI_PROVIDER=verbs (or any other provider, really), but will not run with&amp;nbsp;I_MPI_OFI_PROVIDER=mlx.&lt;/P&gt;&lt;P&gt;I_MPI_DEBUG=30 output:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;[0] MPI startup(): PMI API: pmix
[0] MPI startup(): PMIx version: OpenPMIx 5.0.7 (PMIx Standard: 5.1, Stable ABI: 5.0, Provisional ABI: 5.0)
[0] MPI startup(): Intel(R) MPI Library, Version 2021.15  Build 20250213 (id: d233448)
[0] MPI startup(): Copyright (C) 2003-2025 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric loaded: libfabric.so.1 
[0] MPI startup(): libfabric version: 1.21.0-impi
libfabric:1780409:1744907435::core:core:ofi_register_provider():530&amp;lt;info&amp;gt; registering provider: verbs (121.0)
libfabric:1780409:1744907435::core:core:ofi_register_provider():557&amp;lt;info&amp;gt; "verbs" filtered by provider include/exclude list, skipping
libfabric:1780409:1744907435::core:core:ofi_register_provider():530&amp;lt;info&amp;gt; registering provider: verbs (121.0)
libfabric:1780409:1744907435::core:core:ofi_register_provider():557&amp;lt;info&amp;gt; "verbs" filtered by provider include/exclude list, skipping
libfabric:1780409:1744907435::core:core:ofi_register_provider():530&amp;lt;info&amp;gt; registering provider: tcp (121.0)
libfabric:1780409:1744907435::core:core:ofi_register_provider():557&amp;lt;info&amp;gt; "tcp" filtered by provider include/exclude list, skipping
libfabric:1780409:1744907435::core:core:ofi_register_provider():530&amp;lt;info&amp;gt; registering provider: shm (200.0)
libfabric:1780409:1744907435::core:core:ofi_register_provider():557&amp;lt;info&amp;gt; "shm" filtered by provider include/exclude list, skipping
libfabric:1780409:1744907435::core:core:ofi_register_provider():530&amp;lt;info&amp;gt; registering provider: ofi_rxm (121.0)
libfabric:1780409:1744907435::core:core:ofi_register_provider():530&amp;lt;info&amp;gt; registering provider: psm2 (121.0)
libfabric:1780409:1744907435::core:core:ofi_register_provider():557&amp;lt;info&amp;gt; "psm2" filtered by provider include/exclude list, skipping
libfabric:1780409:1744907435::core:core:ofi_register_provider():530&amp;lt;info&amp;gt; registering provider: psm3 (707.0)
libfabric:1780409:1744907435::core:core:ofi_register_provider():557&amp;lt;info&amp;gt; "psm3" filtered by provider include/exclude list, skipping
libfabric:1780409:1744907435::core:core:ofi_register_provider():530&amp;lt;info&amp;gt; registering provider: mlx (1.4)
libfabric:1780409:1744907435::core:core:ofi_reg_dl_prov():675&amp;lt;warn&amp;gt; dlopen(/projects/site/gred/smpg/software/oneAPI/2025.1/mpi/2021.15/opt/mpi/libfabric/lib/prov/libefa-fi.so): libefa.so.1: cannot open shared object file: No such file or directory
libfabric:1780409:1744907435::core:core:ofi_register_provider():530&amp;lt;info&amp;gt; registering provider: ofi_hook_noop (121.0)
libfabric:1780409:1744907435::core:core:ofi_register_provider():530&amp;lt;info&amp;gt; registering provider: off_coll (121.0)
libfabric:1780409:1744907435::core:core:fi_getinfo_():1449&amp;lt;info&amp;gt; Found provider with the highest priority mlx, must_use_util_prov = 0
[0] MPI startup(): max_ch4_vnis: 1, max_reg_eps 64, enable_sep 0, enable_shared_ctxs 0, do_av_insert 0
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
libfabric:1780409:1744907435::core:core:fi_getinfo_():1449&amp;lt;info&amp;gt; Found provider with the highest priority mlx, must_use_util_prov = 0
[0] MPI startup(): libfabric provider: mlx
libfabric:1780409:1744907435::core:core:fi_fabric_():1745&amp;lt;info&amp;gt; Opened fabric: mlx
libfabric:1780409:1744907435::core:core:fi_fabric_():1756&amp;lt;info&amp;gt; Using mlx provider 1.21, path:/projects/site/gred/smpg/software/oneAPI/2025.1/mpi/2021.15/opt/mpi/libfabric/lib/prov/libmlx-fi.so
[0] MPI startup(): addrnamelen: 1024
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Unknown error class, error stack:
MPIR_Init_thread(196)........: 
MPID_Init(1719)..............: 
MPIDI_OFI_mpi_init_hook(1741): 
MPIDU_bc_table_create(340)...: Missing hostname or invalid host/port description in business card
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Unknown error class, error stack:
MPIR_Init_thread(196)........: 
MPID_Init(1719)..............: 
MPIDI_OFI_mpi_init_hook(1741): 
MPIDU_bc_table_create(340)...: Missing hostname or invalid host/port description in business card
slurmstepd: error: *** STEP 16814.0 ON sc1nc124 CANCELLED AT 2025-04-17T09:30:36 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: sc1nc124: tasks 0-11: Killed&lt;/LI-CODE&gt;&lt;P&gt;ucx_info -v:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;# Library version: 1.16.0
# Library path: /usr/lib64/libucs.so.0
# API headers version: 1.16.0
# Git branch '', revision 02432d3
# Configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-go --without-java --enable-cma --with-cuda --with-gdrcopy --with-verbs --with-knem --with-rdmacm --without-rocm --with-xpmem --without-fuse3 --without-ugni --with-cuda=/usr/local/cuda-12.2&lt;/LI-CODE&gt;&lt;P&gt;ucx_info -d | grep Transport:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;#      Transport: self
#      Transport: tcp
#      Transport: tcp
#      Transport: tcp
#      Transport: sysv
#      Transport: posix
#      Transport: dc_mlx5
#      Transport: rc_verbs
#      Transport: rc_mlx5
#      Transport: ud_verbs
#      Transport: ud_mlx5
#      Transport: cma
#      Transport: xpmem&lt;/LI-CODE&gt;&lt;P&gt;From all of the output above, mlx looks like it should be available.&amp;nbsp; This looks like a cluster configuration issue, but I don't know where to start troubleshooting.&amp;nbsp; The SLURM job scheduler is in use.&lt;/P&gt;</description>
      <pubDate>Thu, 17 Apr 2025 18:27:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-mlx-provider-issue/m-p/1683764#M12127</guid>
      <dc:creator>Antonio_D</dc:creator>
      <dc:date>2025-04-17T18:27:39Z</dc:date>
    </item>
    <item>
      <title>Re: Intel MPI mlx provider issue</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-mlx-provider-issue/m-p/1689258#M12139</link>
      <description>&lt;P&gt;&lt;a href="https://community.intel.com/t5/user/viewprofilepage/user-id/253219"&gt;@Antonio_D&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Can you please provide the output of&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;fi_info&lt;/LI-CODE&gt;
&lt;P&gt;and also (without setting FI_PROVIDER)&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;I_MPI_DEBUG=30 I_MPI_HYDRA_DEBUG=1 mpirun IMB-MPI1 allreduce&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 12 May 2025 13:34:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-mlx-provider-issue/m-p/1689258#M12139</guid>
      <dc:creator>TobiasK</dc:creator>
      <dc:date>2025-05-12T13:34:33Z</dc:date>
    </item>
    <item>
      <title>Re: Intel MPI mlx provider issue</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-mlx-provider-issue/m-p/1689276#M12141</link>
      <description>&lt;P&gt;See the attached files.&amp;nbsp; We have recently found that the mlx provider works properly with mpirun, but not with srun (when submitting to SLURM via an sbatch script).&lt;/P&gt;</description>
      <pubDate>Mon, 12 May 2025 15:36:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-mlx-provider-issue/m-p/1689276#M12141</guid>
      <dc:creator>Antonio_D</dc:creator>
      <dc:date>2025-05-12T15:36:35Z</dc:date>
    </item>
    <item>
      <title>Re: Intel MPI mlx provider issue</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-mlx-provider-issue/m-p/1689284#M12142</link>
      <description>&lt;P&gt;&lt;a href="https://community.intel.com/t5/user/viewprofilepage/user-id/253219"&gt;@Antonio_D&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks. Can you please also attach the output of the failing srun command, e.g.:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;I_MPI_DEBUG=30 I_MPI_HYDRA_DEBUG=1 srun IMB-MPI1 allreduce
&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 12 May 2025 15:47:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-mlx-provider-issue/m-p/1689284#M12142</guid>
      <dc:creator>TobiasK</dc:creator>
      <dc:date>2025-05-12T15:47:25Z</dc:date>
    </item>
    <item>
      <title>Re: Intel MPI mlx provider issue</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-mlx-provider-issue/m-p/1689296#M12143</link>
      <description>&lt;P&gt;See attached.&amp;nbsp; When I run it with pmix using the following:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;I_MPI_DEBUG=30 I_MPI_HYDRA_DEBUG=1 I_MPI_PMI_LIBRARY=/.../software/pmix/lib/libpmix.so srun --mpi=pmix IMB-MPI1 allreduce&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I get the error, also attached.&lt;/P&gt;</description>
      <pubDate>Mon, 12 May 2025 16:06:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-mlx-provider-issue/m-p/1689296#M12143</guid>
      <dc:creator>Antonio_D</dc:creator>
      <dc:date>2025-05-12T16:06:21Z</dc:date>
    </item>
    <item>
      <title>Re: Intel MPI mlx provider issue</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-mlx-provider-issue/m-p/1690359#M12151</link>
      <description>&lt;P&gt;Can you please also provide the log with I_MPI_DEBUG=1000?&lt;/P&gt;</description>
      <pubDate>Fri, 16 May 2025 08:54:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-mlx-provider-issue/m-p/1690359#M12151</guid>
      <dc:creator>TobiasK</dc:creator>
      <dc:date>2025-05-16T08:54:34Z</dc:date>
    </item>
    <item>
      <title>Re: Intel MPI mlx provider issue</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-mlx-provider-issue/m-p/1690418#M12152</link>
      <description>&lt;P&gt;See the attached files with&amp;nbsp;&lt;SPAN&gt;I_MPI_DEBUG=1000&lt;/SPAN&gt; set.&lt;/P&gt;</description>
      <pubDate>Fri, 16 May 2025 15:57:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-mlx-provider-issue/m-p/1690418#M12152</guid>
      <dc:creator>Antonio_D</dc:creator>
      <dc:date>2025-05-16T15:57:16Z</dc:date>
    </item>
    <item>
      <title>Re: Intel MPI mlx provider issue</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-mlx-provider-issue/m-p/1690423#M12153</link>
      <description>&lt;P&gt;&lt;a href="https://community.intel.com/t5/user/viewprofilepage/user-id/253219"&gt;@Antonio_D&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;Can you please try to set:&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;I_MPI_HYDRA_IFACE=ib0&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Also, since it is working with mpirun, is Slurm set up correctly? E.g., are you setting:&lt;/P&gt;
&lt;P&gt;&lt;I&gt;PropagateResourceLimitsExcept=MEMLOCK&lt;/I&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
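&lt;P&gt;(For reference, a minimal sketch assuming a bash shell and a standard Slurm setup: the mlx/UCX path registers memory for RDMA, so the locked-memory limit that Slurm propagates into the job step typically needs to be unlimited.)&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# locked-memory limit on the submit node
ulimit -l
# locked-memory limit inside a job step; should also report "unlimited"
srun -N1 -n1 bash -c 'ulimit -l'
# slurm.conf setting from the FAQ linked below:
# PropagateResourceLimitsExcept=MEMLOCK&lt;/LI-CODE&gt;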
&lt;P&gt;&lt;BR /&gt;&lt;A href="https://slurm.schedmd.com/faq.html#memlock" target="_blank"&gt;https://slurm.schedmd.com/faq.html#memlock&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 16 May 2025 16:21:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-mlx-provider-issue/m-p/1690423#M12153</guid>
      <dc:creator>TobiasK</dc:creator>
      <dc:date>2025-05-16T16:21:57Z</dc:date>
    </item>
    <item>
      <title>Re: Intel MPI mlx provider issue</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-mlx-provider-issue/m-p/1690424#M12154</link>
      <description>&lt;P&gt;Same result with&amp;nbsp;&lt;SPAN&gt;I_MPI_HYDRA_IFACE=ib0&lt;/SPAN&gt;.&lt;/P&gt;</description>
      <pubDate>Fri, 16 May 2025 16:20:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-mlx-provider-issue/m-p/1690424#M12154</guid>
      <dc:creator>Antonio_D</dc:creator>
      <dc:date>2025-05-16T16:20:37Z</dc:date>
    </item>
  </channel>
</rss>

