<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: MPI_Put: RDMA_WRITE error while MPI_Accumulate works fine in Intel® MPI Library</title>
    <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-Put-RDMA-WRITE-error-which-MPI-Accumulate-works-fine/m-p/1638759#M11949</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;This problem does not reproduce on our internal system.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Could you refer to the URL below and configure the '&lt;/SPAN&gt;I_MPI_PMI_LIBRARY'&amp;nbsp;&lt;SPAN&gt;environment variable?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;A href="https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-13/other-environment-variables.html#GUID-6B9D4E5C-8582-42E6-B7DA-72C87622357D" target="_blank" rel="noopener"&gt;https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-13/other-environment-variables.html#GUID-6B9D4E5C-8582-42E6-B7DA-72C87622357D&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Thanks.&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 23 Oct 2024 05:25:50 GMT</pubDate>
    <dc:creator>taehunkim</dc:creator>
    <dc:date>2024-10-23T05:25:50Z</dc:date>
    <item>
      <title>MPI_Put: RDMA_WRITE error while MPI_Accumulate works fine</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-Put-RDMA-WRITE-error-which-MPI-Accumulate-works-fine/m-p/1637148#M11938</link>
      <description>&lt;P&gt;Hello!&lt;BR /&gt;&lt;BR /&gt;I'm trying to use one-sided communication for load balancing with MPI.&lt;BR /&gt;The algorithm steals some jobs from other MPI ranks. For this, it does MPI_Win_lock, MPI_Get, some computing, MPI_Put, MPI_Win_unlock. For the rank that owns the memory it works fine, but calling MPI_Put from other ranks leads to:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;[vn01:3300889:0:3300889] ib_mlx5_log.c:177  Remote access on mlx5_bond_0:1/RoCE (synd 0x13 vend 0x88 hw_synd 0/0)
[vn01:3300889:0:3300889] ib_mlx5_log.c:177  RC QP 0x2f537 wqe[3]: RDMA_WRITE --- [rva 0x7fd2d7be90a8 rkey 0x2e76c7] [inl len 16] [rqpn 0x2f541 dlid=0 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.152.0.10 sgid_index=3 traffic_class=0]&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;With a nice backtrace:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;Image              PC                Routine            Line        Source
libpthread-2.28.s  00007FD0A06EECF0  Unknown               Unknown  Unknown
libc-2.28.so       00007FD09FBD7ACF  gsignal               Unknown  Unknown
libc-2.28.so       00007FD09FBAAEA5  abort                 Unknown  Unknown
libucs.so.0.0.0    00007FD09C6922E6  Unknown               Unknown  Unknown
libucs.so.0.0.0    00007FD09C6974F4  ucs_log_default_h     Unknown  Unknown
libucs.so.0.0.0    00007FD09C697814  ucs_log_dispatch      Unknown  Unknown
libuct_ib.so.0.0.  00007FD09BD314FA  uct_ib_mlx5_compl     Unknown  Unknown
libuct_ib.so.0.0.  00007FD09BD483A0  Unknown               Unknown  Unknown
libuct_ib.so.0.0.  00007FD09BD32F9D  uct_ib_mlx5_check     Unknown  Unknown
libuct_ib.so.0.0.  00007FD09BD463AA  Unknown               Unknown  Unknown
libucp.so.0.0.0    00007FD09CC5282A  ucp_worker_progre     Unknown  Unknown
libucp.so.0.0.0    00007FD09CC6A318  ucp_worker_flush      Unknown  Unknown
libmlx-fi.so       00007FD09CEDD50D  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FD0A0F5205F  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FD0A0F64E46  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FD0A0F43593  PMPI_Win_unlock       Unknown  Unknown
libmpifort.so.12.  00007FD0AAC4A01D  mpi_win_unlock__      Unknown  Unknown
a.out              0000000000405B9B  Unknown               Unknown  Unknown
a.out              0000000000405DCC  Unknown               Unknown  Unknown
a.out              00000000004052AD  Unknown               Unknown  Unknown
libc-2.28.so       00007FD09FBC3D85  __libc_start_main     Unknown  Unknown
a.out              00000000004051CE  Unknown               Unknown  Unknown&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The same code works fine with OpenMPI, and replacing MPI_Put with MPI_Accumulate also works fine. To try that, uncomment lines 170 and 216 and remove the MPI_Put calls.&lt;BR /&gt;&lt;BR /&gt;In the attachment you will find a non-minimal but relatively clear example that reproduces the failure. It fails for 2-5 MPI ranks because of the task scheduling.&lt;BR /&gt;&lt;BR /&gt;compilation:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;mpiifx put.f90 -cpp&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;running:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;mpirun -n 4 ./a.out&lt;/LI-CODE&gt;&lt;P&gt;&lt;BR /&gt;Used Intel MPI 2021.13 and IFX 2024.2.1 (from the latest HPC toolkit).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Igor&lt;/P&gt;</description>
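      <!--
        Below is a minimal Fortran sketch of the passive-target pattern the post describes
        (MPI_Win_lock, MPI_Get, some computing, MPI_Put, MPI_Win_unlock). It is not the attached
        put.f90; the window, buffer, and job-count names are placeholders. The commented
        MPI_Accumulate call is the MPI_REPLACE form that the author reports works where MPI_Put fails.

        program put_sketch
          use mpi_f08
          implicit none
          type(MPI_Win) :: queue_win
          integer :: rank, nranks, peer
          integer, parameter :: njobs = 8
          integer :: jobs(njobs)       ! window memory exposed by every rank (placeholder "job list")
          integer :: stolen(njobs)     ! local copy of the stolen jobs
          integer(kind=MPI_ADDRESS_KIND) :: disp

          call MPI_Init()
          call MPI_Comm_rank(MPI_COMM_WORLD, rank)
          call MPI_Comm_size(MPI_COMM_WORLD, nranks)

          jobs = rank                  ! each rank publishes its own jobs
          call MPI_Win_create(jobs, int(njobs * 4, MPI_ADDRESS_KIND), 4, &
                              MPI_INFO_NULL, MPI_COMM_WORLD, queue_win)

          peer = mod(rank + 1, nranks) ! "steal" from the next rank
          disp = 0

          call MPI_Win_lock(MPI_LOCK_EXCLUSIVE, peer, 0, queue_win)
          call MPI_Get(stolen, njobs, MPI_INTEGER, peer, disp, njobs, MPI_INTEGER, queue_win)
          call MPI_Win_flush(peer, queue_win)   ! make the fetched data usable before computing
          stolen = stolen + 1                   ! "some computing"
          call MPI_Put(stolen, njobs, MPI_INTEGER, peer, disp, njobs, MPI_INTEGER, queue_win)
          ! Workaround reported in this thread: the equivalent accumulate works fine:
          ! call MPI_Accumulate(stolen, njobs, MPI_INTEGER, peer, disp, njobs, &
          !                     MPI_INTEGER, MPI_REPLACE, queue_win)
          call MPI_Win_unlock(peer, queue_win)  ! completes the Put; the crash is reported here

          call MPI_Win_free(queue_win)
          call MPI_Finalize()
        end program put_sketch
      -->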
      <pubDate>Mon, 14 Oct 2024 18:01:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-Put-RDMA-WRITE-error-which-MPI-Accumulate-works-fine/m-p/1637148#M11938</guid>
      <dc:creator>foxtran</dc:creator>
      <dc:date>2024-10-14T18:01:54Z</dc:date>
    </item>
    <item>
      <title>Re: MPI_Put: RDMA_WRITE error while MPI_Accumulate works fine</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-Put-RDMA-WRITE-error-which-MPI-Accumulate-works-fine/m-p/1638759#M11949</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;This problem does not reproduce on our internal system.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Could you refer to the URL below and configure the '&lt;/SPAN&gt;I_MPI_PMI_LIBRARY'&amp;nbsp;&lt;SPAN&gt;environment variable?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;A href="https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-13/other-environment-variables.html#GUID-6B9D4E5C-8582-42E6-B7DA-72C87622357D" target="_blank" rel="noopener"&gt;https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-13/other-environment-variables.html#GUID-6B9D4E5C-8582-42E6-B7DA-72C87622357D&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Thanks.&lt;/SPAN&gt;&lt;/P&gt;</description>
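      <!--
        A sketch of the suggested setting, assuming a Slurm-provided PMI library; the library path
        below is an assumption and varies by system. I_MPI_PMI_LIBRARY tells Intel MPI which
        external PMI library to load, and is mainly relevant when launching with srun.

        export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so   # assumed path; check your system
        srun -n 4 ./a.out                                # or mpirun, as in the original report
      -->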
      <pubDate>Wed, 23 Oct 2024 05:25:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-Put-RDMA-WRITE-error-which-MPI-Accumulate-works-fine/m-p/1638759#M11949</guid>
      <dc:creator>taehunkim</dc:creator>
      <dc:date>2024-10-23T05:25:50Z</dc:date>
    </item>
    <item>
      <title>Re: MPI_Put: RDMA_WRITE error while MPI_Accumulate works fine</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-Put-RDMA-WRITE-error-which-MPI-Accumulate-works-fine/m-p/1638814#M11951</link>
      <description>&lt;P&gt;Hmm... Are there some ways to get more details about this error on my machine?&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 23 Oct 2024 09:19:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-Put-RDMA-WRITE-error-which-MPI-Accumulate-works-fine/m-p/1638814#M11951</guid>
      <dc:creator>foxtran</dc:creator>
      <dc:date>2024-10-23T09:19:22Z</dc:date>
    </item>
    <item>
      <title>Re: MPI_Put: RDMA_WRITE error while MPI_Accumulate works fine</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-Put-RDMA-WRITE-error-which-MPI-Accumulate-works-fine/m-p/1638946#M11953</link>
      <description>&lt;P class=""&gt;Hi,&lt;/P&gt;&lt;P class=""&gt;Here are some steps to troubleshoot and resolve this issue:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Check OFI Providers: Ensure that the necessary OFI providers are installed on your system. You can check the available providers by running:&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;$ fi_info&lt;/P&gt;&lt;P class=""&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;This command should list the available fabric interfaces. If it returns "No data available," it means no suitable providers are found.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Configure Intel MPI to Use a Specific Provider: Sometimes, specifying a particular provider can help. You can set the&amp;nbsp;FI_PROVIDER&amp;nbsp;environment variable to a specific provider that is available on your system.&amp;nbsp; For example:&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; $ export FI_PROVIDER=sockets&lt;/P&gt;&lt;P class=""&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;You can add this line to your Slurm job script before the&amp;nbsp;mpirun&amp;nbsp;or&amp;nbsp;srun&amp;nbsp;command.&lt;/P&gt;&lt;P class=""&gt;&amp;nbsp;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Check Network Configuration: Ensure that the network interfaces on your nodes are properly configured and accessible. The OFI provider might be looking for specific high-performance network interfaces (like InfiniBand or Omni-Path) that are not configured or available.&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;&amp;nbsp;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Intel MPI Configuration: Intel MPI can be configured to use different communication fabrics. You can try setting the&amp;nbsp;I_MPI_FABRICS&amp;nbsp;environment variable to use a different fabric. For example:&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; $ export I_MPI_FABRICS=shm:ofi&amp;nbsp;&amp;nbsp;or&amp;nbsp;&amp;nbsp;&amp;nbsp;$ export I_MPI_FABRICS=shm:tcp&lt;/P&gt;&lt;P class=""&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Add this line to your Slurm job script before the&amp;nbsp;mpirun&amp;nbsp;or&amp;nbsp;srun&amp;nbsp;command.&lt;/P&gt;&lt;P class=""&gt;&amp;nbsp;&lt;/P&gt;&lt;P class=""&gt;You can find more guidance here (&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A class="" href="https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-13/ofi-capable-network-fabrics-control.html" target="_blank" rel="nofollow noopener noreferrer"&gt;https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-13/ofi-capable-network-fabrics-control.html&lt;/A&gt;&amp;nbsp;)&lt;/P&gt;&lt;P class=""&gt;&amp;nbsp;&lt;/P&gt;&lt;P class=""&gt;Thanks&lt;/P&gt;</description>
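      <!--
        A sketch of how the settings suggested above might be combined in a Slurm job script.
        Assumptions: the node/task counts are placeholders, and "mlx" is only an example provider;
        use a provider that fi_info actually lists on your system, or omit FI_PROVIDER entirely.

        #!/bin/bash
        #SBATCH -N 1
        #SBATCH -n 4

        export FI_PROVIDER=mlx          # example provider; must appear in fi_info output
        export I_MPI_FABRICS=shm:ofi    # intra-node shared memory, inter-node OFI

        mpirun -n 4 ./a.out
      -->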
      <pubDate>Thu, 24 Oct 2024 03:58:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-Put-RDMA-WRITE-error-which-MPI-Accumulate-works-fine/m-p/1638946#M11953</guid>
      <dc:creator>taehunkim</dc:creator>
      <dc:date>2024-10-24T03:58:23Z</dc:date>
    </item>
    <item>
      <title>Re: MPI_Put: RDMA_WRITE error while MPI_Accumulate works fine</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-Put-RDMA-WRITE-error-which-MPI-Accumulate-works-fine/m-p/1638974#M11954</link>
      <description>&lt;P&gt;Hi!&lt;BR /&gt;&lt;BR /&gt;In short, the problem occurs only with `I_MPI_FABRICS=ofi`. It looks like a problem with choosing some default provider when it is not available.&lt;BR /&gt;&lt;BR /&gt;On my machine, I have the following OFI providers:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;$ fi_info | grep provider | sort -u
provider: mlx
provider: psm3
provider: shm
provider: tcp
provider: tcp;ofi_rxm
provider: verbs
provider: verbs;ofi_rxm&lt;/LI-CODE&gt;&lt;P&gt;I checked all of them, and all of them work fine &lt;LI-EMOJI id="lia_slightly-smiling-face" title=":slightly_smiling_face:"&gt;&lt;/LI-EMOJI&gt;&lt;BR /&gt;&lt;BR /&gt;Then I set `I_MPI_FABRICS=54543`, and I saw the following message:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;MPI startup(): 54543 fabric is unknown or has been removed from the product, please use ofi or shm:ofi instead.&lt;/LI-CODE&gt;&lt;P&gt;&lt;BR /&gt;As you can see, I do not have an `ofi` (or `shm:ofi`) provider, but MPI suggests using it.&lt;BR /&gt;&lt;BR /&gt;So I set `I_MPI_FABRICS=ofi` and then I saw my error with RDMA_WRITE. At the same time, `I_MPI_FABRICS=shm:ofi` works fine &lt;LI-EMOJI id="lia_slightly-smiling-face" title=":slightly_smiling_face:"&gt;&lt;/LI-EMOJI&gt;&lt;BR /&gt;&lt;BR /&gt;Igor&lt;/P&gt;</description>
      <pubDate>Thu, 24 Oct 2024 07:33:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-Put-RDMA-WRITE-error-which-MPI-Accumulate-works-fine/m-p/1638974#M11954</guid>
      <dc:creator>foxtran</dc:creator>
      <dc:date>2024-10-24T07:33:43Z</dc:date>
    </item>
    <item>
      <title>Re: MPI_Put: RDMA_WRITE error while MPI_Accumulate works fine</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-Put-RDMA-WRITE-error-which-MPI-Accumulate-works-fine/m-p/1641346#M11978</link>
      <description>&lt;P&gt;&lt;a href="https://community.intel.com/t5/user/viewprofilepage/user-id/304753"&gt;@foxtran&lt;/a&gt;&amp;nbsp;OFI is not the provider; OFI provides the providers :)&lt;BR /&gt;&lt;BR /&gt;Are you running on an AMD platform?&lt;/P&gt;</description>
      <pubDate>Tue, 05 Nov 2024 12:28:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-Put-RDMA-WRITE-error-which-MPI-Accumulate-works-fine/m-p/1641346#M11978</guid>
      <dc:creator>TobiasK</dc:creator>
      <dc:date>2024-11-05T12:28:08Z</dc:date>
    </item>
  </channel>
</rss>

