<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: MPI Reduce Hangs in Intel® MPI Library</title>
    <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-Reduce-Hangs/m-p/1633830#M11927</link>
    <description>&lt;P&gt;Hi &lt;a href="https://community.intel.com/t5/user/viewprofilepage/user-id/387098"&gt;@AllenBarnett&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;Strictly speaking, you are running an unsupported OS.&lt;/P&gt;
&lt;P&gt;Still, the debug output may give some hints: can you execute the IMB-MPI1 benchmarks that we ship? Please post the output of:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;I_MPI_DEBUG=10 I_MPI_HYDRA_DEBUG=1 mpirun -host node0,node1 -np 64 -ppn 32 IMB-MPI1&lt;/LI-CODE&gt;
</description>
    <pubDate>Fri, 27 Sep 2024 13:54:59 GMT</pubDate>
    <dc:creator>TobiasK</dc:creator>
    <dc:date>2024-09-27T13:54:59Z</dc:date>
    <item>
      <title>MPI Reduce Hangs</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-Reduce-Hangs/m-p/1633828#M11926</link>
      <description>&lt;P&gt;Hi:&lt;/P&gt;&lt;P&gt;I have the latest oneAPI HPC toolkit (2024.2.1) installed on two machines running Pop!_OS (a derivative of Ubuntu 22.04 LTS). This C++ program:&lt;/P&gt;&lt;LI-CODE lang="cpp"&gt;#include &amp;lt;array&amp;gt;
#include &amp;lt;cstdio&amp;gt;
#include &amp;lt;mpi.h&amp;gt;

std::array&amp;lt;char, MPI_MAX_PROCESSOR_NAME&amp;gt; host;
int host_len{ 0 }; // filled in by MPI_Get_processor_name
int rank{ 0 };
int contribution, total;

int main( int argc, char* argv[] )
{
  MPI_Init( &amp;amp;argc, &amp;amp;argv );
  MPI_Get_processor_name( host.data(), &amp;amp;host_len );
  MPI_Comm_rank( MPI_COMM_WORLD, &amp;amp;rank );
  printf( "Rank %3d on host %s\n", rank, host.data() );
  contribution = rank;
  MPI_Reduce( &amp;amp;contribution, &amp;amp;total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD );
  if ( rank == 0 ) {
    printf( "Sum: %d\n", total );
  }
  MPI_Barrier( MPI_COMM_WORLD );
  MPI_Finalize();
  return 0;
}&lt;/LI-CODE&gt;&lt;P&gt;hangs in MPI_Reduce when the number of processes exceeds a certain size. For example:&lt;/P&gt;&lt;LI-CODE lang="bash"&gt;mpirun -np 32 -ppn 16 -host node0,node1 ./example&lt;/LI-CODE&gt;&lt;P&gt;works fine, but&lt;/P&gt;&lt;LI-CODE lang="bash"&gt;mpirun -np 64 -ppn 32 -host node0,node1 ./example&lt;/LI-CODE&gt;&lt;P&gt;hangs with 100% CPU utilization across all 64 processes on both nodes.&lt;/P&gt;&lt;P&gt;I tried this program with OpenMPI 4.1.2 and it appears to work correctly for all -np values.&lt;/P&gt;&lt;P&gt;How can I diagnose this issue?&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Allen&lt;/P&gt;</description>
      <pubDate>Fri, 27 Sep 2024 13:46:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-Reduce-Hangs/m-p/1633828#M11926</guid>
      <dc:creator>AllenBarnett</dc:creator>
      <dc:date>2024-09-27T13:46:27Z</dc:date>
    </item>
    <item>
      <title>Re: MPI Reduce Hangs</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-Reduce-Hangs/m-p/1633830#M11927</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.intel.com/t5/user/viewprofilepage/user-id/387098"&gt;@AllenBarnett&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;Strictly speaking, you are running an unsupported OS.&lt;/P&gt;
&lt;P&gt;Still, the debug output may give some hints: can you execute the IMB-MPI1 benchmarks that we ship? Please post the output of:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;I_MPI_DEBUG=10 I_MPI_HYDRA_DEBUG=1 mpirun -host node0,node1 -np 64 -ppn 32 IMB-MPI1&lt;/LI-CODE&gt;
</description>
      <pubDate>Fri, 27 Sep 2024 13:54:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-Reduce-Hangs/m-p/1633830#M11927</guid>
      <dc:creator>TobiasK</dc:creator>
      <dc:date>2024-09-27T13:54:59Z</dc:date>
    </item>
    <item>
      <title>Re: MPI Reduce Hangs</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-Reduce-Hangs/m-p/1633840#M11928</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.intel.com/t5/user/viewprofilepage/user-id/245425"&gt;@TobiasK&lt;/a&gt;: I can run IMB-MPI1 on either machine and it works fine. But when I run across both machines, it appears to hang before even reaching the first test, even with just a couple of processes. See the attached output, which was just "-np 4 -ppn 2".&lt;/P&gt;&lt;P&gt;The processes run at 100% CPU; I had to Ctrl-C to stop them.&lt;/P&gt;&lt;P&gt;Which OSes are officially supported?&lt;/P&gt;&lt;P&gt;Thanks,&lt;BR /&gt;Allen&lt;/P&gt;</description>
      <pubDate>Fri, 27 Sep 2024 14:21:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-Reduce-Hangs/m-p/1633840#M11928</guid>
      <dc:creator>AllenBarnett</dc:creator>
      <dc:date>2024-09-27T14:21:30Z</dc:date>
    </item>
    <item>
      <title>Re: MPI Reduce Hangs</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-Reduce-Hangs/m-p/1637894#M11941</link>
      <description>&lt;P&gt;I have a similar issue. I have oneAPI 2024.2 with Intel MPI 2021.13 installed on a new Linux cluster running Red Hat 9. We have had a problem with large jobs failing, and most often the point of failure is an MPI_ALLREDUCE. I created a smaller 4-process test case, which I run across four 64-core nodes, launching 64 identical jobs simultaneously, with print statements added before and after the MPI call. I generally see 5 to 10 of the 64 jobs hang. The print statements indicate that all 4 processes make the call to MPI_ALLREDUCE, but only 1, 2, or 3 return. This does not happen right away: these jobs can run thousands of iterations successfully before the hang. The failures do not occur if all 4 processes are assigned to the same node, but if I split the job across two or four nodes, the failures occur every time I run the test.&lt;/P&gt;&lt;P&gt;I also did a test where I broke the ALLREDUCE into a REDUCE + BCAST. The result is the same, except I can see that it is the root process of the REDUCE call that does not return when the failure occurs.&lt;/P&gt;&lt;P&gt;I ran this test with both MPICH and Open MPI, and both were successful.&lt;/P&gt;</description>
      <pubDate>Thu, 17 Oct 2024 20:04:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-Reduce-Hangs/m-p/1637894#M11941</guid>
      <dc:creator>Kevin_McGrattan</dc:creator>
      <dc:date>2024-10-17T20:04:41Z</dc:date>
    </item>
    <item>
      <title>Re: MPI Reduce Hangs</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-Reduce-Hangs/m-p/1637895#M11942</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.intel.com/t5/user/viewprofilepage/user-id/78902"&gt;@Kevin_McGrattan&lt;/a&gt; : I don't have anything to add. I've been working on other things. Thanks for the confirmation, though &lt;LI-EMOJI id="lia_slightly-smiling-face" title=":slightly_smiling_face:"&gt;&lt;/LI-EMOJI&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 17 Oct 2024 20:25:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-Reduce-Hangs/m-p/1637895#M11942</guid>
      <dc:creator>AllenBarnett</dc:creator>
      <dc:date>2024-10-17T20:25:19Z</dc:date>
    </item>
    <item>
      <title>Re: MPI Reduce Hangs</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/MPI-Reduce-Hangs/m-p/1641327#M11973</link>
      <description>&lt;P&gt;&lt;a href="https://community.intel.com/t5/user/viewprofilepage/user-id/387098"&gt;@AllenBarnett&lt;/a&gt; please check whether &lt;CODE class="bp-text-code txt"&gt;enp68s0&lt;/CODE&gt; is the correct NIC for Intel MPI to use.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The pinning output on jaguar seems to be corrupted. What kind of platform are you using?&lt;/P&gt;
&lt;PRE class="bp-text bp-text-plain hljs bp-is-scrollable" tabindex="0"&gt;&lt;CODE class="bp-text-code txt"&gt;[0] MPI startup(): 0       0          enp68s0
[0] MPI startup(): 1       0          enp68s0
[0] MPI startup(): ===== CPU pinning =====
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       404457   tapir      {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47}
[0] MPI startup(): 1       404458   tapir      {16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63}
[0] MPI startup(): 2       823415   jaguar     {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
[0] MPI startup(): 3       -224450721            {0,1,2,5,6,8,10,18,19,20,21,22,23,24,28,29,30,34,35,36,37,38,39,40,41,42,43,44,45,48,50,51,52,53,54,55,56,57,59,60,61,62,63}&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Is passwordless ssh enabled?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You may also retry with the latest 2021.14 release. Supported operating systems are listed here:&lt;BR /&gt;&lt;A href="https://www.intel.com/content/www/us/en/developer/articles/system-requirements/mpi-library-system-requirements.html" target="_blank"&gt;https://www.intel.com/content/www/us/en/developer/articles/system-requirements/mpi-library-system-requirements.html&lt;/A&gt;&lt;/P&gt;
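If the debug output shows ranks binding to the wrong interface, one thing to try is forcing the interface explicitly. A sketch, assuming the OFI tcp provider is in use and that enp68s0 (the NIC from your log) is the one that reaches the other node:

```shell
# Pin both the Hydra launcher and the libfabric tcp provider to one NIC
export I_MPI_HYDRA_IFACE=enp68s0   # interface used by the Hydra process manager
export FI_TCP_IFACE=enp68s0        # interface used by the OFI tcp provider
I_MPI_DEBUG=10 mpirun -host node0,node1 -np 64 -ppn 32 ./example
```

With I_MPI_DEBUG=10 the startup banner reports the chosen provider and interface, so you can confirm the settings took effect.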
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 05 Nov 2024 10:56:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/MPI-Reduce-Hangs/m-p/1641327#M11973</guid>
      <dc:creator>TobiasK</dc:creator>
      <dc:date>2024-11-05T10:56:05Z</dc:date>
    </item>
  </channel>
</rss>

