Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPI Reduce Hangs

AllenBarnett

Hi:

I have the latest oneAPI hpckit (2024.2.1) installed on two machines running Pop!_OS (which is some form of Ubuntu 22.04 LTS). This C++ program:

#include <array>
#include <cstdio>
#include <mpi.h>

std::array<char, MPI_MAX_PROCESSOR_NAME> host;
int host_len{ static_cast<int>( host.size() ) };
int rank{ 0 };
int contribution, total;

int main( int argc, char* argv[] )
{
    MPI_Init( &argc, &argv );
    MPI_Get_processor_name( host.data(), &host_len );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    printf( "Rank %3d on host %s\n", rank, host.data() );
    // Each rank contributes its own rank number; rank 0 receives the sum.
    contribution = rank;
    MPI_Reduce( &contribution, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD );
    if ( rank == 0 ) {
        printf( "Sum: %d\n", total );
    }
    MPI_Barrier( MPI_COMM_WORLD );
    MPI_Finalize();
    return 0;
}

hangs in MPI_Reduce when the number of processes exceeds a certain count. For example:

mpirun -np 32 -ppn 16 -host node0,node1 ./example

works fine. But

mpirun -np 64 -ppn 32 -host node0,node1 ./example

hangs, with all 64 processes on both nodes at 100% CPU utilization.

I tried this program with OpenMPI 4.1.2 and it appears to work correctly for all -np values.

How can I diagnose this issue?

Thanks,

Allen

TobiasK
Moderator

Hi @AllenBarnett,

Strictly speaking, you are running an unsupported OS.

The following may still give some hints: can you run the IMB-MPI1 benchmarks that we ship? Please post the output of

I_MPI_DEBUG=10 I_MPI_HYDRA_DEBUG=1 mpirun -host node0,node1 -np 64 -ppn 32 IMB-MPI1

AllenBarnett

Hi @TobiasK: I can run IMB-MPI1 on either machine by itself and it works OK. But when I run across both machines, it appears to hang before reaching the first test, even with just a couple of processes. See the attached output, which is from a run with just "-np 4 -ppn 2".

The processes run at 100% CPU. I had to press Ctrl-C to stop it.

What OSes are officially supported?

Thanks,
Allen

Kevin_McGrattan

I have a similar issue. I have oneAPI/2024.2 with Intel MPI 2021.13 installed on a new Linux cluster running Red Hat 9. We've had a problem with large jobs failing, and most often the point of failure is an MPI_ALLREDUCE. I created a smaller 4-process test case, which I run across four 64-core nodes, launching 64 identical jobs simultaneously. I added print statements before and after the MPI call. I generally see 5 to 10 jobs out of the 64 hang. The print statements indicate that all 4 processes make the call to MPI_ALLREDUCE, but only 1, 2, or 3 return. This does not happen right away: these jobs can run thousands of iterations successfully before the hang. The failures do not occur if all 4 processes are assigned to the same node. If I split the job across two or four nodes, the failures occur every time I run the test.

I did a test where I broke the ALLREDUCE into a REDUCE + BCAST. The result is the same, but I see that the root process of the REDUCE call is the one that does not return when the failure occurs.

I ran this test with both MPICH and Open MPI and these tests were successful. 

AllenBarnett

Hi @Kevin_McGrattan: I don't have anything to add. I've been working on other things. Thanks for the confirmation, though.

TobiasK
Moderator

@AllenBarnett, please check whether this is the correct NIC for Intel MPI to use:

enp68s0

The pinning output on jaguar seems to be corrupted. What kind of platform are you using?

[0] MPI startup(): 0       0          enp68s0
[0] MPI startup(): 1       0          enp68s0
[0] MPI startup(): ===== CPU pinning =====
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       404457   tapir      {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,32,33,34,35,36,37,38,39,40,41,42,43,44,45,
                                 46,47}
[0] MPI startup(): 1       404458   tapir      {16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,48,49,50,51,52,53,54,55,56,57,58
                                 ,59,60,61,62,63}
[0] MPI startup(): 2       823415   jaguar     {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
[0] MPI startup(): 3       -224450721            {0,1,2,5,6,8,10,18,19,20,21,22,23,24,28,29,30,34,35,36,37,38,39,40,41,42,43,44,45
                                 ,48,50,51,52,53,54,55,56,57,59,60,61,62,63}

Is passwordless SSH enabled?

You may also retry with the latest 2021.14 release; see the system requirements:
https://www.intel.com/content/www/us/en/developer/articles/system-requirements/mpi-library-system-requirements.html
