Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPI Reduce Hangs

AllenBarnett
Beginner
1,683 Views

Hi:

I have the latest oneAPI hpckit (2024.2.1) installed on two machines running Pop!_OS (which is some form of Ubuntu 22.04 LTS). This C++ program:

#include <array>
#include <cstdio>
#include <mpi.h>

std::array<char, MPI_MAX_PROCESSOR_NAME> host;
int host_len{ 0 };  // output parameter of MPI_Get_processor_name
int rank{ 0 };
int contribution, total;

int main( int argc, char* argv[] )
{
  MPI_Init( &argc, &argv );
  MPI_Get_processor_name( host.data(), &host_len );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );
  printf( "Rank %3d on host %s\n", rank, host.data() );
  // Each rank contributes its rank number; rank 0 receives the sum.
  contribution = rank;
  MPI_Reduce( &contribution, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD );
  if ( rank == 0 ) {
    printf( "Sum: %d\n", total );
  }
  MPI_Barrier( MPI_COMM_WORLD );
  MPI_Finalize();
  return 0;
}

hangs in the MPI_Reduce when the number of processes exceeds a certain size. For example:

mpirun -np 32 -ppn 16 -host node0,node1 ./example

works fine, but

mpirun -np 64 -ppn 32 -host node0,node1 ./example

 

hangs, with all 64 processes at 100% CPU utilization across both nodes.

I tried this program with Open MPI 4.1.2 and it appears to work correctly for all -np values.
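One way to see where the spinning ranks are stuck is to attach a debugger to one of them and dump its stack (a sketch; it assumes gdb is installed on the nodes and that pgrep's newest match for ./example is one of the hung ranks):

```shell
# Attach to the most recently started ./example process, print all
# thread backtraces, then detach without killing the job.
gdb -batch -p "$(pgrep -n -f ./example)" -ex "thread apply all bt"
```

A backtrace that sits inside the MPI progress engine (e.g. libfabric polling loops) rather than in application code generally points at a transport-level problem rather than an application deadlock.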


How can I diagnose this issue?

 

Thanks,

Allen

5 Replies
TobiasK
Moderator
1,674 Views

Hi @AllenBarnett,

Strictly speaking, you are running an unsupported OS.

 

You may find some hints in the debug output. Can you execute the IMB-MPI1 benchmarks that we ship? Please post the output of:

I_MPI_DEBUG=10 I_MPI_HYDRA_DEBUG=1 mpirun -host node0,node1 -np 64 -ppn 32 IMB-MPI1
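libfabric's own logging can additionally show which provider and network interface each node selects (a sketch; FI_LOG_LEVEL is a standard libfabric variable, and head just trims the very verbose output):

```shell
# Log libfabric provider/interface selection for a small cross-node run.
FI_LOG_LEVEL=debug I_MPI_DEBUG=10 mpirun -host node0,node1 -np 4 -ppn 2 IMB-MPI1 2>&1 | head -n 100
```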

  

AllenBarnett
Beginner
1,640 Views

Hi @TobiasK: I can run IMB-MPI1 on either machine by itself and it works OK. But when I run across both machines, it appears to hang before even reaching the first test, even with just a couple of processes. See the attached output, which was just "-np 4 -ppn 2".

The processes run at 100% CPU; I had to Ctrl-C to stop them.

What OSes are officially supported?

Thanks,
Allen

Kevin_McGrattan
1,432 Views

I have a similar issue. I have oneAPI 2024.2 with Intel MPI 2021.13 installed on a new Linux cluster running Red Hat 9. We've had a problem with large jobs failing, and most often the point of failure is an MPI_ALLREDUCE. I created a smaller 4-process test case which I run across four 64-core nodes, launching 64 identical jobs simultaneously. I added print statements before and after the MPI call. I generally see 5 to 10 jobs out of the 64 hang. The print statements indicate that all 4 processes make the call to MPI_ALLREDUCE, but only 1, 2, or 3 return. This does not happen right away; these jobs can run thousands of iterations successfully before the hang. The failures do not occur if all 4 processes are assigned to the same node. If I split the job across two or four nodes, the failures occur every time I run the test.

I did a test where I broke the ALLREDUCE into a REDUCE + BCAST. The result is the same, but I can see that the root process of the REDUCE call does not return when the failure occurs.
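The split described above corresponds to something like this (a sketch, not the actual test harness; the buffer contents and root rank are placeholders):

```cpp
#include <cstdio>
#include <mpi.h>

int main( int argc, char* argv[] )
{
    MPI_Init( &argc, &argv );
    int rank{ 0 };
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    double local{ 1.0 }, global{ 0.0 };
    // MPI_Allreduce( &local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD )
    // decomposed into an explicit reduce-to-root plus broadcast:
    MPI_Reduce( &local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );
    MPI_Bcast( &global, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD );

    if ( rank == 0 ) printf( "global sum: %g\n", global );
    MPI_Finalize();
    return 0;
}
```

Printing immediately before and after each of the two calls, as described, is what narrows the hang down to the MPI_Reduce on the root.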

I ran this test with both MPICH and Open MPI, and those runs were successful.

AllenBarnett
Beginner
1,414 Views

Hi @Kevin_McGrattan: I don't have anything to add; I've been working on other things. Thanks for the confirmation, though.

TobiasK
Moderator
1,243 Views

@AllenBarnett please check that enp68s0 is actually the NIC that Intel MPI should be using.
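If the wrong interface is being picked up, it can be pinned explicitly (a sketch; I_MPI_HYDRA_IFACE selects the interface for the Hydra launcher, while FI_SOCKETS_IFACE only applies if the OFI sockets provider is in use):

```shell
# Force the launcher and (for the sockets provider) OFI traffic onto
# enp68s0; adjust the provider-specific variable to match your provider.
I_MPI_HYDRA_IFACE=enp68s0 FI_SOCKETS_IFACE=enp68s0 \
    mpirun -host node0,node1 -np 4 -ppn 2 IMB-MPI1
```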

 

The pinning output on jaguar seems to be corrupted. What kind of platform are you using?

[0] MPI startup(): 0       0          enp68s0
[0] MPI startup(): 1       0          enp68s0
[0] MPI startup(): ===== CPU pinning =====
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       404457   tapir      {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,32,33,34,35,36,37,38,39,40,41,42,43,44,45,
                                 46,47}
[0] MPI startup(): 1       404458   tapir      {16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,48,49,50,51,52,53,54,55,56,57,58
                                 ,59,60,61,62,63}
[0] MPI startup(): 2       823415   jaguar     {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
[0] MPI startup(): 3       -224450721            {0,1,2,5,6,8,10,18,19,20,21,22,23,24,28,29,30,34,35,36,37,38,39,40,41,42,43,44,45
                                 ,48,50,51,52,53,54,55,56,57,59,60,61,62,63}

Is passwordless SSH enabled between the nodes?

 

You may also retry with the latest 2021.14 release. The supported operating systems are listed here:
https://www.intel.com/content/www/us/en/developer/articles/system-requirements/mpi-library-system-requirements.html

 
