Hi:
I have the latest oneAPI HPC Toolkit (2024.2.1) installed on two machines running Pop!_OS (which is derived from Ubuntu 22.04 LTS). This C++ program:
#include <array>
#include <cstdio>
#include <mpi.h>

std::array<char, MPI_MAX_PROCESSOR_NAME> host;
int host_len = static_cast<int>( host.size() );
int rank{ 0 };
int contribution, total;

int main( int argc, char* argv[] )
{
    MPI_Init( &argc, &argv );
    MPI_Get_processor_name( host.data(), &host_len );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    printf( "Rank %3d on host %s\n", rank, host.data() );

    contribution = rank;
    MPI_Reduce( &contribution, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD );
    if ( rank == 0 ) {
        printf( "Sum: %d\n", total );
    }

    MPI_Barrier( MPI_COMM_WORLD );
    MPI_Finalize();
    return 0;
}
hangs in MPI_Reduce when the number of processes exceeds a certain count. For example:
mpirun -np 32 -ppn 16 -host node0,node1 ./example
works fine. But
mpirun -np 64 -ppn 32 -host node0,node1 ./example
hangs, with all 64 processes on both nodes spinning at 100% CPU utilization.
I tried this program with Open MPI 4.1.2 and it appears to work correctly for all -np values.
How can I diagnose this issue?
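One common first step for diagnosing a hang like this is to attach a debugger to one of the spinning ranks and inspect the backtrace (the PID below is a placeholder for one of the hung ./example processes, not a value from this thread):
gdb --batch -p <pid-of-a-hung-rank> -ex bt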
Thanks,
Allen
Strictly speaking, you are running an unsupported OS, but you may still find some hints.
Can you execute the IMB-MPI1 benchmarks that we ship? Please post the output of:
I_MPI_DEBUG=10 I_MPI_HYDRA_DEBUG=1 mpirun -host node0,node1 -np 64 -ppn 32 IMB-MPI1
Hi @TobiasK: I can run IMB-MPI1 on either machine by itself and it works fine. But when I run across both machines, it appears to hang before reaching even the first test, with just a couple of processes. See the attached output, which was from just "-np 4 -ppn 2".
The processes run at 100% CPU; I had to Ctrl-C to stop them.
What OSes are officially supported?
Thanks,
Allen
I have a similar issue. I have oneAPI/2024.2 with Intel MPI 2021.13 installed on a new Linux cluster running Red Hat 9. We have had a problem with large jobs failing, and most often the point of failure is an MPI_ALLREDUCE. I created a smaller 4-process test case, which I run across four 64-core nodes, launching 64 identical jobs simultaneously, with print statements added before and after the MPI call. I generally see 5 to 10 of the 64 jobs hang. The print statements indicate that all 4 processes make the call to MPI_ALLREDUCE, but only 1, 2, or 3 return. This does not happen right away; these jobs can run thousands of iterations successfully before the hang. The failures do not occur if all 4 processes are assigned to the same node, but if I split the job across two or four nodes, the failures occur every time I run the test.
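For reference, a minimal sketch of that kind of test loop (the iteration count, the integer payload, and all names here are illustrative assumptions, not the actual test case):

// Repeated MPI_Allreduce with prints before and after the call, so a hang
// shows which ranks entered the call but never returned.
#include <cstdio>
#include <mpi.h>

int main( int argc, char* argv[] )
{
    MPI_Init( &argc, &argv );
    int rank = 0;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    for ( int iter = 0; iter < 100000; ++iter ) {   // illustrative iteration count
        int contribution = rank, total = 0;
        printf( "Rank %d entering MPI_Allreduce, iteration %d\n", rank, iter );
        fflush( stdout );
        MPI_Allreduce( &contribution, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD );
        printf( "Rank %d returned from MPI_Allreduce, iteration %d\n", rank, iter );
        fflush( stdout );
    }

    MPI_Finalize();
    return 0;
}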
I did a test where I broke the ALLREDUCE into a REDUCE + BCAST. The result is the same, but I can see that it is the root process of the REDUCE call that does not return when the failure occurs.
I ran this test with both MPICH and Open MPI, and in both cases it completed successfully.
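As a side note, the ALLREDUCE into REDUCE + BCAST split described above amounts to something like the following sketch (root rank 0 and an integer sum are assumptions):

#include <mpi.h>

// Reduce to a root and then broadcast the result, which produces the same
// value on every rank as MPI_Allreduce with MPI_SUM. In the failing runs
// described above, it is the MPI_Reduce on the root rank that never returns.
int allreduce_sum_split( int contribution, MPI_Comm comm )
{
    const int root = 0;   // assumed root rank
    int total = 0;
    MPI_Reduce( &contribution, &total, 1, MPI_INT, MPI_SUM, root, comm );
    MPI_Bcast( &total, 1, MPI_INT, root, comm );
    return total;
}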
Hi @Kevin_McGrattan: I don't have anything to add; I've been working on other things. Thanks for the confirmation, though.
@AllenBarnett please check whether enp68s0 is the correct NIC for Intel MPI to use.
The CPU pinning output on jaguar also seems to be corrupted (see rank 3 below). What kind of platform do you use?
[0] MPI startup(): 0 0 enp68s0
[0] MPI startup(): 1 0 enp68s0
[0] MPI startup(): ===== CPU pinning =====
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 404457 tapir {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47}
[0] MPI startup(): 1 404458 tapir {16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63}
[0] MPI startup(): 2 823415 jaguar {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
[0] MPI startup(): 3 -224450721 {0,1,2,5,6,8,10,18,19,20,21,22,23,24,28,29,30,34,35,36,37,38,39,40,41,42,43,44,45,48,50,51,52,53,54,55,56,57,59,60,61,62,63}
Is passwordless ssh enabled between the nodes?
You may also retry with the latest 2021.14 release. The officially supported operating systems are listed here:
https://www.intel.com/content/www/us/en/developer/articles/system-requirements/mpi-library-system-requirements.html
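If the interface does turn out to be the problem, one hypothetical way to point both the Hydra launcher and the OFI tcp provider at enp68s0 is shown below; whether these variables apply depends on the Intel MPI and libfabric versions in use.
I_MPI_HYDRA_IFACE=enp68s0 FI_TCP_IFACE=enp68s0 mpirun -host node0,node1 -np 4 -ppn 2 IMB-MPI1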