Intel® MPI Library

Does mpirun kill the remaining ranks on other nodes when a core dump occurs in one rank?

mpi_new_user
New Contributor II

Hello,

I want to use Intel Distribution for GDB to debug core dump files. When I use mpirun to run my program on multiple nodes with core dump generation enabled, the results confuse me.

Sometimes, the core dump files were generated correctly.

 

Other times, core dump files were generated only on node c2, and the ranks on c1 were killed without producing core dump files, as shown in the figure below.

mpi_new_user_1-1652782541511.png

 

So, does mpirun have a mechanism to kill other ranks or processes when a core dump is happening, like Open MPI?

 Thanks.

 

 

 

 

mpi_new_user
New Contributor II

mpi_new_user_0-1652783015099.png

 

Adding the figure for "Sometimes, the core dump files were generated correctly."

SantoshY_Intel
Moderator

Hi,


Thank you for posting in Intel Communities.


Could you please provide the following details so that we can investigate your issue further?

  1. OS details
  2. The Intel MPI library version you are using.
  3. Sample reproducer code to try reproducing your issue from our end.
  4. Did you get any dump files generated after running the program using I_MPI_DEBUG_COREDUMP=1?


Thanks & Regards,

Santosh


mpi_new_user
New Contributor II

1. OS details

My OS is CentOS 8.3.

2. The Intel MPI library version you are using.

The Intel MPI library version is 2021.5.

3. Sample reproducer code to try reproducing your issue from our end.

The code:

/* File: mpi_sum.c
 * Compile as: mpicc -g -Wall -std=c99 -o mpi_sum mpi_sum.c -lm
 * Description: An MPI solution to sum a 1D array. */
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>
#include <time.h>

int main(int argc, char *argv[]) {
    int myID, numProcs;           // this rank's ID and the total number of ranks
    double localSum;              // partial sum computed by this rank
    double parallelSum;           // sum of all localSum values, collected on rank 0
    int length = 10000000;        // total number of elements to sum
    double Fact = 1;
    int i;                        // loop index
    clock_t clockStart, clockEnd; // timer
    srand(5);                     // seed the random number generator
    MPI_Init(NULL, NULL);         // initialize MPI
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs); // get the number of ranks
    MPI_Comm_rank(MPI_COMM_WORLD, &myID);     // get this rank's ID
    localSum = 0.0;               // each rank starts from 0
    int A = (length / numProcs)*((long)myID);     // start of this rank's slice
    int B = (length / numProcs)*((long)myID + 1); // end of this rank's slice

    A++; // shift the slice by one element
    B++;

    clockStart = clock(); // start the timer
    for (i = A; i < B; i++)
    {
        // Integer division: 1 / myID divides by zero on rank 0
        // (undefined behavior; typically SIGFPE on x86 Linux).
        Fact = (1 / myID - 1/numProcs) / (1 - 1/numProcs);
        localSum += Fact;
    }

    MPI_Reduce(&localSum, &parallelSum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    clockEnd = clock();

    if (myID == 0)
    {
        printf("Time to sum %d floats with MPI in parallel %3.5f seconds\n", length, (clockEnd - clockStart) / (float)CLOCKS_PER_SEC);
        printf("The parallel sum: %f\n", parallelSum + 1);
    }

    MPI_Finalize();
    return 0;
}

 

4. Did you get any dump files generated after running the program using I_MPI_DEBUG_COREDUMP=1?

Yes, I got a core dump file after setting I_MPI_DEBUG_COREDUMP=1. However, when I run the program on multiple nodes, only one node generates a core dump file, and only one file; the other ranks do not produce any core dump files.

mpi_new_user_0-1652924905132.png

 

 

 

mpi_new_user
New Contributor II

Hello,

I retested by running the same program (mpi_sum) with Open MPI.

It works as expected and generates multiple core dump files (each rank generates one core dump file).

So, could you check whether Intel MPI has a limit on generating core dump files?

Thanks .

SantoshY_Intel
Moderator

Hi,

 

We were able to reproduce your issue on our end using the Intel MPI Library 2021.6 on an Ubuntu 18.04 machine, as shown in the screenshot below:

Screenshot 2022-05-20 181200.png

We will check internally if Intel MPI has any limitations to generate the core dump file and we will get back to you soon.

 

Meanwhile, could you please provide the steps (compilation and execution commands) you followed when testing with Open MPI?

 

 

Thanks & Regards,

Santosh

 

 

 

SantoshY_Intel
Moderator

Hi,


We haven't heard back from you.


>>"I retest it and run the program(mpi_sum) by using OpenMPI."

Could you please provide the steps that you followed to test your application using OpenMPI for generating core dump files for each rank?


Thanks & Regards,

Santosh


mpi_new_user
New Contributor II

Hi,

Sorry for my late reply.

The compilation command is the same for Open MPI.

The run command is:

mpiexec -np 4 --host c1,c2 ./mpi_sum

After running the command, four core dump files are generated in the directory.

 

Thanks.

 

SantoshY_Intel
Moderator

Hi,

 

We have tested your sample program using OpenMPI, but we were not able to get the core dump files.

Please refer to the screenshot below for the steps that we followed:

Screenshot 2022-05-30 143811.png

 

However, we tried another approach:

  1. Compile the MPI code using the Intel MPI Library's mpiicc.
    mpiicc sample.c
  2. Run the executable using mpirun/mpiexec from the Open MPI runtime environment.
    I_MPI_DEBUG_COREDUMP=1 mpirun -n 2 ./a.out
  3. With this setup, we get a core dump file for each rank.

 

Please refer to the below screenshot for the steps that we followed:

SantoshY_Intel_0-1653903551280.png

working.png

 

Thanks & Regards,

Santosh

 

mpi_new_user
New Contributor II

Thanks for showing me another approach; Open MPI does generate a core dump file for each rank.

If you find out what limits Intel MPI's core dump generation, please let me know.

Thank you very much.

SantoshY_Intel
Moderator

Hi,

 

As you can see from the screenshot below, when one of the processes terminates, the remaining processes are killed as a consequence. In other words, Intel MPI does have a mechanism to kill the other ranks or processes when a core dump occurs.

MicrosoftTeams-image.png

 

Thanks & Regards,

Santosh

 

 


 

 

mpi_new_user
New Contributor II

Hi,

Thank you for your reply; it helps me a lot.

 

 

 

 

 

 

SantoshY_Intel
Moderator

Hi,


Could you please confirm whether we can close this issue?


Thanks & Regards,

Santosh



SantoshY_Intel
Moderator

Hi,


We assume that your issue is resolved. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.


Thanks & Regards,

Santosh

