Detecting process failure using Intel MPI

fgpassos
Hi,

I'm trying to use MPI_Errhandler_set on the intercommunicator returned by MPI_Comm_spawn. I would like to detect the failure of the spawned process and react to it.

My testing code is:

------------------
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>   /* sleep() */

int main(int argc, char ** argv){

    MPI_Comm comm_parent, intercomm;
    int err, errRecv;
    int v = 0;
    MPI_Status status;
    MPI_Info info;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&comm_parent);

    if(comm_parent == MPI_COMM_NULL){

        /* Parent: spawn one child on the given host */
        MPI_Info_create(&info);
        MPI_Info_set(info, "host", "192.168.0.2");

        printf("Parent creates child...\n");
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, info, 0, MPI_COMM_SELF, &intercomm, &err);

        /* Ask MPI to return error codes on the intercommunicator */
        MPI_Errhandler_set(intercomm, MPI_ERRORS_RETURN);

        printf("Waiting...\n");
        errRecv = MPI_Recv(&v, 1, MPI_INT, 0, 0, intercomm, &status);

        if(errRecv != MPI_SUCCESS){
            printf("Error detected!\n");
            fflush(stdout);
        }

    }
    else{

        /* Child: wait long enough to be killed by hand, then send */
        sleep(60);
        MPI_Send(&v, 1, MPI_INT, 0, 0, comm_parent);

    }

    printf("Finalize\n");
    MPI_Finalize();
    return(0);

}

------------------

I typed in a terminal:

$ export I_MPI_FAULT_CONTINUE=on
$ mpicc test.c -o test -Wall
$ mpirun -np 1 ./test 1


In another terminal, I killed the child process, and the parent process stopped without printing the remaining messages (from printf). The only output is:

$ mpirun -np 1 ./test 1
Parent creates child...
Waiting...
$


I was expecting the program to continue and print "Error detected!" and "Finalize".
Why doesn't it happen?

Thanks,
Fernanda Oliveira
Dmitry_K_Intel2
Hi Fernanda,

According to the documentation, fault tolerance works only for master-slave applications, and only for processes whose rank is not 0. You also need to set the error handler on MPI_COMM_WORLD. It may not be obvious, but this means that the fault tolerance feature won't work for spawned processes (you can see that both processes in your case have rank 0).
You just need to modify your program:
MPI_Init(&argc, &argv);
MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

And start 2 processes: mpirun -np 2 ...
You also need to change (with rank obtained from MPI_Comm_rank on MPI_COMM_WORLD)
if(comm_parent == MPI_COMM_NULL){
to
if(rank == 0){

Working with spawned processes is a very difficult task, and I'd recommend avoiding this scheme of MPI programming.

Regards!
Dmitry
David_M_13
Our testing has shown that, in addition to the restrictions on fault tolerance mentioned in the MPI reference guide, it only works when the slave processes only send and the master receives with MPI_Waitany on a vector of receive objects. When an error is returned for one of those objects, it must be removed from the vector and never waited on again. All other scenarios we tried resulted in various system failures.