Hi,
I'm trying to use MPI_Errhandler_set in the communicator from a process created by MPI_Comm_spawn. I would like to detect process failure and do something about it.
My testing code is:
------------------
#include <stdio.h>
#include <unistd.h>   /* sleep */
#include <mpi.h>

int main(int argc, char ** argv){
    MPI_Comm comm_parent, intercomm;
    int err, errRecv;
    int v = 0;
    MPI_Status status;
    MPI_Info info;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&comm_parent);
    if(comm_parent == MPI_COMM_NULL){
        /* Parent: spawn one child on the given host */
        MPI_Info_create(&info);
        MPI_Info_set(info, "host", "192.168.0.2");
        printf("Parent creates child...\n");
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, info, 0, MPI_COMM_SELF, &intercomm, &err);
        /* Ask for error codes instead of aborting on this intercommunicator
           (MPI_Errhandler_set is the deprecated name of MPI_Comm_set_errhandler) */
        MPI_Errhandler_set(intercomm, MPI_ERRORS_RETURN);
        printf("Waiting...\n");
        errRecv = MPI_Recv(&v, 1, MPI_INT, 0, 0, intercomm, &status);
        if(errRecv != MPI_SUCCESS){
            printf("Error detected!\n");
            fflush(stdout);
        }
    }
    else{
        /* Child: sleep long enough to be killed by hand, then send */
        sleep(60);
        MPI_Send(&v, 1, MPI_INT, 0, 0, comm_parent);
    }
    printf("Finalize\n");
    MPI_Finalize();
    return(0);
}
------------------
I typed in a terminal:
$ export I_MPI_FAULT_CONTINUE=on
$ mpicc test.c -o test -Wall
$ mpirun -np 1 ./test 1
In another terminal, I killed the child process, and the parent process stopped without printing the remaining messages (from printf). The only output is:
$ mpirun -np 1 ./test 1
Parent creates child...
Waiting...
$
I was expecting the program to continue and print "Error detected!" and "Finalize".
Why doesn't it happen?
Thanks,
Fernanda Oliveira
2 Replies
Hi Fernanda,
According to the documentation, fault tolerance works only for master-slave programs, and only for processes whose rank is not 0. You also need to set the error handler on MPI_COMM_WORLD. It may not be obvious, but this means the fault tolerance feature won't work for spawned processes (you can see that both processes in your case have rank 0).
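You just need to modify your program:
MPI_Init(&argc, &argv);
MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
And start 2 processes: mpirun -np 2 ...
So you also need to change
if(comm_parent == MPI_COMM_NULL){
to
if(rank == 0){
where rank is obtained with MPI_Comm_rank(MPI_COMM_WORLD, &rank). With both processes started by mpirun, the MPI_Comm_spawn call is presumably no longer needed, and the send and receive would go over MPI_COMM_WORLD between ranks 1 and 0.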
Working with spawned processes is a very difficult task, and I'd recommend avoiding this scheme of MPI programming.
Regards!
Dmitry
Our testing has shown that, in addition to the restrictions on fault tolerance mentioned in the MPI reference guide, it also only works when the slave processes only send and the master receives with MPI_Waitany on a vector of receive requests. When an error is received, that request must be eliminated from the vector and not used again. All other scenarios we tried resulted in various system failures.
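A rough sketch of the pattern described above might look like the following. It is an illustration under assumptions, not a tested recipe for Intel MPI: the names `drop_slot`, `master_loop`, `req_t` and the `USE_MPI` build flag are made up for this example, each worker is assumed to send exactly one MPI_INT with tag 0, and MPI_ERRORS_RETURN is assumed to be set on MPI_COMM_WORLD already. It also assumes that MPI_Waitany reports the index of the failed request when it returns an error, which may not hold in every implementation.

```c
#include <assert.h>
#include <stdio.h>

#ifdef USE_MPI                /* hypothetical flag: mpicc -DUSE_MPI ... */
#include <mpi.h>
typedef MPI_Request req_t;
#else
typedef int req_t;            /* stand-in so the bookkeeping builds without MPI */
#endif

/* Remove slot idx from the parallel arrays by moving the last live
 * entry into its place. Returns the new number of live slots. */
int drop_slot(req_t *reqs, int *workers, int n, int idx)
{
    assert(idx >= 0 && idx < n);
    reqs[idx] = reqs[n - 1];
    workers[idx] = workers[n - 1];
    return n - 1;
}

#ifdef USE_MPI
/* Master side: post one receive per worker, then wait on the whole
 * vector with MPI_Waitany. bufs is indexed by worker rank; reqs and
 * workers hold n live entries. Returns the number of messages received. */
int master_loop(int *bufs, req_t *reqs, int *workers, int n)
{
    int received = 0;

    for (int i = 0; i < n; i++)
        MPI_Irecv(&bufs[workers[i]], 1, MPI_INT, workers[i], 0,
                  MPI_COMM_WORLD, &reqs[i]);

    while (n > 0) {
        int idx = MPI_UNDEFINED;
        MPI_Status st;
        int rc = MPI_Waitany(n, reqs, &idx, &st);

        if (idx == MPI_UNDEFINED)
            break;            /* cannot tell which request is affected */

        if (rc == MPI_SUCCESS) {
            printf("worker %d sent %d\n", workers[idx], bufs[workers[idx]]);
            received++;
        } else {
            /* Assumption: idx identifies the request that failed. */
            printf("worker %d failed\n", workers[idx]);
        }
        /* Completed or failed, this request is never waited on again. */
        n = drop_slot(reqs, workers, n, idx);
    }
    return received;
}
#endif
```

Built without the flag, only `drop_slot` is compiled, so the vector bookkeeping can be checked on its own. Built with `mpicc -DUSE_MPI`, `master_loop` would be called on rank 0 with `workers` holding ranks 1 to size-1.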
