jackyjngwn
Beginner
36 Views

fault tolerance in MPI programs

Hi,

I am working on a fault-tolerant MPI program. My goal is that when one of the compute nodes fails due to a network or hardware issue, the other nodes are not affected.

I am currently using MPI::ERRORS_THROW_EXCEPTIONS and MPI::Exception to catch MPI errors. To test my program, I kill one of the MPI processes during execution, but then all the other processes abort too. Does this mean that my program is not handling MPI exceptions correctly, or should I test the program some other way?

Thanks!
1 Reply
kalloyd
Beginner
36 Views

There has been quite a lot of discussion on the Open MPI developer site on this topic. The consensus is that it is better to fail the entire mpiexec job than to deal with indeterminate behavior. Recall that some components (like sm, the shared-memory transport) will affect results on any and all other procs.