I am working on a fault-tolerant MPI program. My goal is that when one of the compute nodes fails due to a network or hardware issue, the other nodes are not affected.
I am currently using MPI::ERRORS_THROW_EXCEPTIONS and MPI::Exceptions to catch MPI errors. To test my program, I kill one of the MPI processes during execution, but then all the other processes abort too. Does this mean that my program is not handling MPI exceptions correctly, or should I test the program some other way?
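There has been quite a lot of discussion on the OpenMPI Developer site on this topic. The consensus is that it is better to fail the entire mpiexec than to deal with indeterminate behavior. Recall that some components (such as sm, the shared-memory transport) couple processes together, so a failure in one process can affect results on any and all other procs. In other words, the abort you see is intended behavior, not necessarily a sign that your exception handling is wrong.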