Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Fault tolerance in MPI programs

jackyjngwn
Hi,

I am working on a fault-tolerant MPI program. My goal is that when one of the compute nodes fails due to a network or hardware issue, the other nodes are not affected.

I am currently using MPI::ERRORS_THROW_EXCEPTIONS and MPI::Exception to catch MPI errors. To test the program, I kill one of the MPI processes during execution, but then all the other processes abort too. Does this mean my program is not handling MPI exceptions correctly, or should I test the program some other way?

Thanks!
1 Reply
kalloyd
There has been quite a lot of discussion of this topic on the Open MPI developer site. The consensus is that it is better to fail the entire mpiexec job than to deal with indeterminate behavior. Recall that some behavior (such as the sm shared-memory transport) will affect results on any and all other procs.