Hi,
I am working on a fault-tolerant MPI program. My goal is that when one of the compute nodes fails because of a network or hardware issue, the other nodes are not affected.
I am currently setting the MPI::ERRORS_THROW_EXCEPTIONS error handler and catching MPI::Exception to handle MPI errors. To test my program, I kill one of the MPI processes during execution, but then all the other processes abort too. Does this mean that my program is not handling MPI exceptions correctly, or should I test the program some other way?
Thanks!
1 Reply
There has been quite a lot of discussion of this topic on the Open MPI developer site. The consensus is that it is better to fail the entire mpiexec job than to deal with indeterminate behavior. Recall that some components (like sm, the shared-memory transport) can affect results on any and all other processes.
