Hi,
I am working on a fault-tolerant MPI program. My goal is that when one of the compute nodes fails because of a network or hardware issue, the other nodes are not affected.
I am currently using MPI::ERRORS_THROW_EXCEPTIONS and MPI::Exception to catch MPI errors. To test my program, I kill one of the MPI processes during execution, but then all the other processes abort too. Does this mean that my program is not handling MPI exceptions correctly, or should I test the program some other way?
Thanks!
You probably need to set I_MPI_FAULT_CONTINUE=on (if you are using Intel MPI Library). By default, Intel MPI Library aborts the whole application if one of the processes stops.
Please read chapter 5 of the Reference Manual for more details.
Regards!
Dmitry
Thanks for the reply. I did read that chapter and followed the instructions, but the program still exited when one of the processes was killed. Here is what I did:
1. Set the error handler: MPI::COMM_WORLD.Set_errhandler(MPI::ERRORS_THROW_EXCEPTIONS);
2. Wrap each MPI call: try { /* MPI function */ }
catch (MPI::Exception e) { /* print error string */ }
3. Pass -env I_MPI_FAULT_CONTINUE on to mpirun.
Thanks
I_MPI_FAULT_CONTINUE works only with the MPI_ERRORS_RETURN error handler.
So, with the existing implementation of the Intel MPI Library, your application (which relies on MPI::ERRORS_THROW_EXCEPTIONS) won't work.
Regards!
Dmitry
Take a look at the presentation.
Other implementations might work better.
Regards!
Dmitry
Like you, I'm working on a fault-tolerant MPI program. The MPI_ERRORS_RETURN approach described above doesn't work if a failure occurs. What you need is a fault-tolerance mechanism in your application, e.g., a failure detector and recovery. I built one, but I have trouble clearing all the messages in the communication channels and recovering the application.
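A minimal sketch of what such a failure detector could look like, assuming a simple heartbeat-with-timeout scheme. This is not the implementation described above: the class name `HeartbeatDetector`, its methods, and the timeout are hypothetical, and the sketch leaves out how heartbeats travel between processes as well as the recovery step.

```cpp
#include <algorithm>
#include <chrono>
#include <unordered_map>
#include <vector>

// Hypothetical helper, not part of MPI or of Intel MPI Library.
// Tracks the last time each rank was heard from and reports the ranks
// that have been silent for longer than the configured timeout.
class HeartbeatDetector {
public:
    using Clock = std::chrono::steady_clock;
    using TimePoint = Clock::time_point;

    explicit HeartbeatDetector(std::chrono::milliseconds timeout)
        : timeout_(timeout) {}

    // Record that a heartbeat from `rank` arrived at time `now`.
    void heartbeat(int rank, TimePoint now) { last_seen_[rank] = now; }

    // Return the ranks whose last heartbeat is older than the timeout,
    // sorted in ascending order. Ranks never heard from are not tracked.
    std::vector<int> suspected(TimePoint now) const {
        std::vector<int> out;
        for (const auto& entry : last_seen_) {
            if (now - entry.second > timeout_) {
                out.push_back(entry.first);
            }
        }
        std::sort(out.begin(), out.end());
        return out;
    }

private:
    std::chrono::milliseconds timeout_;
    std::unordered_map<int, TimePoint> last_seen_;
};
```

Passing the time in as an argument, rather than reading the clock inside the class, keeps the detector easy to test. A suspected rank is only a suspicion: a slow process looks the same as a dead one, so the timeout has to be chosen with the network in mind.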
Suppose this situation: you have two nodes, A and B, and A sends messages to B. B fails at some point after a message from A to B has already been sent. In this case, TCP retransmits the message up to a certain number of times (15 by default). I had to increase this number because, once the limit is reached, TCP reports an error and passes it to the MPI layer. I noticed that Intel MPI then aborts the application. Open MPI does not behave this way.
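For a rough sense of how long TCP keeps retrying before it reports the error, the arithmetic can be sketched as below. This assumes the limit of 15 refers to Linux's `net.ipv4.tcp_retries2` setting, and that the retransmission timeout starts at 200 ms, doubles after each attempt, and is capped at 120 s. The real timeout depends on the measured round-trip time, so treat the result as an estimate only. The function name `tcp_give_up_ms` is made up for illustration.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: estimates how long (in milliseconds) TCP waits
// before giving up on an unacknowledged segment, assuming exponential
// backoff between rto_min_ms and rto_max_ms (assumed Linux defaults).
std::int64_t tcp_give_up_ms(int retries,
                            std::int64_t rto_min_ms = 200,
                            std::int64_t rto_max_ms = 120000) {
    std::int64_t total = 0;
    std::int64_t rto = rto_min_ms;
    // One wait after the original transmission, plus one per retransmission.
    for (int i = 0; i <= retries; ++i) {
        total += rto;
        rto = std::min(rto * 2, rto_max_ms);  // double, but never exceed the cap
    }
    return total;
}
```

Under these assumptions, `tcp_give_up_ms(15)` is 924600 ms, i.e. roughly 15 minutes during which A sees no error at all. That is why a failure detector in the application is usually faster than waiting for TCP.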
If you need more details, ask me. I know that what I have written doesn't help you a lot, but I'm trying to show you the challenges that you will face.
Hugs,
matheusbersot.
