Fault-tolerant MPI program

jackyjngwn · ‎03-10-2011

Hi,

I am working on a fault-tolerant MPI program. My goal is that when one of the compute nodes fails due to a network or hardware issue, the other nodes are not affected.

I am now using MPI::ERRORS_THROW_EXCEPTIONS and MPI::Exception to catch MPI errors. To test my program, I kill one of the MPI processes during execution. But then all the other processes abort too. Does this mean that my program is not handling MPI exceptions correctly? Or should I test the program some other way?

Thanks!

Dmitry_K_Intel2 · ‎03-14-2011

Hi,

You probably need to set I_MPI_FAULT_CONTINUE=on (if you are using the Intel MPI Library). By default, the Intel MPI Library simply aborts an application if one of its processes stops.

Please read chapter 5 of the Reference Manual for more details.

Regards!
Dmitry

jackyjngwn · ‎03-14-2011

Thanks for the reply. I did read that chapter and followed the instructions, but the program still exited when one of the processes was killed. Below is what I have done:

1. Set the error handler: MPI::COMM_WORLD.Set_errhandler(MPI::ERRORS_THROW_EXCEPTIONS);
2. Wrap each MPI call: try { MPI function } catch (MPI::Exception e) { print error string }
3. Pass -env I_MPI_FAULT_CONTINUE on to mpirun.

Thanks

Dmitry_K_Intel2 · ‎03-18-2011

Fault tolerance has not been implemented for MPI::ERRORS_THROW_EXCEPTIONS yet.
It works only with the MPI_ERRORS_RETURN error handler.

So, with the existing implementation of the Intel MPI Library, your application won't work.

Regards!
Dmitry

jackyjngwn · ‎03-21-2011

Then can I mix MPI_ERRORS_RETURN with C++? Thanks

Dmitry_K_Intel2 · ‎03-22-2011

It's hardly possible to use MPI_ERRORS_RETURN with C++, at least with the Intel MPI Library.
Take a look at the presentation.

Other implementations might work better.

Regards!
Dmitry

matheusbersot · ‎04-22-2011

jackyjngwn,

Like you, I'm working on a fault-tolerant MPI program. What you described about MPI_ERRORS_RETURN doesn't work if a failure occurs. What you have to do is build a fault-tolerance mechanism into your application, e.g., a failure detector plus recovery. I did that, but I have trouble clearing all the messages left in the communication channels and recovering the application.
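To make the "failure detector" idea a bit more concrete, here is a minimal sketch of a heartbeat-based detector. It is only an illustration under my own assumptions: the class name HeartbeatDetector, its methods, and the timeout policy are hypothetical and are not part of MPI or of any MPI library. How the heartbeats are actually exchanged between ranks is left out.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <vector>

// Hypothetical heartbeat-based failure detector (not an MPI API).
// Every peer is expected to report in periodically; a peer whose last
// heartbeat is older than `timeout` is suspected to have failed.
class HeartbeatDetector {
public:
    using Clock = std::chrono::steady_clock;
    using TimePoint = Clock::time_point;

    explicit HeartbeatDetector(std::chrono::milliseconds timeout)
        : timeout_(timeout) {
        assert(timeout.count() > 0);
    }

    // Record that a heartbeat from `rank` arrived at time `now`.
    void heartbeat(int rank, TimePoint now) { last_seen_[rank] = now; }

    // Return the ranks (in ascending order) whose last heartbeat is
    // older than the timeout at time `now`.
    std::vector<int> suspected(TimePoint now) const {
        std::vector<int> out;
        for (const auto& entry : last_seen_) {
            if (now - entry.second > timeout_) {
                out.push_back(entry.first);
            }
        }
        return out;
    }

private:
    std::chrono::milliseconds timeout_;
    std::map<int, TimePoint> last_seen_;
};
```

Timestamps are passed in explicitly rather than read from the clock inside the class, so the detector can be exercised without waiting in real time. A suspected rank is only a suspicion: a slow or partitioned node looks the same as a dead one, which is part of what makes recovery hard.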

Suppose this situation: you have 2 nodes, A and B, and A sends messages to B. At some point B fails, after a message from A to B has already been sent. In this case, TCP will retransmit the message up to a certain number of times, 15 by default. I had to increase this number, because once the 15 retries are used up, TCP returns an error and passes it to the MPI layer. I noticed that Intel MPI then aborts the application. This behaviour doesn't occur with Open MPI.
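To give a feel for how long the retransmission phase can last, here is a rough back-of-the-envelope sketch. It assumes a Linux host, that the "15" above is the net.ipv4.tcp_retries2 setting, and an idealized retransmission timeout that starts at 200 ms, doubles on every attempt, and is capped at 120 s. The function name and default parameters are hypothetical, and real timeouts depend on the measured round-trip time.

```cpp
#include <algorithm>
#include <cassert>

// Idealized estimate (in seconds) of how long TCP keeps retransmitting
// before it reports an error, assuming exponential backoff.
// Assumptions: initial timeout rto_min, doubled after each attempt,
// capped at rto_max. Actual kernel behaviour depends on the measured RTT.
double tcp_give_up_seconds(int retries,
                           double rto_min = 0.2,
                           double rto_max = 120.0) {
    assert(retries >= 0);
    double total = 0.0;
    double rto = rto_min;
    // One wait after the original send, plus one after each retransmission.
    for (int i = 0; i <= retries; ++i) {
        total += std::min(rto, rto_max);
        rto *= 2.0;
    }
    return total;
}
```

Under these assumptions, tcp_give_up_seconds(15) comes out at about 924.6 seconds, i.e. roughly 15 minutes before the error reaches the layer above TCP. That is why a failure detector inside the application is usually needed instead of waiting for TCP to give up.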

If you need more details, ask me. I know that what I wrote doesn't help you a lot, but I'm trying to show you the challenges that you will face.

Hugs,
matheusbersot.