MPI fault tolerance

Jimmy821 · ‎01-06-2009

I am relatively new to MPI programming.

I am wondering how I can start up each process manually within the same MPI communicator world space.

In addition, when only a single process fails, how can this be detected and relaunched automatically, without crashing the other processors and the host?

I hope somebody can advise on these two issues. Thanks.

ClayB · ‎01-06-2009

Jimmy82 -

It's been a while since I've done any serious MPI, but I've always started the entire set of processes at the outset of the computation (and they all start in the same communicator, MPI_COMM_WORLD). I thought there was an MPI-2 API that allowed you to launch additional processes from within an executing application, but I've never used it and really don't know if this is more than just a figment of my imagination.

Fault tolerance is not part of the MPI standard. If a process dies, it's dead. The programmer would need to put in some kind of logic to ping processes every so often to see if they are all still alive. If there is no reply, the unfinished computation can be reassigned. Of course, what is the difference between a slow connection and a dead process? How long does the process doing the pinging wait until it declares a process dead?
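The ping-and-timeout bookkeeping described above can be sketched in plain C, independent of any MPI library. This is only an illustration under stated assumptions: the names `heartbeat_table`, `heartbeat_init`, `heartbeat_reply`, and `heartbeat_check` are hypothetical, and the caller supplies the timestamps. In an MPI program they could come from `MPI_Wtime()`, with replies collected through whatever messaging the application already uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WORKERS 64  /* arbitrary cap chosen for this sketch */

/* Hypothetical bookkeeping kept by the process doing the pinging. */
typedef struct {
    size_t count;                      /* number of workers being watched */
    double timeout;                    /* seconds of silence before a worker is presumed dead */
    double last_heard[MAX_WORKERS];    /* time of the most recent reply from each worker */
    bool   presumed_dead[MAX_WORKERS]; /* true once a worker has exceeded the timeout */
} heartbeat_table;

/* Start watching `count` workers; all are treated as heard from at time `now`. */
void heartbeat_init(heartbeat_table *t, size_t count, double timeout, double now)
{
    assert(count <= MAX_WORKERS);
    t->count = count;
    t->timeout = timeout;
    for (size_t i = 0; i < count; i++) {
        t->last_heard[i] = now;
        t->presumed_dead[i] = false;
    }
}

/* Record a ping reply. A late reply revives the worker: it was slow, not dead. */
void heartbeat_reply(heartbeat_table *t, size_t worker, double now)
{
    assert(worker < t->count);
    t->last_heard[worker] = now;
    t->presumed_dead[worker] = false;
}

/* Mark workers that have been silent for longer than the timeout.
 * Writes the indices of workers newly presumed dead into `newly_dead`
 * (which must hold at least t->count entries) and returns how many there are,
 * so the caller can reassign their unfinished work exactly once. */
size_t heartbeat_check(heartbeat_table *t, double now, size_t *newly_dead)
{
    size_t n = 0;
    for (size_t i = 0; i < t->count; i++) {
        if (!t->presumed_dead[i] && now - t->last_heard[i] > t->timeout) {
            t->presumed_dead[i] = true;
            newly_dead[n++] = i;
        }
    }
    return n;
}
```

The `timeout` value is exactly the judgment call raised above: too short and a slow connection is mistaken for a dead process, too long and lost work sits idle before it is reassigned.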

I'm a bit surprised by your questions. All of these issues seem pretty advanced for someone starting out with MPI. Are you curious or do you have a use for these "features"?

--clay

ClayB · ‎01-06-2009

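I pulled out my references. Jimmy82 is probably looking for MPI_COMM_SPAWN, which launches additional copies of one executable from within a running application, or MPI_COMM_SPAWN_MULTIPLE, which launches several different executables (or the same executable with different arguments) in one call. In both cases the newly created processes get an MPI_COMM_WORLD of their own rather than joining the existing one, and the call returns an intercommunicator connecting the original processes to the new ones. If a single communicator spanning both groups is needed, the intercommunicator can be merged with MPI_INTERCOMM_MERGE.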

I've been told that Intel MPI does not currently support dynamic process creation. So, double check with the MPI library that you are using to be sure the desired functionality is supported.

--clay

Jimmy821 · ‎01-06-2009

I do have a need for such functionality: I want to prevent my program from crashing when one of the processes (launched by MPI) terminates unexpectedly. I previously had a project that used MPI, and when one of the processes failed, the entire chain of applications crashed. I want to avoid such cases.

Thanks for helping!