John_Gilmore · ‎01-11-2013

Hi,

I would really appreciate some help. I would like to know whether Intel MPI supports fault tolerance (run-through stabilisation) for multiple program, multiple data (MPMD) applications.

I have read the Intel MPI fault tolerance documentation. I am running a master-worker application in which the master and worker code are separate and there is no communication among workers. My launch command looks like this:

mpirun -perhost 10 -f /home/john/Application/src/hostfile_intel \
-n 1 -env I_MPI_FABRICS shm:ofa -env I_MPI_OFA_NUM_ADAPTERS 2 \
-env I_MPI_OFA_RAIL_SCHEDULER ROUND_ROBIN -env I_MPI_FAULT_CONTINUE on ./Master : \
-n 9 -env I_MPI_FABRICS shm:ofa -env I_MPI_OFA_NUM_ADAPTERS 2 \
-env I_MPI_OFA_RAIL_SCHEDULER ROUND_ROBIN -env I_MPI_FAULT_CONTINUE on ./Worker

Does Intel MPI support this type of fault tolerance, in terms of run-through stabilisation? I don't want the whole MPI job to crash if a single process crashes. Currently it doesn't seem to be working: if I kill a process, the complete MPI job terminates with the error:
APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)

Your help will be appreciated.

James_T_Intel · ‎01-11-2013

Hi John,

Have you set the error handler to MPI_ERRORS_RETURN in your program? Are you handling errors within your program appropriately, to ensure that communications with a failed worker do not continue?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
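As an illustration of what "communications with a failed worker do not continue" could look like on the master side, here is a minimal sketch of the bookkeeping only. It is not taken from the thread or from Intel's documentation: `rank_table`, `rank_table_record`, `MAX_RANKS` and the assumption that rank 0 is the master are all hypothetical names and choices. In a real program the `ok` argument would be the result of comparing the MPI return code against MPI_SUCCESS.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RANKS 64  /* hypothetical upper bound for this sketch */

/* Hypothetical master-side table of ranks that are still safe to talk to. */
typedef struct {
    int  size;               /* total number of ranks, master included */
    bool failed[MAX_RANKS];  /* failed[r] becomes true once rank r reported an error */
} rank_table;

void rank_table_init(rank_table *t, int size)
{
    assert(size > 0 && size <= MAX_RANKS);
    t->size = size;
    for (int r = 0; r < size; ++r)
        t->failed[r] = false;
}

/* Record the outcome of one point-to-point call involving `rank`.
 * `ok` stands for (err == MPI_SUCCESS) in the real program.
 * A failure is sticky: once a rank is marked failed it stays failed.
 * Returns true if the master may keep communicating with `rank`. */
bool rank_table_record(rank_table *t, int rank, bool ok)
{
    assert(rank >= 0 && rank < t->size);
    if (!ok)
        t->failed[rank] = true;
    return !t->failed[rank];
}

/* Count workers still considered alive (rank 0 is assumed to be the master). */
int rank_table_alive_workers(const rank_table *t)
{
    int n = 0;
    for (int r = 1; r < t->size; ++r)
        if (!t->failed[r])
            ++n;
    return n;
}
```

The master would call `rank_table_record` after every send or receive and skip any rank for which it returns false, instead of only printing the error and carrying on.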

John_Gilmore · ‎01-13-2013

Hi James,

Yes, right after calling MPI_Init, I set the error handler. I'm not sure what you mean by handling errors "appropriately". Currently, whenever I perform a send or receive, I have the following piece of code:

err = MPI_Recv(data, BUFFER_SIZE, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
MPI_Error_class(err, &err_class);
if (err_class != MPI_SUCCESS)
{
    MPI_Error_string(err, err_str, &err_len);
    printf("Receive error %d: %s\n", err_class, err_str);
    fflush(stdout);
}

So I just print the error if there is one. I never see this error printout before the MPI job fails. After a receive returns an error, my application may call the same function again, but shouldn't that call simply return an error as well?

Also, is it possible to reuse MPI_ANY_SOURCE after a process in MPI_COMM_WORLD has failed?
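The thread does not answer whether MPI_ANY_SOURCE remains usable after a failure. One way to sidestep the question, sketched here under stated assumptions, is to name an explicit source rank for each receive and rotate over the workers still believed healthy. The function `next_source`, the `alive` array and the convention that rank 0 is the master are hypothetical and not part of any MPI API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Pick the next worker rank to receive from, scanning round-robin
 * starting just after `last`.
 * alive[r] is true while rank r is believed healthy; rank 0 is assumed
 * to be the master and is never returned.
 * Returns -1 when no live worker is left. */
int next_source(const bool *alive, int nranks, int last)
{
    assert(alive != NULL && nranks >= 1);
    assert(last >= 0 && last < nranks);
    /* Visit every rank once, ending with `last` itself. */
    for (int step = 1; step <= nranks; ++step) {
        int r = (last + step) % nranks;
        if (r != 0 && alive[r])
            return r;
    }
    return -1;
}
```

The returned rank would replace MPI_ANY_SOURCE in the receive call. Because a blocking receive from one specific worker stalls while that worker has nothing to send, a real master would probably pair this with a non-blocking check such as MPI_Iprobe before committing to a receive.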

Your help is greatly appreciated!
John

James_T_Intel · ‎01-15-2013

Hi John,

It looks like you are doing what you need to be doing. I'll see if I can reproduce the behavior here and let you know the results.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

James_T_Intel · ‎01-15-2013

Hi John,

Can you please send a reproducer program? I am unable to reproduce this behavior.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

MPI MPMD fault tolerance support