I would really appreciate some help. Does Intel MPI support fault tolerance (run-through stabilisation) for multiple program, multiple data (MPMD) applications?
I have read the Intel MPI fault tolerance documentation. I am running a master-worker application, where the master and worker code are separate and where there is no communication amongst workers. My launch command looks like this:
mpirun -perhost 10 -f /home/john/Application/src/hostfile_intel \
-n 1 -env I_MPI_FABRICS shm:ofa -env I_MPI_OFA_NUM_ADAPTERS 2 \
-env I_MPI_OFA_RAIL_SCHEDULER ROUND_ROBIN -env I_MPI_FAULT_CONTINUE on ./Master : \
-n 9 -env I_MPI_FABRICS shm:ofa -env I_MPI_OFA_NUM_ADAPTERS 2 \
-env I_MPI_OFA_RAIL_SCHEDULER ROUND_ROBIN -env I_MPI_FAULT_CONTINUE on ./Worker
Does Intel MPI support this type of fault tolerance in terms of run-through stabilisation? I don't want the whole MPI job to crash if a single process crashes. Currently, it doesn't seem to be working. If I kill a process, the complete MPI job terminates with the error:
APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
Your help will be appreciated.
Yes, right after calling MPI_Init, I set the error handler. I'm not sure what you mean by "appropriately" handling errors. Currently, whenever I perform a send or receive, I have the following piece of code:
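err = MPI_Recv(data, BUFFER_SIZE, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
if (err != MPI_SUCCESS) {
    /* Translate the implementation-specific code into a standard error class and a message. */
    MPI_Error_class(err, &err_class);
    MPI_Error_string(err, err_str, &err_len);
    printf("Receive error %d: %s\n", err_class, err_str); fflush(stdout);
}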
So I just print the error if there is one. I never see this error printout before the MPI job fails. After a receive gives an error, it is possible for my application to call the same function again, but shouldn't that just also return with an error?
Also, is it possible to reuse MPI_ANY_SOURCE after a process in MPI_COMM_WORLD has failed?
Your help is greatly appreciated!
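One way to act on a receive error, rather than only printing it, is for the master to remember which worker ranks have failed and to stop communicating with them. The following is a minimal sketch of that bookkeeping in plain C, with no MPI calls in it. The names worker_table, table_init, table_mark_failed, table_alive_count and table_next_alive are hypothetical and not part of MPI or Intel MPI, and MAX_RANKS is an assumed bound. Whether MPI_ANY_SOURCE stays usable after a failure is not settled in this thread, so the sketch assumes the master talks to each worker by explicit rank and therefore knows which rank an error belongs to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_RANKS 64  /* assumed upper bound on job size for this sketch */

/* Hypothetical bookkeeping kept by the master (rank 0).
 * alive[r] is true while worker rank r is still believed reachable. */
typedef struct {
    int  size;               /* total ranks in the job, master included */
    bool alive[MAX_RANKS];
} worker_table;

/* Mark worker ranks 1..size-1 as alive. Returns false if size is out of range. */
static bool table_init(worker_table *t, int size)
{
    if (t == NULL || size < 1 || size > MAX_RANKS)
        return false;
    t->size = size;
    for (int r = 0; r < MAX_RANKS; r++)
        t->alive[r] = (r >= 1 && r < size);
    return true;
}

/* Record that a send or receive involving `rank` returned an error.
 * Requests for the master (rank 0) or an invalid rank are ignored. */
static void table_mark_failed(worker_table *t, int rank)
{
    assert(t != NULL);
    if (rank >= 1 && rank < t->size)
        t->alive[rank] = false;
}

/* Number of workers still considered alive. */
static int table_alive_count(const worker_table *t)
{
    assert(t != NULL);
    int n = 0;
    for (int r = 1; r < t->size; r++)
        if (t->alive[r])
            n++;
    return n;
}

/* Next alive worker after `rank`, wrapping around the worker ranks.
 * Passing 0 starts the search at rank 1. Returns -1 if no worker is alive. */
static int table_next_alive(const worker_table *t, int rank)
{
    assert(t != NULL);
    int workers = t->size - 1;
    if (workers < 1)
        return -1;
    if (rank < 0 || rank >= t->size)
        rank = 0;
    for (int step = 1; step <= workers; step++) {
        int r = 1 + (rank - 1 + step) % workers;
        if (t->alive[r])
            return r;
    }
    return -1;
}
```

In a master loop, the idea would be to call table_mark_failed(&t, rank) whenever a send or receive addressed to that rank returns something other than MPI_SUCCESS, hand the lost work to table_next_alive, and stop once table_alive_count reaches zero.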
It looks like you are doing what you need to be doing. I'll see if I can reproduce the behavior here and let you know the results.
Technical Consulting Engineer
Intel® Cluster Tools