
Fault Tolerance Question


Hello there,

I am trying to run some fault-tolerance experiments with MPI in Fortran, but I'm having trouble. I am calling the routine


which seems to work, more or less. After calling, for instance, MPI_SENDRECV, the variable STATUS does not report any error, i.e. STATUS(MPI_ERROR) is always zero. The ierr integer may be nonzero, though, and that is what I've been trying to catch instead.
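For readers following along, here is a minimal sketch of the ierr check described above. It assumes the error handler on MPI_COMM_WORLD has been switched to MPI_ERRORS_RETURN via MPI_COMM_SET_ERRHANDLER; that call is my assumption and not necessarily the routine used in this post. The program name, buffer sizes, and ring pattern are hypothetical, and the receive buffer is deliberately too small so that the call fails. As far as I understand the MPI standard, the error field of STATUS is only filled in by the multiple-completion calls (MPI_WAITALL and friends), which would explain why STATUS(MPI_ERROR) stays zero here.

```fortran
program errhandler_sketch
  use mpi
  implicit none
  integer :: ierr, ierr2, rank, nprocs, dest, src, reslen
  integer :: status(MPI_STATUS_SIZE)
  integer :: sendbuf(4), recvbuf(2)   ! receive buffer deliberately too small
  character(len=MPI_MAX_ERROR_STRING) :: msg

  call MPI_INIT(ierr)

  ! Assumption: return error codes to the caller instead of aborting.
  call MPI_COMM_SET_ERRHANDLER(MPI_COMM_WORLD, MPI_ERRORS_RETURN, ierr)

  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  ! Simple ring: send to the right neighbour, receive from the left one.
  dest = mod(rank + 1, nprocs)
  src  = mod(rank - 1 + nprocs, nprocs)
  sendbuf = rank

  ! Four integers arrive, but only two fit, so a truncation error is expected.
  call MPI_SENDRECV(sendbuf, 4, MPI_INTEGER, dest, 0, &
                    recvbuf, 2, MPI_INTEGER, src,  0, &
                    MPI_COMM_WORLD, status, ierr)

  ! Check the return code rather than status(MPI_ERROR).
  if (ierr /= MPI_SUCCESS) then
     call MPI_ERROR_STRING(ierr, msg, reslen, ierr2)
     print *, 'rank ', rank, ': MPI_SENDRECV failed: ', msg(1:reslen)
  end if

  call MPI_FINALIZE(ierr)
end program errhandler_sketch
```

This only covers errors raised inside an MPI call; it does nothing for a process that is killed from outside or crashes between calls.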

Regardless of what I try to catch, some things still seem to trigger termination of the whole job. Things that successfully report an error without aborting execution include receiving a larger buffer, sending a smaller buffer, sending to the wrong communicator, and so on.

But externally killing one of the processes kills all of them. If a segmentation fault occurs in one process, all of them are aborted. I am not sure I can blame the MPI implementation here, because these events do not happen inside any MPI call. I was simply expecting one process to stop without interfering with the others; perhaps all of them would run into trouble once they tried to communicate, since the dead process would not reply. Instead, SIGKILL takes all of them down.

Is there any way to kill one process without killing the others with current MPI implementations? Perhaps an environment variable, or some special option passed to mpirun/mpiexec?

I'm using ifort 15.0.3 (also tried 14) and Intel MPI 5.1.0b (also tried 5.0.0 and 5.0.3) on x86_64 Linux.
