I am new to the forum and have a question regarding network failures and MPI applications (specifically using the Intel MPI binding).
What happens if I have a number of processes running on a cluster and someone unplugs a network cable? As far as I have read, the MPI processes get terminated immediately. How can I circumvent this, say by using some sort of WAIT or TIMEOUT command when a network fault is detected, so that the processes can check whether the connection recovers after a set number of seconds?
This question is probably too involved to handle in this section of the forums. I believe a built-in checkpoint capability for Intel MPI is under consideration, but it would be more than a year away. In the meantime, many applications implement their own recovery options. You might ask on the HPC forum whether any additional fault-tolerance features are expected in the near term.
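To illustrate what an application-level recovery option can look like, here is a minimal sketch of checkpoint/restart in plain Python. It does not use MPI at all and is not an Intel MPI feature; the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format, and the `fail_at` parameter that simulates a network failure are all assumptions made up for this example. The idea is only that the program periodically saves its state, so that a job killed by a fault can be relaunched and continue from the last saved step instead of starting over.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, checkpoint_every=10, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every few steps.

    fail_at is a hypothetical hook that simulates the job being killed
    (for example by a network fault) when that step is reached.
    """
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated network failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

In a real MPI job, each rank would presumably write its own checkpoint file (or the ranks would coordinate a shared one), and the job would be resubmitted after the failure; the work done since the last checkpoint is recomputed on restart.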