Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

How does Intel MPI handle network failures?

dludick
Hi all,

I am new to the forum and have a question about network failures and MPI applications (specifically when using the Intel MPI implementation).

What happens if I have a number of processes running on a cluster and someone unplugs a network cable? As far as I have read, the MPI processes get terminated immediately. How can I work around this, for example with some sort of WAIT or TIMEOUT setting that takes effect when a network fault is detected, so that the processes can check whether they are able to recover after a set number of seconds?

Any help would be very much appreciated!
TimP
This question is probably too involved to handle in this section of the forums. I believe a built-in checkpoint capability for Intel MPI is under consideration, but it would be more than a year away. Many applications implement their own recovery options. You might ask on the HPC forum whether any additional fault tolerance features are expected in the near term.
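
To illustrate what an application's "own recovery options" can look like, here is a minimal sketch of application-level checkpoint/restart. It is not an Intel MPI feature and uses no MPI calls: it only shows the general idea of saving progress periodically so that a job killed by a network fault can be relaunched and resume from the last saved step. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format, and the `fail_at` parameter used to simulate a failure are all assumptions made for illustration. In a real MPI job each rank would typically write its own file, or one rank would gather and write the state.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path, n_steps, checkpoint_every=10, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every few steps.

    fail_at (hypothetical, for demonstration only) raises an error at that
    step to stand in for the job being killed by a network failure.
    """
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]  # the "work" for this step
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If the first launch dies partway through, calling `run` again with the same checkpoint path picks up from the last saved step rather than from step 0. Work done after the last checkpoint is repeated, so the checkpoint interval is a trade-off between I/O cost and lost work.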