Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

How does Intel MPI handle network failures

dludick
Beginner
Hi all,

I am new to the forum and have a question regarding network failures and MPI applications (specifically using the Intel MPI binding).

What happens if I have a number of processes running on a cluster and someone unplugs a network cable? As far as I have read, the MPI processes get terminated immediately. How can I circumvent this, say by using some sort of WAIT or TIMEOUT command when a network fault is detected, so that the processes can check whether they can recover after a set number of seconds?

Any help would be very much appreciated!
TimP
Honored Contributor III
This question is probably too involved to handle in this section of the forums. I believe a built-in checkpoint capability for Intel MPI is under consideration, but it would be more than a year away. Many applications have their own recovery options. You might ask on the HPC forum whether any additional fault tolerance features are expected in the near term.
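As a rough illustration of what an application-level recovery option can look like, here is a minimal checkpoint/restart sketch in plain Python. It is not Intel MPI functionality and makes no MPI calls. The names `save_checkpoint`, `load_checkpoint`, `run`, and the `fail_at` parameter are hypothetical, and the "work" is a placeholder sum. The assumption is that in a real MPI job each rank would write its own checkpoint file and the whole job would be relaunched after the failure, resuming from the last saved step.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: temp file first, then rename over the old file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path, n_steps, checkpoint_every=10, fail_at=None):
    """Run n_steps of placeholder work, resuming from the checkpoint at path.

    fail_at (hypothetical, for demonstration only) simulates a failure
    such as a lost network link by raising at the given step.
    """
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]  # placeholder for real computation
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

Running `run` a second time with the same path after a failure picks up from the last checkpoint rather than from step 0, so only the work done since that checkpoint is repeated. The checkpoint interval trades I/O overhead against the amount of work lost per failure.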