Hi all,
I am new to the forum and have a question regarding network failures and MPI applications (specifically using the Intel MPI binding).
What happens if I have a number of processes running on a cluster and someone unplugs a network cable? As far as I have read, the MPI processes get terminated immediately. How can I circumvent this, say by using some sort of WAIT or TIMEOUT command when a network fault is detected, so that the processes can check whether the connection recovers within a set number of seconds?
Any help would be very much appreciated!
1 Reply
This question is probably too involved to handle in this section of the forums. I believe a built-in checkpoint capability for Intel MPI is under consideration, but it would be more than a year away. Many applications have their own recovery options. You might ask on the HPC forum whether any additional fault tolerance features are expected in the near term.
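To illustrate what an application's "own recovery option" might look like, here is a minimal sketch of application-level checkpoint and restart. It is not an Intel MPI feature and does not use MPI at all; the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format, and the `fail_at` parameter used to simulate a fault are all assumptions made for illustration. The idea is only that a job which is killed by a network fault can be relaunched and resume from the last saved step rather than from the beginning.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically: temp file first, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path, n_steps, checkpoint_every=10, fail_at=None):
    """Sum the step indices, saving progress every `checkpoint_every` steps.

    `fail_at` is a hypothetical hook that simulates the job being killed.
    """
    state = load_checkpoint(path)
    for step in range(state["step"], n_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += step
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

In a real MPI job each rank would typically write its own checkpoint file, and the job script would relaunch the application after a failure; work done since the last checkpoint is lost and recomputed.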