Hello All,
I have a cluster of 13 nodes running.
While my jobs were running on the cluster, one of my nodes shut down.
I would expect the MPI processes across the entire cluster to be killed. Instead, the remaining processes keep running, and because the data pipeline is broken, the nodes run out of memory.
Intel MPI Version: Intel MPI 5.1.2.
Expected Behavior: If one node shuts down, all running MPI processes should go down.
Actual Behavior: The other MPI processes keep running.
Note: If I kill my MPI process on any node, the rest of the processes in the cluster are killed.
Hi,
Ideally, the execution on all nodes should be killed, but that is not happening in your case.
So please check the value of the environment variable I_MPI_FAULT_CONTINUE. If it is set to ON, your program will continue to execute even if one of your nodes fails.
Also, please update your MPI version; many problems have been fixed in the latest release of Intel MPI.
And let us know if you still see the same behavior with the latest MPI release.
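One way to run the check above on each node is a small script like the following. This is only a minimal sketch: the helper name `fault_continue_enabled` is hypothetical, and the set of strings treated as "enabled" is an assumption for illustration, so compare it against the documentation for your Intel MPI version.

```python
import os

# Assumption: these spellings are treated as "enabled" for illustration only.
# The thread itself only mentions the value ON.
ENABLED_VALUES = {"on", "enable", "yes", "1"}


def fault_continue_enabled(env=None):
    """Return True if I_MPI_FAULT_CONTINUE looks enabled in the given environment.

    `env` is any mapping of variable names to values; it defaults to the
    environment of the current process.
    """
    env = os.environ if env is None else env
    value = env.get("I_MPI_FAULT_CONTINUE", "")
    return value.strip().lower() in ENABLED_VALUES


if __name__ == "__main__":
    raw = os.environ.get("I_MPI_FAULT_CONTINUE")
    if raw is None:
        print("I_MPI_FAULT_CONTINUE is not set")
    else:
        print(f"I_MPI_FAULT_CONTINUE={raw!r}; enabled={fault_continue_enabled()}")
```

Run it in the same shell or job script that launches the MPI job, since the variable may be set there and not in your login environment.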
Warm Regards,
Abhishek
Hello,
The "I_MPI_FAULT_CONTINUE" variable is not set, and it is disabled by default.
Is there any open issue with Intel MPI 5.1.2?
Hi,
Please update to the latest Intel MPI (2019 Update 8). We won't be able to help you much with the MPI version you are using, as there have been many fixes since 5.1.2.
Thank You,
Abhishek
Hi,
Please give us an update on your issue with the latest MPI versions.
Warm Regards,
Abhishek
We are assuming that the solution provided helped and will no longer be monitoring this issue. Please raise a new thread if you have further issues.
Thank you
I was looking for the concepts that Intel MPI 5.1 follows for handling process termination during a node failure.
Even if there is no support, understanding the principle will help us make the correct decision.
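To make the question concrete: a killed process is usually noticed right away because its connections close, while a node that powers off sends nothing, so the failure is typically only detectable through a timeout. The sketch below illustrates that general idea with a heartbeat monitor. It is not a description of how Intel MPI 5.1 works internally; the class `HeartbeatMonitor`, the node names, and the 5-second timeout are all hypothetical.

```python
import time


class HeartbeatMonitor:
    """Track the last time each peer was heard from and flag silent peers."""

    def __init__(self, peers, timeout_s, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock  # injectable so the logic can be exercised without waiting
        now = clock()
        self._last_seen = {peer: now for peer in peers}

    def beat(self, peer):
        """Record that a heartbeat message arrived from `peer`."""
        self._last_seen[peer] = self._clock()

    def dead_peers(self):
        """Return peers that have been silent for longer than the timeout."""
        now = self._clock()
        return sorted(
            peer
            for peer, seen in self._last_seen.items()
            if now - seen > self._timeout_s
        )


if __name__ == "__main__":
    fake_now = [0.0]  # simulated clock, in seconds
    monitor = HeartbeatMonitor(
        ["node01", "node02"], timeout_s=5.0, clock=lambda: fake_now[0]
    )

    fake_now[0] = 4.0
    monitor.beat("node01")  # node01 reports in; node02 has gone silent

    fake_now[0] = 6.0
    print(monitor.dead_peers())  # node02 has exceeded the timeout
```

In an application-level workaround, the code that sees a non-empty `dead_peers()` result would be the place to abort the job instead of letting the surviving ranks run out of memory.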