Intel® MPI Library

Node shuts down in the middle of MPI process running

tamilalagan
Novice

Hello All,

I have a cluster of 13 nodes running.

While my jobs were running on the cluster, one of the nodes shut down.

I would expect every MPI process across the cluster to be killed. Instead, the remaining processes keep running, and because the data pipeline is now broken, the surviving nodes run out of memory.

Intel MPI Version: Intel MPI 5.1.2.

Expected Behavior: If one node shuts down, all MPI processes in the job should be terminated.

Actual Behavior: The MPI processes on the other nodes keep running.

Note: If I kill my MPI process on any one node, the remaining processes in the cluster are killed as well.

AbhishekD_Intel
Moderator

Hi,

 

Ideally, the processes on all nodes should be killed, but that is not what you are seeing.

Please check the value of the environment variable I_MPI_FAULT_CONTINUE. If it is set to ON, your program will continue to execute even if one of your nodes fails.
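For anyone who wants to script that check, here is a minimal sketch in Python. It only reads the environment it is started in, so it should be run from the same shell or job script that launches the MPI job. The helper name `fault_continue_enabled` is made up for illustration, and the set of spellings treated as "on" (enable, yes, on, 1) is an assumption; this thread only mentions ON. Treating an unset variable as off is also an assumption.

```python
import os

# Assumption: these spellings count as "on". The thread only mentions ON.
TRUTHY = {"enable", "yes", "on", "1"}


def fault_continue_enabled(env=None):
    """Return True if I_MPI_FAULT_CONTINUE is set to an 'on' value.

    `env` is any mapping of variable names to values; it defaults to
    the environment of the current process.
    """
    env = os.environ if env is None else env
    value = env.get("I_MPI_FAULT_CONTINUE")
    if value is None:
        # Assumption: an unset variable means the feature is off.
        return False
    return value.strip().lower() in TRUTHY


if __name__ == "__main__":
    state = "ON" if fault_continue_enabled() else "off or unset"
    print(f"I_MPI_FAULT_CONTINUE is {state}")
```

If this reports ON, the surviving ranks continuing after a node failure would match the behavior described above.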


Also, please update your MPI version; many problems have been fixed in the latest release of Intel MPI.

Let us know whether you see the same behavior with the latest MPI release.



Warm Regards,

Abhishek


tamilalagan
Novice

Hello,

I_MPI_FAULT_CONTINUE is not set, and it is disabled by default.

Is there any open issue with Intel MPI 5.1.2?

AbhishekD_Intel
Moderator

Hi,

Please update to the latest Intel MPI (2019 Update 8). We won't be able to help you much with the MPI version you are using, as there have been many fixes since 5.1.2.

Thank You,

Abhishek

 

AbhishekD_Intel
Moderator

Hi,


Please give us an update on your issue with the latest MPI versions.



Warm Regards,

Abhishek


AbhishekD_Intel
Moderator

We are assuming that the solution provided helped, and we will no longer monitor this thread. Please raise a new thread if you have further issues.


Thank you


tamilalagan
Novice

I was looking for the concepts that Intel MPI 5.1 follows for handling process termination when a node fails.

Even if this version is no longer supported, understanding the principle will help us make the correct decision.
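This thread does not establish what Intel MPI 5.1 does internally when a node disappears, so the following is not a description of the library. It is only a sketch of the general fail-fast principle behind the expected behavior: each process tracks when it last heard from its peers and aborts the job if one goes silent for too long. The class `HeartbeatWatchdog`, the function `check_or_abort`, the `abort` hook, and the node names are all hypothetical and chosen for illustration.

```python
import time


class HeartbeatWatchdog:
    """Tracks the last heartbeat seen from each peer and reports stale ones."""

    def __init__(self, peers, timeout_s, clock=time.monotonic):
        self._clock = clock
        self._timeout_s = timeout_s
        now = clock()
        # Every peer is considered alive at start-up.
        self._last_seen = {peer: now for peer in peers}

    def beat(self, peer):
        """Record that a heartbeat from `peer` has just arrived."""
        self._last_seen[peer] = self._clock()

    def stale_peers(self):
        """Return the peers whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(
            peer
            for peer, seen in self._last_seen.items()
            if now - seen > self._timeout_s
        )


def check_or_abort(watchdog, abort):
    """Call the hypothetical `abort` hook if any peer has gone silent."""
    stale = watchdog.stale_peers()
    if stale:
        abort(stale)
    return stale
```

In a real MPI program the `abort` hook could wrap `MPI_Abort` on `MPI_COMM_WORLD`, and the heartbeats would have to travel over a channel that does not block when a peer dies. Both of those are design assumptions on my part, not something confirmed in this thread.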
