tamilalagan
Novice
277 Views

Node shuts down in the middle of MPI process running

Hello All,

I have a cluster of 13 nodes running.

While my jobs were running on the cluster, one of my nodes shut down.

I would expect the MPI processes running across the entire cluster to be killed. Instead, the remaining processes keep running, and because the data pipeline is broken, the nodes run out of memory.

Intel MPI Version: Intel MPI 5.1.2.

Expected Behavior: If one node shuts down, all the running MPI processes should go down.

Actual Behavior: The other MPI processes keep running.

 

Note: If I kill my MPI process on any node, the rest of the processes in the cluster are killed.

6 Replies
AbhishekD_Intel
Moderator
263 Views

Hi,

 

Ideally, execution on all nodes should be killed, but that is not happening in your case.

So please check the value of the environment variable I_MPI_FAULT_CONTINUE. If its value is ON, your program will continue to execute even if one of your nodes fails.
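A quick way to run this check from inside the job environment is sketched below. It is only an illustration: the helper name `fault_continue_enabled` is hypothetical, and the set of values treated as "on" (`enable`, `yes`, `on`, `1`) is an assumption that should be confirmed against the reference manual for your Intel MPI version.

```python
import os

# Values assumed to switch the feature on; verify against the
# Intel MPI reference for your release before relying on this.
ENABLED_VALUES = {"enable", "yes", "on", "1"}


def fault_continue_enabled(env=None):
    """Return True if I_MPI_FAULT_CONTINUE looks enabled in `env`.

    `env` defaults to the current process environment. An unset or
    empty variable is treated as disabled.
    """
    env = os.environ if env is None else env
    value = env.get("I_MPI_FAULT_CONTINUE", "").strip().lower()
    return value in ENABLED_VALUES


if __name__ == "__main__":
    state = "enabled" if fault_continue_enabled() else "not enabled"
    print("I_MPI_FAULT_CONTINUE is " + state)
```

Run it in the same shell or job script that launches the MPI program, since the variable may be set there and not in your login shell.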


Also, please update your MPI version; many problems have been fixed in the latest release of Intel MPI.

And let us know if you are getting this same error using the latest MPI release.



Warm Regards,

Abhishek


tamilalagan
Novice
258 Views

Hello,

I_MPI_FAULT_CONTINUE is not set, and it is disabled by default.

Is there any open issue with Intel MPI 5.1.2?

AbhishekD_Intel
Moderator
235 Views

Hi,

Please update to the latest Intel MPI (2019 Update 8). We won't be able to help you much with the MPI version you are using, as there have been many fixes since 5.1.2.

 

 

Thank You,

Abhishek

 

AbhishekD_Intel
Moderator
197 Views

Hi,


Please give us an update on your issue with the latest MPI versions.



Warm Regards,

Abhishek


AbhishekD_Intel
Moderator
176 Views

We are assuming that the solution provided helped, and we will no longer be monitoring this issue. Please raise a new thread if you have further issues.


Thank you


tamilalagan
Novice
108 Views

I was looking for the concepts Intel MPI 5.1 follows for handling process termination during a node failure.

Even if there is no support, understanding the principle will help us make the correct decision.
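Independent of how Intel MPI 5.1 behaves internally (which this thread does not establish), one application-level way to stop surviving ranks from stalling until they run out of memory is a heartbeat timeout: each rank records when it last received data and aborts the job if nothing arrives for too long. The sketch below is a generic illustration, not Intel MPI functionality; the class `HeartbeatWatchdog` and its parameters are hypothetical names.

```python
import time


class HeartbeatWatchdog:
    """Fire a callback when no heartbeat has arrived within a timeout.

    Hypothetical helper for illustration only.
    """

    def __init__(self, timeout_s, on_timeout, clock=time.monotonic):
        self.timeout_s = timeout_s    # allowed silence, in seconds
        self.on_timeout = on_timeout  # called when the timeout expires
        self.clock = clock            # injectable clock, eases testing
        self.last_beat = clock()

    def beat(self):
        """Record that data was just received from the pipeline."""
        self.last_beat = self.clock()

    def check(self):
        """Call periodically; returns True if the timeout fired."""
        if self.clock() - self.last_beat > self.timeout_s:
            self.on_timeout()
            return True
        return False


# In an MPI program, on_timeout would typically abort the whole job,
# for example with mpi4py (assumed installed, not imported here):
#     on_timeout=lambda: MPI.COMM_WORLD.Abort(1)
```

Call `beat()` after every successful receive and `check()` from the main loop or a timer. The timeout has to be longer than the slowest legitimate gap between messages, otherwise a healthy job will be aborted.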
