Node shuts down while an MPI job is running

tamilalagan · ‎08-13-2020

Hello All,

I have a cluster of 13 nodes running.

While my jobs were running on the cluster, one of my nodes shut down.

I would expect all MPI processes running across the cluster to be killed. Instead, the remaining processes keep running, and because the data pipeline is broken, the surviving nodes run out of memory.

Intel MPI Version: Intel MPI 5.1.2.

Expected Behavior: If one node shuts down, all running MPI processes should terminate.

Actual Behavior: The other MPI processes keep running.

Note: If I kill my MPI process on any node, the rest of the processes in the cluster are killed.

AbhishekD_Intel · ‎08-14-2020

Hi,

Ideally, execution on all nodes should be killed, but that is not happening in your case.

Please check the value of the environment variable I_MPI_FAULT_CONTINUE. If its value is ON, your program will continue to execute even if one of your nodes fails.
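One way to check this from inside the job's environment is sketched below. It is a minimal sketch, not Intel tooling: the helper name `fault_continue_enabled` is hypothetical, and the set of enabling values (`enable`, `yes`, `on`, `1`) is an assumption based on the Intel MPI reference for this variable, so verify it against the documentation for your release.

```python
import os

# Values assumed to switch the feature on; check the Intel MPI
# reference for your release before relying on this list.
ENABLED_VALUES = {"enable", "yes", "on", "1"}


def fault_continue_enabled(env=None):
    """Return True if I_MPI_FAULT_CONTINUE is set to an enabling value.

    An unset variable is treated as disabled. `env` defaults to the
    current process environment and can be replaced by any mapping.
    """
    env = os.environ if env is None else env
    value = env.get("I_MPI_FAULT_CONTINUE", "")
    return value.strip().lower() in ENABLED_VALUES


if __name__ == "__main__":
    state = "enabled" if fault_continue_enabled() else "disabled or unset"
    print(f"I_MPI_FAULT_CONTINUE is {state}")
```

Running the script through the same launcher and job script as the application shows the value the ranks actually see, which may differ from the value in an interactive shell.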

Also, please update your MPI version; many problems have been fixed in the latest release of Intel MPI.

Let us know if you see the same issue with the latest MPI release.

Warm Regards,

Abhishek

tamilalagan · ‎08-14-2020

Hello,

I_MPI_FAULT_CONTINUE is not set, so it should be disabled by default.

Is there any open issue with Intel MPI 5.1.2?

AbhishekD_Intel · ‎08-17-2020

Hi,

Please update to the latest Intel MPI (2019 Update 8). We won't be able to help you much with the MPI version you are using, as there have been many fixes since 5.1.2.

Thank You,

Abhishek

AbhishekD_Intel · ‎08-28-2020

Hi,

Please give us an update on your issue with the latest MPI versions.

Warm Regards,

Abhishek

AbhishekD_Intel · ‎09-02-2020

We are assuming that the solution provided helped, and we will no longer be monitoring this issue. Please raise a new thread if you have further issues.

Thank you

tamilalagan · ‎10-21-2020

I was looking for the concepts that Intel MPI 5.1 follows for handling process termination during a node failure.

Even if there is no support, understanding the principle will help us make the correct decision.
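Independent of how a given Intel MPI release reacts to a lost node, one application-level safeguard is to have each rank abort when it stops making progress, so that survivors do not keep buffering data until memory runs out. The following is a minimal sketch under that assumption, not a description of Intel MPI internals. `StallWatchdog`, `timeout_s`, `poll_s`, and `on_stall` are hypothetical names. In a real job, `on_stall` could call `MPI_Abort` (for example `MPI.COMM_WORLD.Abort(1)` with mpi4py, assuming it is installed and the MPI thread support level allows calls from a helper thread).

```python
import threading
import time


class StallWatchdog:
    """Call on_stall once if progress() is not called within timeout_s."""

    def __init__(self, timeout_s, on_stall, poll_s=0.05):
        self.timeout_s = timeout_s
        self.on_stall = on_stall  # hypothetical hook, e.g. an MPI abort
        self.poll_s = poll_s
        self._last = time.monotonic()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        """Reset the idle timer and start the background check."""
        self.progress()
        self._thread.start()

    def progress(self):
        """Record that useful work (e.g. a completed receive) happened."""
        with self._lock:
            self._last = time.monotonic()

    def stop(self):
        """Stop the background check without firing on_stall."""
        self._stop.set()
        if (self._thread.is_alive()
                and threading.current_thread() is not self._thread):
            self._thread.join()

    def _run(self):
        # Wake up every poll_s seconds until stop() is called.
        while not self._stop.wait(self.poll_s):
            with self._lock:
                idle = time.monotonic() - self._last
            if idle > self.timeout_s:
                self.on_stall()
                return


if __name__ == "__main__":
    stalled = threading.Event()
    dog = StallWatchdog(timeout_s=0.2, on_stall=stalled.set)
    dog.start()
    for _ in range(3):
        time.sleep(0.05)
        dog.progress()  # simulated progress, e.g. after each receive
    stalled.wait(timeout=2.0)  # progress has stopped; wait for the watchdog
    print("stalled:", stalled.is_set())
```

The timeout has to be longer than the slowest legitimate gap between messages, otherwise a healthy but slow job would be aborted.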