I have a cluster of 13 nodes running.
While my jobs were running on the cluster, one of my nodes shut down.
I would expect the MPI processes running across the entire cluster to be killed. Instead, the remaining processes keep running, and since the data pipeline is broken, the nodes are running out of memory.
Intel MPI Version: Intel MPI 5.1.2.
Expected Behavior: If one node shuts down, all the MPI processes in the job should go down.
Actual Behavior: The other MPI processes keep running.
Note: If I kill an MPI process on any node myself, the rest of the processes in the cluster are killed.
Ideally, execution on all nodes should be killed, but that is not happening on your side.
Please check the value of the environment variable I_MPI_FAULT_CONTINUE: if it is enabled, your program will continue executing even if one of your nodes fails.
Also, please update your MPI version; many problems have been fixed in the latest release of Intel MPI.
Let us know whether you see the same behavior with the latest MPI release.
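If it helps, here is a minimal way to check the variable in the shell you launch the job from (the variable name is from the reply above; the rest of the launch line is your own):

```shell
# Show whether I_MPI_FAULT_CONTINUE is set in the current environment
echo "I_MPI_FAULT_CONTINUE=${I_MPI_FAULT_CONTINUE:-<unset>}"

# Unset it so a failed node is treated as fatal for the whole job
unset I_MPI_FAULT_CONTINUE
```

Note that the variable must be unset (or disabled) in the environment that mpirun/mpiexec inherits, not only on the head node.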
Please update to the latest Intel MPI (2019 Update 8). We won't be able to help you much with the MPI version you are using, as there have been many fixes since 5.1.2.
I was looking for the concepts Intel MPI 5.1 follows for handling process termination during a node failure.
Even if that version is no longer supported, understanding the principle will help us make the correct decision.
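For the general principle (this is standard MPI behavior, not Intel-specific internals): by default MPI_COMM_WORLD carries the MPI_ERRORS_ARE_FATAL error handler, so a communication error, such as a peer sitting on a dead node, aborts every process in the job. Fault-tolerant settings typically switch to MPI_ERRORS_RETURN, so errors come back to the caller and the surviving ranks keep running unless the application aborts explicitly. A minimal sketch of the two styles, assuming a working MPI installation:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Default: MPI_ERRORS_ARE_FATAL on MPI_COMM_WORLD - any failed
     * communication terminates every process in the job. */

    /* Fault-tolerant style: ask MPI to return error codes instead of
     * aborting, so the application decides what to do. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int buf = rank;
    int rc = MPI_Bcast(&buf, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        /* With MPI_ERRORS_RETURN it is up to us: here we choose to
         * take the whole job down, mimicking the default behavior. */
        fprintf(stderr, "rank %d: broadcast failed, aborting job\n", rank);
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }

    MPI_Finalize();
    return 0;
}
```

This matches what you observed: with the fatal handler (or an explicit MPI_Abort), killing one process tears down the job, while a returned-error mode leaves the other ranks running until they starve on the broken pipeline.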