Intel® oneAPI HPC Toolkit

Suppressing the INTERNAL_ERROR raised when an MPI message send does not complete

alexdom2000
Hi,

I'm implementing fault-tolerance techniques using Intel MPI.

I have the following scenario: two hosts (A and B) communicate via MPI messages.
Host B fails (a crash or a lost link, for example). Host A, which was trying to send a message to host B, cannot complete the send. About 15 minutes later the application terminates with an INTERNAL_ERROR. This happens because the TCP layer has exhausted its retransmission attempts (the number of attempts is set by tcp_retries2).
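The roughly 15-minute delay is consistent with how Linux backs off TCP retransmissions. Below is a minimal sketch of that arithmetic, assuming the retransmission timeout starts at the kernel's 200 ms floor, doubles after every attempt, and is capped at 120 s. The function name tcp_give_up_ms is only illustrative, and the real timeout depends on the measured round-trip time, so treat the result as a lower bound rather than an exact figure.

```shell
#!/usr/bin/env bash
# Estimate how long Linux keeps retransmitting on an established TCP
# connection before giving up, for a given tcp_retries2 value.
# Assumptions: the RTO starts at 200 ms, doubles per retry, caps at 120 s.

tcp_give_up_ms() {
    local retries=$1
    local rto_ms=200 max_ms=120000 total_ms=0 i
    # One wait for the original transmission plus one per retransmission.
    for ((i = 0; i <= retries; i++)); do
        total_ms=$((total_ms + rto_ms))
        rto_ms=$((rto_ms * 2))
        if ((rto_ms > max_ms)); then rto_ms=$max_ms; fi
    done
    echo "$total_ms"
}

# Use the running kernel's setting when readable, else the usual default of 15.
retries=$(cat /proc/sys/net/ipv4/tcp_retries2 2>/dev/null || echo 15)
echo "tcp_retries2=$retries -> about $(( $(tcp_give_up_ms "$retries") / 1000 )) s before the connection is dropped"
```

With the default tcp_retries2 of 15 the function returns 924600 ms, i.e. a little over 15 minutes, which matches the delay seen on host A.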

The same test run with Open MPI behaves differently: the application is not terminated.

Is there any way to disable this error in Intel MPI?
More precisely, can I prevent the INTERNAL_ERROR that is generated when a send still has not completed after the TCP layer has used up the retries defined by tcp_retries2?

Thanks,
Alexandre D. Gonçalves
Dmitry_K_Intel2
Hi Alexandre,

You can try to set I_MPI_FAULT_CONTINUE=on:
$ mpiexec -env I_MPI_FAULT_CONTINUE on -n 2 ./test

Regards!
Dmitry