Intel® HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

How to catch MPI exception

Jimmy821
Beginner
1,103 Views
Hi,

Is there support for MPI::ERRORS_THROW_EXCEPTIONS?

I notice that no exception is caught when there is a network loss.

Thanks.
5 Replies
Dmitry_K_Intel2
Employee
Hi Jimmy,

Please take a look at the example: here
If you do everything correctly but still cannot catch an exception, that probably means the MPI function does not return an error code.

Regards!
Dmitry
Andrey_D_Intel
Employee
Hi,

Could you please clarify which MPI implementation we are talking about? In the Intel MPI Library, MPI::ERRORS_THROW_EXCEPTIONS is supported in accordance with the MPI standard.

Best regards,
Andrey
Jimmy821
Beginner
I am using Intel MPI 4.0. I am running 3 instances of my application on the same computer. To test the exception handling, I forcefully terminate one instance of the application.

However, it appears that the catch blocks of the other 2 instances are not triggered. I use standard MPI functions such as MPI_TEST, MPI_BCAST, MPI_IRECV, MPI_SEND, MPI_PEEK.

Could I also ask how to use the I_MPI_TCP_NETMASK flag in a configuration file? I have not found any way to include it.

Thanks!
jimmy82
Novice
Just a quick update... I realised that I am able to catch an exception caused by a software error, for example a mismatch in data sizes.

However, my objective is to catch errors caused by a network disconnection or by other nodes hanging abruptly. For this case, I read that there is no way to do so, because mpiexec does not trap the errors and proceeds to terminate all running processes.
Dmitry_K_Intel2
Employee
Hi Jimmy,

Please read clause 5 of the Reference Manual, which covers fault tolerance. It may apply to your case (or perhaps you are talking about checkpoints).
Mpiexec does not catch errors! Mpiexec aborts an application if one of its processes has been aborted because of an error.

Regards!
Dmitry