Intel® MPI Library

Notification of a failed (dead) node using PSM2

RKraw

Hello everyone,

I am implementing a failure recovery system for a cluster with Intel Omni-Path that will handle computations for a physics experiment. I want a mechanism that detects a failed node and notifies the rest of the nodes. I tried to detect the node failure by invoking psm2_poll. Unfortunately, according to the Intel® Performance Scaled Messaging 2 (PSM2) Programmer's Guide, this function does not return values other than PSM2_OK or PSM2_OK_NO_PROGRESS. That matches what I observe in my application: polling while a remote node is dead behaves as if the node had not failed or disconnected and had simply not sent any message.

So the question is: what methods are there for notifying the other nodes after a node failure? Is there a lightweight function that I can invoke along with the poll to check whether the node I am trying to receive messages from still exists?

In the worst case, I can implement this using a counter and a timeout, but if there is a mechanism supported by the API, I would prefer to use it.
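To make the counter-and-timeout fallback concrete, here is a minimal sketch in plain C. It does not call PSM2 at all: the names peers_init, peer_heard_from, peers_check_timeouts, MAX_PEERS and PEER_TIMEOUT_MS are hypothetical, and the idea is only that the application records a timestamp whenever a message from a peer completes and declares the peer dead once it has been silent for longer than a chosen threshold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_PEERS       4    /* hypothetical cluster size */
#define PEER_TIMEOUT_MS 500  /* hypothetical silence threshold */

typedef struct {
    uint64_t last_seen_ms;   /* time the last message from this peer arrived */
    bool     alive;          /* false once the peer has timed out */
} peer_state_t;

static peer_state_t peers[MAX_PEERS];

/* Mark every peer as alive and last seen at now_ms. */
static void peers_init(uint64_t now_ms)
{
    for (int i = 0; i < MAX_PEERS; i++) {
        peers[i].last_seen_ms = now_ms;
        peers[i].alive = true;
    }
}

/* Call whenever a message (or heartbeat) from `peer` has been received. */
static void peer_heard_from(int peer, uint64_t now_ms)
{
    assert(peer >= 0 && peer < MAX_PEERS);
    peers[peer].last_seen_ms = now_ms;
    peers[peer].alive = true;
}

/* Call after each poll. Marks peers that have been silent for longer than
 * PEER_TIMEOUT_MS as dead and returns how many were newly marked. */
static int peers_check_timeouts(uint64_t now_ms)
{
    int newly_dead = 0;
    for (int i = 0; i < MAX_PEERS; i++) {
        if (peers[i].alive &&
            now_ms - peers[i].last_seen_ms > PEER_TIMEOUT_MS) {
            peers[i].alive = false;
            newly_dead++;
        }
    }
    return newly_dead;
}
```

In use, the application would call peer_heard_from for each completed receive and peers_check_timeouts after each psm2_poll, passing a monotonic timestamp in milliseconds. A peer that sends rarely would need periodic heartbeat messages so that silence really does indicate failure.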

Best Regards

RKraw

Hello Again,

I think I can give more detailed information on the topic.

In the documentation I read that the function psm2_error_register_handler allows one of three options. If I understand correctly, these are: to use no handler with PSM2_ERRHANDLER_NO_HANDLER (and then read errors from the values returned by PSM2 function calls), to defer error handling to PSM2 with PSM2_ERRHANDLER_PSM_HANDLER, or to use a user-defined function.

So my questions are the following:

1. Can the psm2_poll function return errors other than those mentioned in my previous post (such as a connection failure)? In that case, I could simply check its return value.

2. How can I define my own handler? Unfortunately, I have not found any example application that installs a user-defined handler, so a code sample would be welcome. I assume I will need special handling for a broken connection (and for errors such as PSM2_EP_WAS_CLOSED, PSM2_EP_UNIT_NOT_FOUND, or others). How do I do that?

 

Best Regards

RKraw

Hello Again,

So this is what I figured out:

1. I am defining my own handler in the following form (I took the signature from the compiler errors I got when I first tried to register a handler):

psm2_error myErrorFunc(psm2_ep* ep, psm2_error err, const char* achar, psm2_error_token* token)
{
    // body: handle or log the error here
    return err;
}

 

2. I am registering it as follows:   

psm2_error_register_handler(NULL, &myErrorFunc);

 

Now I have the following two questions:

- What do the four parameters of the handler stand for? I want to find out which remote node failed so that I can update my own communication data.

- How can I make the handler be called when a remote node disconnects or any connection to a remote node fails?

Best Regards

 
