I am writing because I am implementing a failure recovery system for a cluster with Intel Omni-Path that will handle computations in a physical experiment. What I want to implement is a mechanism to detect a failed node and to notify the rest of the nodes. I tried to detect the node failure by invoking psm2_poll. Unfortunately, as I saw in the Intel® Performance Scaled Messaging 2 (PSM2) Programmer's Guide, this function does not return values other than PSM2_OK or PSM2_OK_NO_PROGRESS. This is at least what I have observed in my application: polling a dead node behaves as if the node had not failed or disconnected and simply had not sent any message.
So the question is: what are the methods of notifying other nodes after a node failure? Is there a lightweight function that I can invoke along with psm2_poll to check whether the node from which I am trying to get messages still exists?
In the worst case, I can implement this using a counter and a timeout, but if there is a mechanism supported by the API, I would prefer to use it.
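The counter-and-timeout fallback mentioned above could look roughly like the following. This is only a minimal sketch and does not use any PSM2 call: the names peer_monitor, monitor_init, monitor_mark_alive and monitor_check, as well as MAX_PEERS, are hypothetical, and the timestamps are assumed to come from a monotonic millisecond clock supplied by the caller.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PEERS 64 /* hypothetical upper bound on the number of peers */

/* Per-peer liveness record. All names are illustrative, not PSM2 API. */
typedef struct {
    uint64_t last_seen_ms[MAX_PEERS]; /* time of the last message from each peer */
    bool     failed[MAX_PEERS];       /* true once a peer has been declared dead */
    size_t   num_peers;
    uint64_t timeout_ms;              /* silence longer than this counts as failure */
} peer_monitor;

/* Start with every peer considered alive as of now_ms. */
void monitor_init(peer_monitor *m, size_t num_peers,
                  uint64_t timeout_ms, uint64_t now_ms)
{
    m->num_peers = num_peers < MAX_PEERS ? num_peers : MAX_PEERS;
    m->timeout_ms = timeout_ms;
    for (size_t i = 0; i < m->num_peers; i++) {
        m->last_seen_ms[i] = now_ms;
        m->failed[i] = false;
    }
}

/* Call whenever any message (data or heartbeat) arrives from peer. */
void monitor_mark_alive(peer_monitor *m, size_t peer, uint64_t now_ms)
{
    if (peer < m->num_peers && !m->failed[peer])
        m->last_seen_ms[peer] = now_ms;
}

/*
 * Call after each poll. Marks peers that have been silent for longer than
 * the timeout as failed, writes their indices to out (capacity out_cap) and
 * returns how many were written. Each peer is reported only once.
 * Assumes now_ms never goes backwards (monotonic clock).
 */
size_t monitor_check(peer_monitor *m, uint64_t now_ms,
                     size_t *out, size_t out_cap)
{
    size_t n = 0;
    for (size_t i = 0; i < m->num_peers; i++) {
        if (m->failed[i])
            continue;
        if (now_ms - m->last_seen_ms[i] > m->timeout_ms) {
            if (n == out_cap)
                break; /* remaining peers are picked up by the next call */
            m->failed[i] = true;
            out[n++] = i;
        }
    }
    return n;
}
```

In the polling loop, monitor_mark_alive would be called for the sender of every received message and monitor_check after each poll. The indices it returns are the peers to announce as failed to the remaining nodes. The timeout has to be longer than the longest expected gap between messages, so idle peers would need to send periodic heartbeat messages.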
Let me give some more detailed information on the topic.
In the documentation I have read that the following function:
Allows one of three options. If I understand correctly, these are: not to use a handler at all with PSM2_ERRHANDLER_NO_HANDLER (and then to read the errors from the return values of PSM2 function calls), to defer error handling to PSM2 with PSM2_ERRHANDLER_PSM_HANDLER, or to use a user-defined function.
So my questions are the following:
1. Can the psm2_poll function return errors other than those mentioned in my previous mail (such as a connection failure)? In this case, I could simply check
2. How can I define my own handler? Unfortunately, I have not seen any example application that introduces a user-defined handler, so a code sample would be welcome. I assume I will need special handling for a broken connection (and for errors such as PSM2_EP_WAS_CLOSED, PSM2_EP_UNIT_NOT_FOUND or others). How do I do that?
So this is what I figured out:
1. I am defining my own handler in the following form (I took it from the compiler errors I got when I tried to register a handler):
psm2_error_t myErrorFunc(psm2_ep_t ep, const psm2_error_t err, const char *error_string, psm2_error_token_t token)
2. I am registering it as follows:
Now, my 2 follow-up questions are:
- What do the 4 parameters of the handler stand for? I want to find out which remote node failed so that I can update my own communication data.
- How can I make the handler be called upon disconnection of a node or any connection failure of a remote node?