Trying to make use of the MPI persistent communication primitives in our application, I'm ending up with the following sequence of events:
MPI_Ssend_init(msg, msg_length, MPI_BYTE, 0, tag, comm, &req);
MPI_Start(&req);
MPI_Cancel(&req);
MPI_Wait(&req, MPI_STATUS_IGNORE);
MPI_Request_free(&req);   // <-- HANGS
The only other node is blocked in MPI_Barrier(comm) at this point.
I noticed that if I comment out the MPI_Barrier() call and let the other node proceed to free the communicator and then enter an MPI_Barrier() on a different communicator, the MPI_Request_free() call magically returns.
I tried reproducing this in a separate test program, but there everything works as expected. So I realize there is probably some (possibly unrelated) bug in my original application that causes this behaviour, and that one would need more information to track it down.
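For reference, my standalone test looks roughly like the following (a minimal sketch assuming two ranks; the tag value 42 and the message buffer are placeholders, not taken from the real application). Note that cancelling send requests is implementation-dependent territory and has been deprecated as of MPI 4.0:

```c
/* Minimal sketch of the problematic sequence.
 * Build with mpicc, run with: mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        char msg[16] = "hello";
        MPI_Request req;

        /* Create and start a persistent synchronous send to rank 0. */
        MPI_Ssend_init(msg, sizeof msg, MPI_BYTE, 0, 42, MPI_COMM_WORLD, &req);
        MPI_Start(&req);

        /* Cancel the active operation and complete the cancellation. */
        MPI_Cancel(&req);
        MPI_Status status;
        MPI_Wait(&req, &status);

        int cancelled;
        MPI_Test_cancelled(&status, &cancelled);
        printf("send cancelled: %d\n", cancelled);

        /* Free the now-inactive persistent request. Per the standard
         * this is a local call and must not block on other processes. */
        MPI_Request_free(&req);
    }

    /* The other rank sits in a barrier, as in the original scenario. */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```

In this standalone form the MPI_Request_free() returns immediately for me; only the full application hangs.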
But what puzzles me is that MPI_Request_free() blocks at all, even though the standard says it is supposed to be a local operation (i.e., its completion should not depend on MPI calls made by other processes).
So my main question is: is MPI_Request_free() ever allowed to block like this, or does this point to a bug in the MPI implementation?
Thanks in advance!
I'm not able to easily reproduce the hang. I don't think you should be encountering a hang here. Can you run with the message checking library in Intel® Trace Analyzer and Collector? That should show if there is a problem in your usage of MPI. Also, can you send me your code showing the hang? Private message is fine.
Thanks for the quick reply, James!
I'm not allowed to share the source code of our full application, unfortunately, and we don't have a license for ITAC.
But I grabbed an evaluation copy of Intel Parallel Studio 2016 and noticed that with the included Intel MPI 5.1.1, our application (same binaries, not even recompiled) runs successfully, and mpirun --check-mpi does not report any issues at all.
Could it be a bug in the library itself that got fixed in the meantime?