Intel® MPI Library

Intel MPI: MPI_Win_wait hangs forever when using ofi fabric with psm2 provider

Michael_Lass
Beginner

On an HPC cluster with Omni-Path interconnect, the attached demo code hangs in the call to MPI_Win_wait on rank 0 when run with two processes on two distinct nodes. The problem occurs only with Intel MPI, not with Open MPI.

Interestingly, any one of the following settings works around the problem:

  • Setting FI_PROVIDER=tcp keeps OFI from using psm2. This remedies the issue, which hints at a problem with psm2.
  • Setting I_MPI_DEBUG=1 also gets rid of the problem, although this setting should not change the behavior of the code apart from producing verbose output.
  • Setting I_MPI_FABRIC=foobar also gets rid of the problem. I really do mean "foobar", i.e. a nonexistent fabric. This should not change the behavior at all, as MPI will fall back to ofi in this case.
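
For anyone trying to reproduce this, the four runs above can be written down as a small dry-run script. This is only a sketch: the launch line (mpirun -n 2 -ppn 1 ./test) and the binary name are assumptions, not taken from the post, and host selection depends on the local scheduler. The helper variant_cmd is hypothetical and only prints each command rather than running it.

```shell
#!/bin/sh
# Assumed launch line: two ranks, one per node. Adjust to the local cluster.
LAUNCH="mpirun -n 2 -ppn 1 ./test"

# Hypothetical helper: print the command for one variant.
# $1 is an optional VAR=value assignment; empty means the baseline run.
variant_cmd() {
    if [ -n "$1" ]; then
        printf 'env %s %s\n' "$1" "$LAUNCH"
    else
        printf '%s\n' "$LAUNCH"
    fi
}

variant_cmd ""                     # baseline: reported to hang in MPI_Win_wait
variant_cmd "FI_PROVIDER=tcp"      # reported to complete
variant_cmd "I_MPI_DEBUG=1"        # reported to complete
variant_cmd "I_MPI_FABRIC=foobar"  # reported to complete
```

Copying a printed line into the shell on the cluster runs that variant.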

The last bullet point makes me really wonder what the problem could be here. I ran the code with and without I_MPI_FABRIC=foobar and in both cases also set FI_LOG_LEVEL=debug to get verbose output from ofi. Apart from the ordering of lines the outputs are totally identical. However, in one case the code freezes and in the other it does not. Maybe this is some race condition that is influenced by these little environment changes.
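
One way to check the "identical apart from ordering" claim is to sort both logs before diffing them. The function below is a minimal sketch of that comparison; its name is made up for illustration, and it assumes the logs contain no per-run values such as PIDs or timestamps, which would have to be filtered out first.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if two files contain the same lines,
# regardless of the order in which the lines appear.
same_lines_ignoring_order() {
    _a=$(mktemp) || return 2
    _b=$(mktemp) || { rm -f "$_a"; return 2; }
    sort "$1" > "$_a"
    sort "$2" > "$_b"
    diff "$_a" "$_b" > /dev/null
    _rc=$?
    rm -f "$_a" "$_b"
    return $_rc
}
```

With the attached logs this would be called as same_lines_ignoring_order ofi_debug_freezing.txt ofi_debug_invalid_fabric.txt.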

I can reproduce the problem with Intel MPI versions 2021.4.0 and 2021.5.0, using the corresponding ICC versions for compilation. Note that it cannot be reproduced with two processes running on the same node.

Attached files:

  • test.c - Source code that reproduces the problem.
  • impi_debug.txt - Output when setting I_MPI_DEBUG=1. As mentioned, this hides the issue.
  • ofi_debug_freezing.txt - Output when setting FI_LOG_LEVEL=debug. The program freezes.
  • ofi_debug_invalid_fabric.txt - Output when setting both FI_LOG_LEVEL=debug and I_MPI_FABRIC=foobar. The program does not freeze.
22 Replies
James_T_Intel
Moderator

At this point, since we are unable to reproduce internally, I'm afraid this is going to come down to a system configuration issue.


Reinstalling the Omni-Path drivers, or Intel® MPI Library itself, might help.


James_T_Intel
Moderator

Since there has been no update, I am closing this for Intel support. Any further replies on this thread will be considered community only.

