Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Suspected unfixed Intel MPI race condition in collectives

zheyushen
Beginner
366 Views

Zheyu from Microsoft here. During our internal testing of Intel MPI on HPC clusters with the OSU benchmarks, we found that Intel MPI would often get stuck in MPI_Init(). The symptom is that one of the threads has a stack trace like the following:

[Screenshot: stack trace of one of the threads]

while the other threads have stack traces like the following:

[Screenshot: stack trace of all other threads]

Upon further extensive testing, we found that the problem persists across multiple IMPI versions, UCX versions, and other environment configurations. However, once we set I_MPI_STARTUP_MODE to pmi_shm, the hang goes away. According to the documentation, this workaround comes with a performance cost for process startup.
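For reference, below is a minimal sketch of an init-only reproducer plus an illustrative launch line with the workaround applied; the compiler wrapper, file names, and process count are placeholders, and only the I_MPI_STARTUP_MODE=pmi_shm setting is taken from our actual testing.

/*
 * Minimal init-only reproducer (sketch). The hang we observe happens inside
 * MPI_Init() itself, before any application-level communication starts.
 *
 * Illustrative build/run lines (placeholders, not our exact commands):
 *   mpiicc init_only.c -o init_only
 *   I_MPI_STARTUP_MODE=pmi_shm mpirun -n 2 ./init_only   # workaround applied
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* With the default startup mode, some runs never return from MPI_Init(). */
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Report which startup mode was in effect for this run. */
    const char *mode = getenv("I_MPI_STARTUP_MODE");
    if (rank == 0)
        printf("I_MPI_STARTUP_MODE=%s\n", mode ? mode : "(unset)");

    MPI_Finalize();
    return 0;
}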

We suspect that the bug fix mentioned in the 14.2 release notes, i.e. the fix for a race condition in collectives causing hangs, was not applied to the collectives used in the startup code path that goes through the netmod infrastructure (e.g. MPIDU_bc_allgather), and that this is causing the hang we are seeing. Could you investigate whether this is the case, or whether it is some other kind of race condition? In which release is a fix planned?

3 Replies
TobiasK
Moderator
283 Views

@zheyushen 
can you please share more details: which versions did you test exactly, and on which hardware?

The fixes for collectives are not relevant to the startup case, and as you can see, some UCX functions are at the top of the stack, so there is some failure inside UCX.

zheyushen
Beginner
261 Views

IMPI versions 13.1, 14.2, and 15.0 were extensively tested on A100 clusters, along with various versions of UCX (from either HPC-X or DOCA_OFED). All exhibited the same hang unless I_MPI_STARTUP_MODE was changed. Other MPI variants (e.g. HPC-X, or MVAPICH2 with HPC-X's UCX) have no problems with the UCX versions we are using.

In my opinion the fixes for collectives can still be relevant, as shown by the MPIDU_bc_allgather stack frame in the first screenshot. The UCX functions at the top of the stack are, in my view, just the result of busy-waiting caused by a race-condition bug in the IMPI layer.
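To illustrate the busy-waiting argument, here is a purely schematic sketch (not Intel MPI or MPICH source; all names are hypothetical stand-ins) of why the transport's progress function ends up at the top of every sampled stack even when the defect sits in the layer above it:

#include <stdbool.h>
#include <stdio.h>

/* Schematic only: hypothetical names, not Intel MPI/MPICH code. */
static volatile bool allgather_done = false;  /* never set: models the race */
static unsigned long progress_calls = 0;

/* Stand-in for the UCX/netmod progress engine. */
static void transport_progress(void)
{
    ++progress_calls;
}

/* Stand-in for a blocking startup collective such as MPIDU_bc_allgather:
 * it spins on a completion flag and drives the progress engine on every
 * iteration. If the flag is never set because of a race in the layer above,
 * the loop spins forever and every stack sample lands inside
 * transport_progress(), even though the bug is not in the transport. */
static void blocking_allgather_wait(void)
{
    while (!allgather_done) {
        transport_progress();
        if (progress_calls == 100000000UL) {  /* cap so this demo terminates */
            printf("still polling after %lu progress calls\n", progress_calls);
            return;
        }
    }
}

int main(void)
{
    blocking_allgather_wait();
    return 0;
}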

zheyushen
Beginner
94 Views