Zheyu from Microsoft here. During our internal testing of Intel MPI on HPC clusters using the OSU benchmarks, we found that Intel MPI would often hang in MPI_Init(). The symptom was that one of the threads would have a stack trace like the following:
[stack trace of one of the threads]
while all the other threads would have stack traces like the following:
[stack trace of all the other threads]
Further extensive testing showed that the problem persists across multiple IMPI versions, UCX versions, and other environment configurations. However, once we set I_MPI_STARTUP_MODE to pmi_shm, the hang goes away. According to the documentation, this workaround comes with a performance cost for process startup.
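For reference, the hang occurs before any application-level communication starts, so a minimal MPI_Init/MPI_Finalize program is enough to exercise the failing code path. This is a sketch; our actual testing used the OSU benchmarks, and the build/launch lines below are illustrative:

```c
/* repro.c -- minimal program exercising MPI_Init(), where the hang occurs.
 * Illustrative build/launch with Intel MPI (adjust to your environment):
 *   mpiicc repro.c -o repro
 *   mpirun -n <ranks> -ppn <ranks-per-node> ./repro
 * Workaround observed in our testing:
 *   I_MPI_STARTUP_MODE=pmi_shm mpirun -n <ranks> ./repro
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* The hang happens inside MPI_Init() itself, before any
     * application-level communication takes place. */
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d initialized\n", rank, size);

    MPI_Finalize();
    return 0;
}
```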
We suspect that the bug fix mentioned in the 14.2 release notes, i.e. the fix for a race condition in collectives causing hangs, was not applied to the collectives used in the startup code path that goes through the netmod infrastructure (e.g. MPIDU_bc_allgather), causing the hang we are seeing; a generic sketch of the kind of race we have in mind follows below. Could you investigate whether this is the case, or whether it is some other kind of race condition? What would be the planned release version for the fix?
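To illustrate the class of bug we suspect (this is not Intel MPI's actual code, which we cannot see; it is a generic sketch of a busy-wait race): if a completion flag is polled in a spin loop without proper synchronization, the polling threads can spin forever even after the flag has been set, which matches the picture of one thread having finished its part while all the others busy-wait in the progress loop.

```c
/* Illustrative only: a non-atomic completion flag polled in a busy-wait
 * loop. Without atomics or other synchronization this is a data race:
 * the compiler is free to hoist the load out of the loop, so the poller
 * may never observe the store and spins forever -- the "one thread done,
 * everyone else spinning" picture from the stack traces above.
 * Build: cc -pthread race_sketch.c -o race_sketch
 */
#include <pthread.h>
#include <stdio.h>

static int done = 0;                /* BUG: should be _Atomic / protected */

static void *worker(void *arg)
{
    (void)arg;
    /* ... perform the "collective" step ... */
    done = 1;                       /* BUG: no release ordering */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);

    while (!done)                   /* BUG: no acquire ordering; may spin */
        ;                           /* forever -> hang in busy-wait       */

    pthread_join(t, NULL);
    puts("completed");
    return 0;
}
```

Declaring the flag _Atomic int (or synchronizing with a mutex and condition variable) removes the race; whether the actual IMPI issue has this shape is exactly what we are asking you to check.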
@zheyushen
Can you please share more details: which versions exactly did you test, and on which hardware?
The fixes for collectives are not relevant to the startup case, and as you can see, some UCX functions are at the top of the stack, so this points to a UCX failure.
IMPI versions 13.1, 14.2, and 15.0 were extensively tested on A100 clusters, with various versions of UCX (from either HPC-X or DOCA_OFED). All exhibited the same hang unless I_MPI_STARTUP_MODE was changed. Other MPI variants (e.g. HPC-X, or MVAPICH2 with HPC-X's UCX) have no problems with the UCX versions we are using.
IMO the fixes for collectives can still be relevant, as shown by the MPIDU_bc_allgather stack frame in the first stack trace above. The UCX functions on top are just the result of busy-waiting caused by a race-condition bug in the IMPI layer.