MPI jobs are hanging on a 3-node cluster made up of a Cascade Lake head node and two Sapphire Rapids nodes. Running with "I_MPI_FABRICS=ofi" to exclude shm makes the jobs work. Running on just the 2 Sapphire Rapids systems works reliably.
The cluster runs RHEL 9.3, fully up to date, with the latest Intel HPC kit installed via the repo.
I did a fresh install of the OS from DVD and added only the Intel HPC kit, and the problem still reproduced.
I've attached the output of running with I_MPI_DEBUG=255.
Thanks,
Rick
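For anyone trying to reproduce the comparison described above, here is a minimal sketch that runs the same job once with shm included and once with ofi only, under a time limit so a hang does not block the shell. The hostnames, the test binary, the process count and the time limit are all placeholders, not values from this thread; `-hosts`, `-ppn` and `-genv` are the usual Intel MPI `mpirun` options.

```shell
#!/bin/sh
# Sketch: run one job per I_MPI_FABRICS setting and keep the debug output.
# HOSTS, APP and LIMIT are placeholders, not values from the thread.
HOSTS="${HOSTS:-head01,node01,node02}"   # hypothetical hostnames
APP="${APP:-./mpi_hello}"                # hypothetical MPI test binary
LIMIT="${LIMIT:-120}"                    # seconds allowed per run

# Print the mpirun command line for a given I_MPI_FABRICS value.
fabric_cmd() {
  echo "mpirun -hosts $HOSTS -ppn 2 -genv I_MPI_FABRICS $1 -genv I_MPI_DEBUG 255 $APP"
}

# Run one setting under timeout(1) and report the exit status.
try_fabric() {
  log="debug_$(echo "$1" | tr ':' '_').log"
  timeout "$LIMIT" $(fabric_cmd "$1") > "$log" 2>&1
  rc=$?
  echo "$1: exit status $rc (124 means the time limit was hit), log in $log"
}

# Only launch real jobs when explicitly requested.
if [ "${RUN_MPI:-0}" = "1" ]; then
  try_fabric shm:ofi
  try_fabric ofi
fi
```

Run it as `RUN_MPI=1 sh compare_fabrics.sh` (the script name is arbitrary); without `RUN_MPI=1` it only defines the helpers and launches nothing.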
I just noticed I gave the wrong CPU generations in the title: it's Cascade Lake and Sapphire Rapids, not Skylake and Cascade Lake. I don't think it lets me change the subject.
I added another Cascade Lake system to the cluster to test. The job also fails when run on just the 2 Cascade Lake systems, so mixing CPU generations isn't the issue. It looks like a bug specific to Cascade Lake systems.
Could you please try setting
I_MPI_PLATFORM=clx
or
I_MPI_PLATFORM=auto?
I assume a single-node job with I_MPI_FABRICS=shm:ofi or I_MPI_FABRICS=shm also works on every node?
Best
Tobias
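The checks suggested above can be written out as a small test matrix. This is only a sketch: the hostnames, the test binary and the process counts are placeholders, and the script prints the command lines rather than running them, so they can be reviewed and pasted one at a time.

```shell
#!/bin/sh
# Sketch: list the suggested checks as mpirun command lines.
# NODES and APP are placeholders, not values from the thread.
NODES="${NODES:-head01 node01 node02}"   # hypothetical hostnames
APP="${APP:-./mpi_hello}"                # hypothetical MPI test binary

# Print one command per check; nothing is executed here.
list_checks() {
  all=$(echo "$NODES" | tr ' ' ',')
  # Multi-node runs with each platform override.
  for p in clx auto; do
    echo "mpirun -hosts $all -ppn 2 -genv I_MPI_PLATFORM $p $APP"
  done
  # Single-node runs on each host with both shared-memory fabric settings.
  for h in $NODES; do
    for f in shm shm:ofi; do
      echo "mpirun -hosts $h -n 4 -genv I_MPI_FABRICS $f $APP"
    done
  done
}

list_checks
```

Keeping the platform runs multi-node and the fabric runs single-node mirrors the two separate questions: whether the override cures the multi-node hang, and whether shm is healthy on each machine by itself.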
Hi Tobias,
Thanks for the help!
Yes, single-node jobs on the Cascade Lake systems work with shm without a problem. The issue only happens on multi-node jobs.
I_MPI_PLATFORM=clx seems to be working properly! I'm doing further testing to ensure it all works 100% but I think this takes care of the issue.
Thanks,
Rick
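Since an intermittent hang can pass a single run, one way to do the kind of further testing mentioned above is to repeat the job many times with the override in place and count the outcomes. The sketch below assumes GNU `timeout` (which exits with 124 when the limit is hit); the hostnames, binary, run count and time limit are placeholders, not values from this thread.

```shell
#!/bin/sh
# Sketch: repeat a multi-node job with I_MPI_PLATFORM=clx and tally results.
# HOSTS, APP, RUNS and LIMIT are placeholders, not values from the thread.
HOSTS="${HOSTS:-head01,node01,node02}"   # hypothetical hostnames
APP="${APP:-./mpi_hello}"                # hypothetical MPI test binary
RUNS="${RUNS:-20}"                       # number of repetitions
LIMIT="${LIMIT:-120}"                    # seconds allowed per run

# Map an exit status from timeout(1) to a label.
classify() {
  case "$1" in
    0)   echo ok ;;
    124) echo hang ;;   # time limit reached
    *)   echo fail ;;
  esac
}

# Run the job RUNS times and print a summary line.
soak() {
  ok=0; hang=0; fail=0; i=0
  while [ "$i" -lt "$RUNS" ]; do
    timeout "$LIMIT" mpirun -hosts "$HOSTS" -ppn 2 \
      -genv I_MPI_PLATFORM clx "$APP" > /dev/null 2>&1
    rc=$?
    case "$(classify "$rc")" in
      ok)   ok=$((ok + 1)) ;;
      hang) hang=$((hang + 1)) ;;
      fail) fail=$((fail + 1)) ;;
    esac
    i=$((i + 1))
  done
  echo "ok=$ok hang=$hang fail=$fail"
}

# Only launch real jobs when explicitly requested.
if [ "${RUN_MPI:-0}" = "1" ]; then
  soak
fi
```

A summary with `hang=0` over a reasonable number of runs is more convincing than one clean run, though it still does not prove the hang is gone.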