Intel® MPI Library

MPI job hangs when including both Skylake and Cascade Lake nodes - disabling shm works

RickW_Microway
Beginner

A 3-node cluster with a Cascade Lake head node and Sapphire Rapids nodes is experiencing MPI job hangs. Running with "I_MPI_FABRICS=ofi" to exclude shm makes the jobs work. Running on just the 2 Sapphire Rapids systems works reliably.

The cluster runs RHEL 9.3, fully up to date, with the latest Intel HPC kit installed from the repo.

I also did a fresh install of the OS from DVD, adding only the Intel HPC kit, and the problem still reproduced.

I've attached the output of running with I_MPI_DEBUG=255.

Thanks,

Rick
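
For readers who want to try the workaround described in this post, here is a minimal sketch. The host names (head, node1, node2), the binary ./my_mpi_app, and the log file name are placeholders that do not come from the thread; only the two I_MPI_* settings are taken from the post.

```shell
# Hypothetical host list and binary; substitute your own.
HOSTS="head,node1,node2"
APP="./my_mpi_app"

# Restrict Intel MPI to the OFI fabric so the shm transport is not used.
export I_MPI_FABRICS=ofi
# Verbose debug output, at the level used for the log attached to the post.
export I_MPI_DEBUG=255

if command -v mpirun >/dev/null 2>&1 && [ -x "$APP" ]; then
    # One rank per node across all three hosts; keep the output for a report.
    mpirun -hosts "$HOSTS" -n 3 -ppn 1 "$APP" 2>&1 | tee mpi_debug.log
else
    echo "mpirun or $APP not available; settings: I_MPI_FABRICS=$I_MPI_FABRICS"
fi
```

Running with ofi only trades intra-node shared-memory transport for a job that completes, so it is best treated as a diagnostic step rather than a final configuration.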

RickW_Microway
Beginner

I just noticed I gave the wrong CPU generations in the title: it's Cascade Lake and Sapphire Rapids, not Skylake and Cascade Lake. I don't think the forum lets me change the subject.

RickW_Microway
Beginner

I added another Cascade Lake system to the cluster to test. The job also fails on 2 Cascade Lake systems, so mixing generations isn't the issue. It looks like a bug specific to Cascade Lake systems.

TobiasK
Moderator

@RickW_Microway

Could you please try setting

I_MPI_PLATFORM=clx

or

I_MPI_PLATFORM=auto?

I assume a single-node job with I_MPI_FABRICS=shm:ofi and with I_MPI_FABRICS=shm also works on all nodes?

Best

Tobias
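
The single-node check asked about here could be scripted roughly as follows. This is only a sketch: ./my_mpi_app, the rank count, and the 300-second limit are assumptions, and the coreutils timeout command is used so that a hang shows up as a failure instead of blocking the loop.

```shell
# Hypothetical binary; substitute your own MPI program.
APP="./my_mpi_app"
FABRICS_TESTED=""

# Try both fabric settings mentioned in the post, on the local node only.
for fabric in shm:ofi shm; do
    FABRICS_TESTED="$FABRICS_TESTED $fabric"
    if command -v mpirun >/dev/null 2>&1 && [ -x "$APP" ]; then
        # Set the fabric for this run only; give up after 300 seconds.
        if I_MPI_FABRICS="$fabric" timeout 300 mpirun -n 4 "$APP"; then
            echo "$fabric: completed"
        else
            echo "$fabric: failed or timed out"
        fi
    else
        echo "$fabric: skipped (mpirun or $APP not available)"
    fi
done
```

Repeating this on each node separately shows whether shm itself is healthy before looking at multi-node behavior.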


RickW_Microway
Beginner

Hi Tobias,

Thanks for the help!

Yes, single-node jobs on the Cascade Lake systems work with shm without a problem. The issue only happens with multi-node jobs.

I_MPI_PLATFORM=clx seems to be working properly! I'm doing further testing to make sure everything works, but I think this takes care of the issue.

Thanks,

Rick
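
A sketch of how the setting reported as working in this thread might be applied to a multi-node run with shm left enabled. The host names clx1 and clx2, the binary, and the timeout value are placeholders and not from the thread; I_MPI_PLATFORM=clx is the value suggested above.

```shell
# Hypothetical Cascade Lake host names and binary; substitute your own.
HOSTS="clx1,clx2"
APP="./my_mpi_app"

# Platform setting suggested in the thread.
export I_MPI_PLATFORM=clx
# Leave the fabric unset so the library's default selection applies.
unset I_MPI_FABRICS

if command -v mpirun >/dev/null 2>&1 && [ -x "$APP" ]; then
    # One rank per node; GNU timeout exits with 124 if the limit is reached.
    timeout 300 mpirun -hosts "$HOSTS" -n 2 -ppn 1 "$APP"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "job did not finish within 300 seconds (possible hang)"
    else
        echo "job exited with status $status"
    fi
else
    echo "mpirun or $APP not available; I_MPI_PLATFORM=$I_MPI_PLATFORM"
fi
```

Exporting the variable in the job script or shell profile keeps it consistent across runs while further testing is under way.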
