Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPI job hangs when including both Skylake and Cascade lake nodes - disabling shm works

RickW_Microway
Beginner

A 3-node cluster with a Cascade Lake head node and Sapphire Rapids compute nodes is experiencing MPI jobs hanging.  Running with "I_MPI_FABRICS=ofi" to exclude shm makes the jobs work.  Running on just the two Sapphire Rapids systems works reliably.
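For reference, the failing case and the workaround can be reproduced with command lines like the following (a sketch: the hostnames, process counts, and the IMB-MPI1 benchmark binary are placeholders, and the oneAPI install path assumes a default installation):

```shell
# Load the Intel oneAPI environment (default install location; adjust if needed)
source /opt/intel/oneapi/setvars.sh

# Hangs: default fabrics (shm:ofi) across the Cascade Lake head node
# and a Sapphire Rapids compute node (hostnames are placeholders)
mpirun -n 4 -ppn 2 -hosts head,node1 ./IMB-MPI1 allreduce

# Works: restrict to OFI only, bypassing the shm intra-node transport
I_MPI_FABRICS=ofi mpirun -n 4 -ppn 2 -hosts head,node1 ./IMB-MPI1 allreduce
```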

 

The cluster is installed with RHEL 9.3, fully up to date, and has the latest Intel HPC toolkit installed via the repo.

 

I did a fresh install of the OS from DVD and added only the Intel HPC toolkit, and the problem still reproduced.

 

I've attached the output of running with I_MPI_DEBUG=255.

 

Thanks,

Rick

4 Replies
RickW_Microway
Beginner

I just noticed I listed the wrong CPU generations in the title: it's Cascade Lake and Sapphire Rapids, not Skylake and Cascade Lake.  I don't think the forum lets me change the subject.

RickW_Microway
Beginner

I added another Cascade Lake system to the cluster to test.  The job also fails across two Cascade Lake systems, so mixing generations isn't the issue.  It seems to be a bug specific to Cascade Lake systems.

TobiasK
Moderator

@RickW_Microway


Could you please try setting

I_MPI_PLATFORM=clx

or

I_MPI_PLATFORM=auto?


I guess a single node job with I_MPI_FABRICS=shm:ofi and I_MPI_FABRICS=shm also works on all nodes?
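The suggested checks could be run along these lines (a sketch: hostnames, process counts, and the IMB-MPI1 benchmark binary are placeholders):

```shell
# Try pinning the platform tuning to Cascade Lake (clx), then auto-detection
I_MPI_PLATFORM=clx  mpirun -n 4 -hosts head,node1 ./IMB-MPI1 allreduce
I_MPI_PLATFORM=auto mpirun -n 4 -hosts head,node1 ./IMB-MPI1 allreduce

# Single-node sanity checks with shared memory enabled
I_MPI_FABRICS=shm:ofi mpirun -n 4 -hosts head ./IMB-MPI1 allreduce
I_MPI_FABRICS=shm     mpirun -n 4 -hosts head ./IMB-MPI1 allreduce
```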

Best

Tobias


RickW_Microway
Beginner

Hi Tobias,

 

Thanks for the help!

 

Yes, single-node jobs on the Cascade Lake systems work with shm without a problem.  The issue only happens with multi-node jobs.

 

I_MPI_PLATFORM=clx seems to be working properly!  I'm doing further testing to ensure everything works 100%, but I think this takes care of the issue.
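To make the workaround persistent, the variable can be exported wherever the job environment is set up (a sketch; the exact location, e.g. a shell profile or scheduler prologue, depends on the site):

```shell
# e.g. in ~/.bashrc or a cluster-wide job prologue script:
# force Intel MPI's platform-specific tuning to Cascade Lake
export I_MPI_PLATFORM=clx
```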

 

Thanks,

Rick
