Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Freeze and performance issue with Intel MPI 2021.07 & 2021.08 on InfiniBand (oneAPI 2022.03 & 2023.0)

LucasH
Beginner

Hello to all,

For context, I am setting up a new HPC cluster and do not have much experience with MPI communication.

Here is my problem:

  • I have already validated several applications with Intel MPI 2021.07. The tests were conclusive, and performance was much better than on our old infrastructure.
  • For the first applications that caused problems, I use the Intel MPI distribution shipped with oneAPI rather than the one bundled with the code, because every Intel MPI version older than 2021 freezes. (Intel support could only suggest upgrading, which is fine for my needs.)
  • However, there is one application for which I have no clean, stable solution:

If I load the oneAPI 2022.03 module without setting any additional variables, my application aborts with the fatal error described as Case 2 or Case 3 on this page: https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top/troubleshooting/error-message-fatal-error.html

If I load the oneAPI 2022.03 module with "I_MPI_OFI_PROVIDER" set to mlx (which is my goal, because I want to go through InfiniBand and not Ethernet), I get the same error.

With "I_MPI_OFI_PROVIDER" set to psm3, the application launches but the calculation is very slow. (Does psm3 only run over TCP here?)
To make it run faster, I have to set the variables "MPIR_CVAR_CH4_OFI_TAG_BITS" and "MPIR_CVAR_CH4_OFI_RANK_BITS", for example to 31 and 20 respectively. However, this does not work in all cases, and we often have to change the values. Why do these variables have to be set by hand? Which values would cover most cases?
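As a rough illustration of what the values 31 and 20 would imply, here is a minimal Python sketch. It assumes that the two variables simply set how many bits are reserved for the MPI tag and for the rank when messages are matched, which is a reading of the variable names and not something confirmed in this thread. The helper `field_capacity` is a hypothetical name introduced only for this example.

```python
def field_capacity(bits: int) -> int:
    """Number of distinct values that fit in an unsigned field of `bits` bits."""
    if bits < 0:
        raise ValueError("bit width must be non-negative")
    return 1 << bits


# Values quoted in the post (assumed to be plain bit widths).
tag_bits = 31
rank_bits = 20

max_tag = field_capacity(tag_bits) - 1   # largest tag value that fits
max_ranks = field_capacity(rank_bits)    # number of ranks that can be addressed

print(f"tag bits  = {tag_bits}: tags up to {max_tag}")
print(f"rank bits = {rank_bits}: up to {max_ranks} ranks")
```

Under that assumption, the rank width would need to cover the largest job size and the tag width the largest tag the application uses, which may be why a single pair of values does not suit every case.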

On the other hand, everything works when "I_MPI_OFI_PROVIDER" is set to sockets, except that I want to use InfiniBand, not Ethernet.
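The provider tests described above could be scripted so that each run differs only in its environment. The Python sketch below only builds and prints the settings and the launch command; it does not start anything. The executable name `./my_app`, the process count, and the helper names are hypothetical, and the choice of `I_MPI_DEBUG=5` assumes that this debug level is verbose enough to report which provider was actually selected.

```python
import shlex

# Providers tried in the post.
PROVIDERS = ("mlx", "psm3", "sockets")


def build_env(provider, tag_bits=None, rank_bits=None, base_env=None):
    """Return a copy of base_env with the Intel MPI settings for one test applied."""
    if provider not in PROVIDERS:
        raise ValueError(f"unexpected provider: {provider!r}")
    env = dict(base_env or {})
    env["I_MPI_OFI_PROVIDER"] = provider
    # Assumption: this debug level makes the library print the provider in use.
    env["I_MPI_DEBUG"] = "5"
    if tag_bits is not None:
        env["MPIR_CVAR_CH4_OFI_TAG_BITS"] = str(tag_bits)
    if rank_bits is not None:
        env["MPIR_CVAR_CH4_OFI_RANK_BITS"] = str(rank_bits)
    return env


def build_command(nprocs, app="./my_app"):
    """Launch command; "./my_app" is a placeholder for the real executable."""
    return ["mpirun", "-n", str(nprocs), app]


for provider in PROVIDERS:
    # Only the psm3 test in the post used the extra bit-width variables.
    extra = {"tag_bits": 31, "rank_bits": 20} if provider == "psm3" else {}
    env = build_env(provider, **extra)
    settings = " ".join(f"{key}={value}" for key, value in env.items())
    print(settings, shlex.join(build_command(4)))
```

To actually launch a test, the returned dictionary could be merged into `os.environ` and passed as the `env` argument of `subprocess.run`, keeping one log file per provider for comparison.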

I reran all of these tests with oneAPI 2023.0 (Intel MPI 2021.08) and saw no difference.

I have attached several logs to this post describing the configuration of my machines. Feel free to ask if you need additional information.

Sincerely,

Lucas

3 Replies
SantoshY_Intel
Moderator

Hi,

Thanks for posting in Intel communities.

Could you please provide us with a sample reproducer, along with the steps to reproduce your issue?

Also, please let us know which job scheduler you are using.

Thanks & Regards,

Santosh

SantoshY_Intel
Moderator

Hi,

We haven't heard back from you. Could you please provide us with the requested details?

Thanks & Regards,

Santosh

SantoshY_Intel
Moderator

Hi,

I have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.

Thanks & Regards,

Santosh