vineetsoni
Beginner
100 Views

Intel MPI RC transport hangs

Hello,

I am facing an issue with an MPI program hanging when using Intel MPI.

Characteristics of the system:

  • Intel MPI version: 2018 Update 4 Build 20180823
  • Network type: Mellanox InfiniBand HDR100
  • Network topology: Dragonfly
  • CPU: AMD Epyc 7742

When I use only 2 nodes (256 processes), the code works fine. But when I use 8 nodes, the behaviour is random: most of the time it hangs, and sometimes it fails with a segmentation fault.

The stack trace at the time of hanging shows that the processes are stuck at:
dapl_rc_vc_progress_short_msg_20() at ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_poll_rc.c:483

However, if I enable the UD transport via export I_MPI_DAPL_UD=on, it works fine. With UD, the code runs even on 10k processes.

My question is: how can I find out what causes the RC (RDMA) transport to hang (or segfault) during the computation? And how can I fix it?

I would prefer to take advantage of RC up to at least 8 nodes, and then for larger runs, I can switch to UD (if needed) to save memory.
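The RC-for-small-jobs, UD-for-large-jobs policy described above could be scripted roughly as follows. This is only a sketch: the threshold RC_MAX_NODES, the helper select_dapl_ud, and the NODES variable are hypothetical names introduced for illustration, and it assumes the RC hang has been resolved for the small-job case. Only I_MPI_DAPL_UD itself comes from the post.

```shell
#!/bin/sh
# Hypothetical threshold: largest node count for which RC is kept.
RC_MAX_NODES=8

# Print "off" (stay on RC) for small jobs and "on" (switch to UD) otherwise.
select_dapl_ud() {
    nodes=$1
    if [ "$nodes" -le "$RC_MAX_NODES" ]; then
        echo off
    else
        echo on
    fi
}

# NODES is an assumed variable; a batch scheduler usually provides its own
# node count (for example SLURM_JOB_NUM_NODES under Slurm).
NODES=${NODES:-1}
I_MPI_DAPL_UD=$(select_dapl_ud "$NODES")
export I_MPI_DAPL_UD
echo "nodes=$NODES I_MPI_DAPL_UD=$I_MPI_DAPL_UD"
```

Sourcing this in a job script before launching the MPI program would set the transport once per job, so the same script can be used for both small and large runs.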

Please note that I do not face this problem with Open MPI or MVAPICH2.

Thanks in advance.

Best,
Vineet

3 Replies
PrasanthD_intel
Moderator
69 Views

Hi Vineet,


The Intel MPI version you are using is old and no longer supported. For the list of supported versions, refer to Intel® Parallel Studio XE & Intel® oneAPI Toolkits...

Starting with IMPI 2019, the Intel® MPI Library switched from the Open Fabrics Alliance* (OFA) framework to the Open Fabrics Interfaces* (OFI) framework.

Can you upgrade to the latest version? There have been many bug fixes and performance improvements since the 2018 version.


Regards

Prasanth



PrasanthD_intel
Moderator
45 Views

Hi Vineet,


We haven't heard back from you.

Have you updated to the latest version of MPI?

Let us know if you face any problems while updating.


Regards

Prasanth


vineetsoni
Beginner
38 Views

Hi Prasanth,

Unfortunately, I am not the administrator of the machine, so I do not have control over which version is installed. I can try to install a newer version of Intel MPI in my home directory, but that would not be a practical solution, since other MPI implementations already work with RC.

I wanted to know if there are some (hidden) RC-RDMA related environment variables that can help in fixing this issue.

Anyway, I will ask the system admin why they recommend using only UD with Intel MPI; they must have faced the same problem.

Thanks,

Vineet