I am facing an issue with an MPI program hanging when using Intel MPI.
Characteristics of the system:
When I use only 2 nodes (256 processes), the code works fine. But when I use 8 nodes, the behaviour is random: most of the time it hangs, but sometimes it fails with a segmentation fault.
The stack trace at the time of hanging shows that the processes are stuck at:
dapl_rc_vc_progress_short_msg_20() at ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_poll_rc.c:483
However, if I enable the UD transport with export I_MPI_DAPL_UD=on, it works fine. With UD, the code runs even on 10k processes.
My question is: how can I find out what causes the run to hang (or segfault) with RC (RDMA)? And how can I fix it?
I would prefer to take advantage of RC up to at least 8 nodes, and then for larger runs, I can switch to UD (if needed) to save memory.
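The switching scheme I have in mind could be sketched as below. This is only a rough illustration, not a fix for the hang: the function name select_dapl_transport and the NODES variable are hypothetical, the 8-node threshold is just the number from this post, and it assumes that leaving I_MPI_DAPL_UD unset keeps the library on its default RC path.

```shell
#!/bin/sh
# Hypothetical helper: choose the DAPL transport from the node count.
# The name and the NODES variable are assumptions for illustration only.
select_dapl_transport() {
    nodes="$1"
    if [ "$nodes" -gt 8 ]; then
        # Larger runs: switch to UD to save memory
        export I_MPI_DAPL_UD=on
    else
        # Up to 8 nodes: leave the variable unset (assumed to mean RC)
        unset I_MPI_DAPL_UD
    fi
}

# Example: NODES would come from the job script; defaults to 2 here
select_dapl_transport "${NODES:-2}"
echo "I_MPI_DAPL_UD=${I_MPI_DAPL_UD:-<unset>}"
```

The function would be called in the job script before launching the MPI program, so that the launcher inherits the setting.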
Please note that I do not face this problem with Open MPI or MVAPICH2.
Thanks in advance.
The Intel MPI version you are using is old and no longer supported. For the list of supported versions, refer to Intel® Parallel Studio XE & Intel® oneAPI Toolkits...
Starting with IMPI 2019, the Intel® MPI Library switched from the Open Fabrics Alliance* (OFA) framework to the Open Fabrics Interfaces* (OFI) framework.
Can you upgrade to the latest version? There have been many bug fixes and performance improvements since the 2018 version.
Unfortunately, I am not the administrator of the machine, so I have no control over the installed version. I can try to install a newer version of Intel MPI in my home directory, but that will not be a practical solution, as other MPI implementations already work with RC.
I wanted to know if there are some (hidden) RC-RDMA related environment variables that can help in fixing this issue.
Anyway, I will ask the system admin why they recommend using only UD with Intel MPI, as they must have faced the same problem.