When I run my MPI application over 40 machines, one MPI process does not finish and hangs in 'MPI_Finalize'.
(The other MPI processes on the other machines show zero CPU usage.)
Below, I have attached the call stack of the process that is waiting in the 'MPI_Finalize' function.
Where should I start debugging? Can anybody give me a clue?
Could you provide your launch command? Also the DAPL version, the Intel MPI version, and details about the cluster hardware if possible.
If you have time for more experiments, could you try i) just a few processes with DAPL, and ii) other fabrics (if available) such as OFA or TMI?
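The experiments suggested above can be scripted. The following is a minimal sketch, assuming the fabric is selected through the `I_MPI_FABRICS` environment variable of Intel MPI 5.x (values such as `shm:dapl`, `shm:ofa`, `shm:tmi`). The application name `./my_app`, the process counts, the debug level, and the helper `trial_matrix` are hypothetical placeholders and do not come from this thread.

```python
import shlex

# Fabric settings to try, as values for I_MPI_FABRICS (intra-node:inter-node).
# Assumption: Intel MPI 5.x naming; which fabrics exist depends on the install.
FABRICS = ["shm:dapl", "shm:ofa", "shm:tmi"]


def trial_matrix(app, process_counts, fabrics=FABRICS, debug_level=2):
    """Return one shell command line per (fabric, process count) combination."""
    commands = []
    for fabric in fabrics:
        for nprocs in process_counts:
            env = f"I_MPI_FABRICS={fabric} I_MPI_DEBUG={debug_level}"
            cmd = f"mpirun -n {nprocs} {shlex.quote(app)}"
            commands.append(f"{env} {cmd}")
    return commands


if __name__ == "__main__":
    # Hypothetical application name and process counts.
    for line in trial_matrix("./my_app", [2, 5, 40]):
        print(line)
```

Running each printed command in turn and noting which combinations hang in 'MPI_Finalize' narrows the problem down to a fabric, a process count, or both.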
Hi, I used the following flags:
I checked which DAPL provider is being used with the 'I_MPI_DEBUG' option:
MPI startup(): Multi-threaded optimized library
MPI startup(): DAPL provider ofa-v2-mlx4_0-1u with IB UD extension
DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_UD_PROVIDER: ofa-v2-mlx4_0-1u
I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
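When the job runs on many machines, the same check can be automated over the collected output. This is a rough sketch that assumes the provider is always reported in a line of the form `MPI startup(): DAPL provider <name> ...`, as in the output above; the function name `find_dapl_providers` is made up for illustration.

```python
import re

# Matches lines such as:
#   MPI startup(): DAPL provider ofa-v2-mlx4_0-1u with IB UD extension
PROVIDER_RE = re.compile(r"MPI startup\(\): DAPL provider\s+(\S+)")


def find_dapl_providers(log_text):
    """Return the distinct DAPL provider names reported in I_MPI_DEBUG output."""
    providers = []
    for line in log_text.splitlines():
        match = PROVIDER_RE.search(line)
        if match and match.group(1) not in providers:
            providers.append(match.group(1))
    return providers


if __name__ == "__main__":
    sample = (
        "MPI startup(): Multi-threaded optimized library\n"
        "MPI startup(): DAPL provider ofa-v2-mlx4_0-1u with IB UD extension\n"
    )
    print(find_dapl_providers(sample))
```

If the returned list has more than one entry, different ranks selected different providers, which would be worth reporting.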
MPI Version: Intel(R) MPI Library for Linux* OS, 64-bit applications, Version 5.1.2 Build 20151015
OS : CentOS 6.4 Final
CPU : 2 * Intel® Xeon® CPU E5-2450 @ 2.10GHz (8 physical cores)
RAM : 32GB per machine
Ethernet: InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE]
I ran the application on fewer machines (40 --> 5) with a similar data size per machine. The problem seems to have gone away, but why?
Thank you for the detailed response.
Can you run the job without "export MPICH_ASYNC_PROGRESS=1" to see if the issue goes away?
Also, could you provide a small reproducer?
Please provide the version of OFED/DAPL as well.
- I am using OFED (MLNX_OFED_LINUX-2.4-1.0.4) which is the latest version that I can install with my NIC.
Is this issue related to the driver version (for example, a bug that was reported long ago and has been fixed in the latest version)?
- When I disabled 'MPICH_ASYNC_PROGRESS', the program stopped making progress in MPI functions. It waits forever in MPI_Wait (on the request object returned by MPI_Rget). This also looks like a bug.
I just got an internal response regarding the OFED version you use:
This version of MOFED has a really old dapl-2.1.3 package. They need to upgrade to the latest dapl-2.1.8.
See download pages for changes since 2.1.3: http://downloads.openfabrics.org/dapl/
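To check several nodes against the recommended version, a comparison like the following could be used. It is only a sketch: it assumes the installed package name has already been obtained as a string such as `dapl-2.1.3` (how to query it is not covered here), and the helpers `parse_dapl_version` and `needs_upgrade` are hypothetical names.

```python
import re


def parse_dapl_version(name):
    """Extract a (major, minor, patch) tuple from a string like 'dapl-2.1.3'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", name)
    if match is None:
        raise ValueError(f"no version found in {name!r}")
    return tuple(int(part) for part in match.groups())


def needs_upgrade(installed, required="dapl-2.1.8"):
    """Return True if the installed version is older than the required one."""
    # Tuples compare element by element, so 2.1.10 is correctly newer than 2.1.8.
    return parse_dapl_version(installed) < parse_dapl_version(required)


if __name__ == "__main__":
    print(needs_upgrade("dapl-2.1.3"))
```

Comparing integer tuples rather than raw strings avoids treating a version such as 2.1.10 as older than 2.1.8.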