seongyun_k_
Beginner
218 Views

MPI process hangs on 'MPI_Finalize'

Hi,

When I run my MPI application across 40 machines, one MPI process never finishes and hangs in 'MPI_Finalize'.
(The MPI processes on the other machines show zero CPU usage.)

Below, I have attached the call stack of the process that is waiting in 'MPI_Finalize'.

[Attachment: Screenshot 2016-03-07 3:38:14 PM.png]

Where should I start debugging? Can anybody give me a clue?

8 Replies
Mark_L_Intel
Employee

Hello,

Could you provide your launch command? Also the DAPL version, the Intel MPI version, and details about the cluster hardware, if possible.

If you have time for more experiments, could you try (i) just a few processes with DAPL, and (ii) other fabrics (if available), such as OFA or TMI?

BR,

Mark

 

 

seongyun_k_
Beginner

Hi, I used the following flags:

export I_MPI_PERHOST=1
export I_MPI_FABRICS=dapl
export I_MPI_FALLBACK=0
export I_MPI_DAPL_UD=1
export MPICH_ASYNC_PROGRESS=1
export I_MPI_RDMA_SCALABLE_PROGRESS=1
export I_MPI_PIN=1
export I_MPI_DYNAMIC_CONNECTION=0

I checked the DAPL provider being used with the 'I_MPI_DEBUG' option:

MPI startup(): Multi-threaded optimized library
MPI startup(): DAPL provider ofa-v2-mlx4_0-1u with IB UD extension
DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_UD_PROVIDER: ofa-v2-mlx4_0-1u
I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2

MPI Version: Intel(R) MPI Library for Linux* OS, 64-bit applications, Version 5.1.2  Build 20151015

Hardware Spec:
OS: CentOS 6.4 (Final)
CPU: 2 × Intel® Xeon® CPU E5-2450 @ 2.10 GHz (8 physical cores each)
RAM: 32 GB per machine
Interconnect: InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE]

I ran the application on fewer machines (40 → 5) with a similar data size per machine, and the problem seems to have gone away. Why would that be?

 

Mark_L_Intel
Employee

Thank you for the detailed response.

Can you run the job without 'export MPICH_ASYNC_PROGRESS=1' to see if the issue goes away?

Also, could you provide a small reproducer?

Please provide the version of OFED/DAPL as well.

Thanks,

Mark

 

 

 

seongyun_k_
Beginner

Hi,

- I am using OFED (MLNX_OFED_LINUX-2.4-1.0.4), which is the latest version I can install with my NIC.
Is this issue related to the driver version (for example, a bug that was reported long ago and fixed in a later version)?

- When I disable 'MPICH_ASYNC_PROGRESS', the program stops making progress in MPI functions: it waits forever in MPI_Wait (on a request object from MPI_Rget). (This also looks like a bug...)
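For reference, the stalled pattern described here (an MPI_Wait on a request returned by MPI_Rget) might look roughly like the minimal C sketch below. This is a hypothetical reconstruction, not the poster's actual code; without asynchronous progress, completion of passive-target RMA like this can depend on the target rank entering the MPI library.

```c
/* Hypothetical sketch of an MPI_Rget + MPI_Wait pattern.
 * Build: mpicc rget_wait.c -o rget_wait
 * Run:   mpirun -n 2 ./rget_wait                         */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (double)rank, remote = -1.0;
    MPI_Win win;
    MPI_Win_create(&local, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_lock_all(0, win);
    if (rank == 0) {
        MPI_Request req;
        /* Nonblocking get of rank 1's value. */
        MPI_Rget(&remote, 1, MPI_DOUBLE, 1 /* target */, 0, 1, MPI_DOUBLE,
                 win, &req);
        /* Without an async progress thread, this wait may only complete
         * once the target process makes MPI calls of its own. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("rank 0 read %f from rank 1\n", remote);
    }
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

Whether MPI_Wait here can complete while the target is busy computing is implementation- and fabric-dependent, which is exactly where the async-progress settings come into play.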

Thanks

Mark_L_Intel
Employee

 

Could you send me a reproducer directly at mark.lubin@intel.com? That would really help.

Thanks,

Mark

Mark_L_Intel
Employee

I just got an internal response regarding the OFED version you use:

"

this version of MOFED has a really old dapl-2.1.3 package. They need to upgrade to latest dapl-2.1.8.

 

See download pages for changes since 2.1.3:  http://downloads.openfabrics.org/dapl/

 

Latest:  http://downloads.openfabrics.org/dapl/dapl-2.1.8.tar.gz

 "

 

 

Mark_L_Intel
Employee

Do you have an Intel Premier Support account? If so, could you submit a ticket?

Thanks

Mark

 

Mark_L_Intel
Employee

It might also be a bug in your code. The MPI Standard does not guarantee true asynchronous progress for MPI applications (i.e., genuine overlap of communication and computation).
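As a hedged illustration of that point: rather than relying on 'MPICH_ASYNC_PROGRESS', portable codes often drive progress themselves by polling outstanding requests with MPI_Test between bounded chunks of computation. A minimal sketch (hypothetical, not from this thread):

```c
/* Sketch: manually driving MPI progress while computing.
 * Build: mpicc progress_poll.c -o progress_poll
 * Run:   mpirun -n 2 ./progress_poll                      */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = 1 - rank;  /* assumes exactly 2 ranks */

    int send = rank, recv = -1;
    MPI_Request req[2];
    MPI_Irecv(&recv, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(&send, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &req[1]);

    int done = 0;
    while (!done) {
        /* ... do a bounded chunk of computation here ... */

        /* Polling gives the MPI library a chance to make progress
         * even without an asynchronous progress thread. */
        MPI_Testall(2, req, &done, MPI_STATUSES_IGNORE);
    }
    printf("rank %d received %d\n", rank, recv);

    MPI_Finalize();
    return 0;
}
```

This keeps communication moving without assuming any implementation-specific background progress, at the cost of interleaving the polls into the compute loop by hand.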
