Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Any good tools or methods for debugging an MPI-based program?

Zhanghong_T_
Novice
1,108 Views

Dear all,

I have an MPI-based Fortran code that runs with one or two processes. However, when I launch the program with more processes, for example 4, it crashes with the following message:

forrtl: severe (157): Program Exception - access violation
forrtl: severe (157): Program Exception - access violation

job aborted:
rank: node: exit code[: error message]
0: N01: 123
1: N01: 123
2: n02: 157: process 2 exited without calling finalize
3: n02: 157: process 3 exited without calling finalize

I tried adding print statements and mpi_barrier calls to trace the problem, but without success. Are there any tools or methods for debugging an MPI-based program? The command line I use to run the program is as follows:

mpiexec -wdir "\\N02\Debug\directional\for_debug\mytest" -mapall -hosts 10 n01 2 n02 2 n03 2 n04 2 n05 2 n06 2 n07 2 n08 2 n09 2 n10 2 \\N02\Debug\directional\for_debug\test

 

Thanks,

Zhanghong Tang

6 Replies
Zhanghong_T_
Novice

On further checking, I found that when I run mpiexec on another host instead of n01, for example n10, the program works. If I run mpiexec on n01 but with the following command line (n02 listed before n01):

mpiexec -wdir "\\N02\Debug\directional\for_debug\mytest" -mapall -hosts 10 n02 2 n01 2 n03 2 n04 2 n05 2 n06 2 n07 2 n08 2 n09 2 n10 2 \\N02\Debug\directional\for_debug\test

The program also works. So it seems that the problem is related to myid=0, even though all hosts use the same working folder. Could anyone help me take a look at it?
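To make the effect of swapping the host order concrete, here is a minimal Python sketch of how ranks would land on hosts. It assumes that ranks fill the listed hosts block by block in the order given, which is what the abort output above suggests (ranks 0 and 1 on N01, ranks 2 and 3 on n02); the function name `rank_placement` is hypothetical and not part of any MPI tool.

```python
def rank_placement(hosts_args, nprocs):
    """Map ranks to hosts for a list like ['n01', '2', 'n02', '2'].

    Assumption: ranks are assigned to hosts in the listed order,
    filling each host's process count before moving to the next.
    """
    slots = []
    # Pair every host name with the process count that follows it.
    for name, count in zip(hosts_args[0::2], hosts_args[1::2]):
        slots.extend([name] * int(count))
    return {rank: slots[rank] for rank in range(min(nprocs, len(slots)))}


original = "n01 2 n02 2 n03 2".split()
swapped = "n02 2 n01 2 n03 2".split()

print(rank_placement(original, 4))  # rank 0 on n01
print(rank_placement(swapped, 4))   # rank 0 on n02
```

Under that assumption, the swapped command moves rank 0 from n01 to n02, so the crash may depend on which host runs rank 0 as much as on the rank itself.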

 

Thanks

John_D_6
New Contributor I

Hi Zhanghong,

In your case, you should be able to use a core dump to find out what the problem is.

More generally, besides the commercial debuggers for parallel applications, there are some free tools that I use regularly to debug MPI programs:

  1. strace (from version 4.9): you can get a stack trace of your program at a specific system call with the option -k. Enable it for the function 'exit_group':

    strace -k -eexit_group -ostrace.out [my_application]

    and it should give you a backtrace at the moment that an MPI application stops. This is useful if your application exits gracefully (so there is no core dump) but doesn't tell you where or why it stopped.

  2. padb: http://padb.pittman.org.uk/. It gives you a 'unified' backtrace of all running MPI processes. This is especially useful if your MPI application hangs.
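When several ranks run on one node, each of them would write to the same strace.out. A small Python sketch of one way around that is to build the strace command with a per-rank output file. It assumes a Linux system and a launcher that exports the rank in the `PMI_RANK` environment variable (other launchers may use a different name); `strace_command` and `./my_application` are hypothetical names used for illustration only.

```python
import os


def strace_command(app_argv, rank=None):
    """Build an strace command line that logs exit_group with a backtrace.

    Assumption: the launcher exports the rank as PMI_RANK; pass `rank`
    explicitly if your launcher uses another variable.
    """
    if rank is None:
        rank = os.environ.get("PMI_RANK", "0")
    return [
        "strace",
        "-k",                       # print a stack trace at each traced call
        "-e", "trace=exit_group",   # trace only the exit_group system call
        "-o", f"strace.{rank}.out", # one output file per rank
        *app_argv,
    ]


# Show the command that rank 2 would run; nothing is executed here.
print(" ".join(strace_command(["./my_application"], rank=2)))
```

To actually run it, a wrapper script started by mpiexec could pass the returned list to `os.execvp("strace", cmd)` instead of printing it.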
Zhanghong_T_
Novice

Dear Dr John,

Thank you very much for your kind reply. I work on a Windows 7 system, and I don't know whether the two tools you recommended work on Windows.

Thanks

high_end_c_
Beginner

Hi

It's unclear whether the cause is the code itself or the way MPI has been set up. You should keep an open mind about both.

If your preferred (serial) debugger is Visual Studio (VS), then you can use it to help with debugging.
Presuming you have already integrated your MPI implementation with VS, then launching

 mpiexec -n 4 full_VS_Executable_name full_MPI_Executable_name

should start 4 instances of VS, each running one MPI process. Start the process in each VS instance, one by one, and then off you go.

 

Yours, Michael

http://highendcompute.co.uk

@highendcompute

John_D_6
New Contributor I

Ah, I see. Indeed, these tools are only available on Linux- and Unix-based systems, so I'm afraid they will not help you, unless you migrate to another OS, of course.

Artem_R_Intel1
Employee

Hi Zhanghong,

You could try attaching WinDbg to the problematic MPI process.
