Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Debugging my application with Intel mpirun on Linux

vipulk
Beginner

I am trying to debug my MPI application. This has been built on CentOS 6.7 using Intel MPI libraries (version info below).

I am trying to run gdb in xterm (one window per process). However, I get errors when the application calls MPI_INIT(). I launch the run as follows:

$ mpirun -np <N> xterm -e gdb --args <application along with arguments>

However, I get the errors pasted below for one of the processes. Interestingly, regardless of the number of processes I run, the error always occurs in the process with rank 2. The application runs successfully if I run it without gdb: "mpirun -np <N> <application with arguments>".

I am looking for help figuring out how to make this work. I am trying to move my application from OpenMPI to Intel MPI, and this is a critical piece that needs to work for us to adopt it. The total number of ranks will typically be small for us (< 16), so it is manageable using xterm. In fact, we will eventually run it under gdb in emacs, which provides a much better debugging experience.

Appreciate any help that we can get.

 

Thanks,

Vipul

 

[cli_2]: write_line error; fd=17 buf=:cmd=init pmi_version=1 pmi_subversion=1
:
system msg for write_line failure : Bad file descriptor
[cli_2]: Unable to write to PMI_fd
[cli_2]: write_line error; fd=17 buf=:cmd=get_appnum
:
system msg for write_line failure : Bad file descriptor
Abort(1091087) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136):
MPID_Init(709).......:
MPIR_pmi_init(105)...: PMI_Get_appnum returned -1
[cli_2]: write_line error; fd=17 buf=:cmd=abort exitcode=1091087
:
system msg for write_line failure : Bad file descriptor

Intel(R) MPI Library for Linux* OS, Version 2019 Update 7 Build 20200312 (id: 5dc2dd3e9)
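
The log above shows rank 2 failing to write to descriptor 17 with "Bad file descriptor". Below is a minimal sketch of one way to check whether that descriptor is still open inside the launched process. It assumes (this is not confirmed anywhere in this thread) that the launcher exports the descriptor number in a PMI_FD environment variable, and it relies on the Linux /proc filesystem. The function name fd_state is made up for illustration.

```shell
#!/bin/sh
# Report whether file descriptor $1 is open in the current process.
# Relies on the Linux /proc filesystem.
fd_state() {
    if [ -e "/proc/self/fd/$1" ]; then
        echo open
    else
        echo closed
    fi
}

# ASSUMPTION: the launcher exports the PMI descriptor number as PMI_FD.
# PMI_RANK is the per-process rank variable mentioned later in this thread.
pmi_fd="${PMI_FD:-}"
if [ -n "$pmi_fd" ]; then
    echo "rank=${PMI_RANK:-unknown} PMI_FD=$pmi_fd state=$(fd_state "$pmi_fd")"
else
    echo "rank=${PMI_RANK:-unknown} PMI_FD is not set in this environment"
fi
```

Saved under a hypothetical name such as check_pmi_fd.sh and made executable, it could be launched once directly (mpirun -n 5 ./check_pmi_fd.sh) and once through xterm (mpirun -n 5 xterm -hold -e ./check_pmi_fd.sh) to compare what each rank reports in the two cases.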

PrasanthD_intel
Moderator

Hi Vipul,

 

Thanks for reaching out to us.

We tried to replicate the issue with our sample program and found no error.

Since you get the "Bad file descriptor" error only while debugging with gdb, could you try debugging with -gtool instead?

 

mpirun -n 16 -gtool "gdb:3,5,7-9=attach" ./myprog

 

Also, do you get a similar error when using gdb without xterm?

 mpirun -gdb -n 4 ./test

 

For details on how to use -gtool and gdb, please refer to:

 https://software.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top/debugging-applications/debugging.html

 

https://software.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top/debugging-applications/using-gtool-for-debugging.html

 

Is it possible for you to provide the code or a sample reproducer so we can test from our side?

 Regards

 Prasanth

vipulk
Beginner

Hi Prasanth,

Thanks for your response.

Yes, I am able to run the application in gdb using the -gdb option of mpirun. However, I get the error from my first post when I try to run gdb in a separate window.

 

I am able to reproduce this problem with a simple C test (the ring example from OpenMPI). As I mentioned earlier, strangely, this error occurs only in the process with rank 2.

 

I have attached the C application code for your reference. I run it as

$ mpirun -n 5 xterm -e gdb --args /med/d/vipulk/sandboxes/try/ring_c 10

The application is compiled using gcc 6.2 with the command below:

/med/build/gcc/gcc-6.2.0/rhel6/bin/gcc -I<>/INTELMPI/compilers_and_libraries_2020.1.217/linux/mpi/intel64/include /med/d/vipulk/sandboxes/intelmpi/ring_c.c -o /med/d/vipulk/sandboxes/try/ring_c -L<>/INTELMPI/compilers_and_libraries_2020.1.217/linux/mpi/intel64/lib/release -lmpi

Thanks,

Vipul

 

vipulk
Beginner

Is there a way or an option for mpirun to not capture the stdout/stderr of the application?

My guess is that, in my setup, the application's stdout has already been captured by gdb, and mpirun may be trying to capture it as well, which causes the problem.

I have been trying to work around this issue by running mpirun separately and then attaching gdb to the application processes independently. This works, but stdout/stderr is still captured by mpirun, so I cannot see the output in my gdb session.
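
The attach step of that workaround can be sketched as below. This is only an illustration under a few assumptions: the ranks run on the local host, the process is named ring_c (the test program from this thread), and pgrep is available. The function names attach_cmd and list_attach_cmds are made up. The script only prints the commands instead of running them, and it does not change where stdout/stderr go.

```shell
#!/bin/sh
# Print the command that would open one xterm with gdb attached to PID $1.
attach_cmd() {
    echo "xterm -e gdb -p $1"
}

# Print one attach command per local process whose name is exactly $1.
# pgrep -x matches the process name exactly and prints one PID per line.
list_attach_cmds() {
    for pid in $(pgrep -x "$1"); do
        attach_cmd "$pid"
    done
}

# Default to the ring_c test program; pass another process name as $1.
list_attach_cmds "${1:-ring_c}"
```

Each printed line can be pasted into a shell (with a trailing & to keep the shell free) once the MPI job has been started in another terminal.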

Thanks,

Vipul

 

PrasanthD_intel
Moderator

Hi Vipul,

 

We tried the ring program you provided, and it ran without any errors under gdb in xterm.

I have attached a screenshot of process 2 running.

Also, is there a specific use case that requires running gdb in an external window?

Regarding your question, I am not aware of any mpirun option that stops it from capturing stdout/stderr, and I am not sure whether that is possible.

Could you post your command line or screenshots showing what leads to the file descriptor error?

 

Thanks

Prasanth

 

PrasanthD_intel
Moderator

Hi Vipul,

Is your problem resolved? If not, please update us.

Also, could you provide the command line you use to reproduce the error? We do not get the error you reported with the same program.


Regards

Prasanth


vipulk
Beginner

Hi Prasanth,

Sorry about the late reply.

I am using the command 'mpirun -n 7 xterm -e gdb --args ring_c 10' to run the application under gdb. 

I didn't reply earlier because I was having trouble reproducing the issue. It turns out that, by chance, I was running on a different host with a different gdb in the path, and there the issue did not reproduce.

The gdb that seems to work is version 7.6.1-114.el7 (which is /bin/gdb on that host, running CentOS 7.6.1810). On other machines I have tried a newer gdb version (8.2), but it gives the same problem (the process with PMI_RANK 2 fails in the MPI_INIT function).
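
Since the behavior differs by host and by which gdb is first in the path, a small sketch like the one below can record which gdb each host actually picks up. It is an illustration only; the function name which_tool is made up, and the script just reports the host name, the resolved gdb path, and the first line of gdb's version banner.

```shell
#!/bin/sh
# Print the path of the first executable named $1 on PATH, or "not found".
which_tool() {
    command -v "$1" || echo "not found"
}

# Report the host name and the gdb that this shell would run.
echo "host=$(uname -n) gdb=$(which_tool gdb)"

# Print the first line of the version banner only if gdb is available.
if command -v gdb >/dev/null 2>&1; then
    gdb --version | head -n 1
fi
```

Running it on the host where debugging works and on a host where it fails gives a quick side-by-side record of the gdb binaries involved.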

So it appears to be related to the gdb version (or maybe to some other configuration difference that I do not yet understand). Do you have any ideas that I could try?

Thanks,

Vipul

 

PrasanthD_intel
Moderator

Hi Vipul,


We tested the code you provided under gdb 8.2 in xterm multiple times but did not get the error.

This looks like a problem with GDB rather than with Intel MPI.

We are transferring your query to the subject matter experts for better suggestions.


Regards

Prasanth



vipulk
Beginner

Hi Prasanth,

Thanks for trying this out. My experiments so far have not helped me understand the cause.

So please do let me know if you find anything. It is also possible that there is a setup or configuration issue on my side, but I cannot figure out what it would be.

Thanks,

Vipul
