Background and summary of problem
I am the developer of a debugging tool called mdb (https://github.com/TomMelt/mdb?tab=readme-ov-file).
This tool is written in Python, but it is essentially a wrapper around various debugging backends. It currently works with gdb, cuda-gdb and lldb.
I mostly use Open MPI, but I have collaborators who use Intel oneAPI MPI.
When I test mdb with Intel MPI (launched with Intel MPI's mpirun), it crashes when I try to step over the MPI initialization, e.g.,
MPI_Init(NULL, NULL);
The error I get is:
0: Continuing.
0: [cli_0]: write_line error; fd=9 buf=:cmd=init pmi_version=1 pmi_subversion=1
0: :
0: system msg for write_line failure : Bad file descriptor
0: [cli_0]: Unable to write to PMI_fd
0: [cli_0]: write_line error; fd=9 buf=:cmd=get_appnum
0: :
0: system msg for write_line failure : Bad file descriptor
0: Abort(1090831) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
0: MPIR_Init_thread(176):
0: MPID_Init(1439)......:
0: MPIR_pmi_init(131)...: PMI_Get_appnum returned -1
0: [cli_0]: write_line error; fd=9 buf=:cmd=abort exitcode=1090831
0: :
0: system msg for write_line failure : Bad file descriptor
0:
0: Program received signal SIGSEGV, Segmentation fault.
0: MPIR_Err_return_comm (comm_ptr=0x7ffff61f3a60 <_IO_stdfile_2_lock>, fcname=0x7fffffff3ff0 "system msg for write_line failure : Bad file descriptor\n", errcode=1090831) at ../../src/mpi/errhan/errutil.c:309
My debug wrapper works by running gdb as a subprocess. This works with Open MPI but fails with Intel MPI, and I have no idea why. Is there an environment variable I could set to make it work?
This was using intel-oneapi-mpi installed via Spack with the following spec:
intel-oneapi-mpi@2021.10.0+envmods~external-libfabric~generic-names~ilp64 build_system=generic
I am currently investigating this on my laptop (running Ubuntu 22.04), but I have also tested on our local HPC cluster and it fails there too.
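One possible explanation, offered as an assumption and not a confirmed diagnosis: the "Bad file descriptor" on fd=9 would be consistent with the launcher handing the PMI connection to the rank as an inherited file descriptor, which Python's subprocess module then closes before starting gdb (close_fds=True is the default). The sketch below shows that behaviour in isolation with an ordinary pipe. DEMO_FD, CHILD and run_child are hypothetical names for illustration only; nothing here is taken from mdb or Intel MPI.

```python
import os
import subprocess
import sys

# Child program: try to write to the descriptor number named in DEMO_FD
# (a hypothetical stand-in for a descriptor handed over by a launcher).
CHILD = (
    "import os\n"
    "fd = int(os.environ['DEMO_FD'])\n"
    "try:\n"
    "    os.write(fd, b'hello')\n"
    "    print('ok')\n"
    "except OSError as exc:\n"
    "    print('failed:', exc.strerror)\n"
)


def run_child(keep_fd: bool) -> str:
    """Spawn a child and report whether it could use the parent's descriptor."""
    read_end, write_end = os.pipe()
    env = dict(os.environ, DEMO_FD=str(write_end))
    # pass_fds keeps the listed descriptors open in the child. Without it,
    # subprocess closes every descriptor above stderr before exec.
    extra = (write_end,) if keep_fd else ()
    try:
        result = subprocess.run(
            [sys.executable, "-c", CHILD],
            env=env,
            pass_fds=extra,
            capture_output=True,
            text=True,
        )
    finally:
        os.close(read_end)
        os.close(write_end)
    return result.stdout.strip()


if __name__ == "__main__":
    print("descriptor kept:  ", run_child(keep_fd=True))
    print("descriptor closed:", run_child(keep_fd=False))
```

If this is what happens here, a wrapper that launches the debugger with subprocess would need to pass the relevant descriptor through explicitly (for example via pass_fds) for the debugged program to see it.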
Steps to re-create error:
Download and install mdb (optionally, create a venv first):
git clone https://github.com/TomMelt/mdb.git
cd mdb
pip install -e .[termgraph]
Build the sample C++ MPI binary:
cd examples
mpicxx -g -O0 -c simple-mpi-cpp.cpp -o simple-mpi-cpp.o
mpicxx simple-mpi-cpp.o -o simple-mpi-cpp.exe
In one terminal, run the launcher:
mdb launch -n 2 -t simple-mpi-cpp.exe
It will output some text, something like:
running on host: 127.0.1.1
to connect to the debugger run:
mdb attach -h 127.0.1.1 -p 2000
connecting to debuggers ... (2/2)
all debug clients connected
In another terminal, copy and paste the mdb attach command:
mdb attach -h 127.0.1.1 -p 2000
You should then be able to step through the code using "command n", which sends the next command ("n") to all processes. The output will look something like:
mdb attach -h 127.0.1.1 -p 2000
mdb - mpi debugger - built on various backends. Type ? for more info. To exit interactive mode type "q", "quit", "Ctrl+D" or "Ctrl+]".
(mdb 0-1) command n
0: 24 var = 0.;
************************************************************************
1: 24 var = 0.;
(mdb 0-1)
0: 26 MPI_Init(NULL, NULL);
************************************************************************
1: 26 MPI_Init(NULL, NULL);
(mdb 0-1)
0: [cli_0]: write_line error; fd=9 buf=:cmd=init pmi_version=1 pmi_subversion=1
0: :
0: system msg for write_line failure : Bad file descriptor
0: [cli_0]: Unable to write to PMI_fd
0: [cli_0]: write_line error; fd=9 buf=:cmd=get_appnum
0: :
0: system msg for write_line failure : Bad file descriptor
0: Abort(1090831) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
0: MPIR_Init_thread(176):
0: MPID_Init(1439)......:
0: MPIR_pmi_init(131)...: PMI_Get_appnum returned -1
0: [cli_0]: write_line error; fd=9 buf=:cmd=abort exitcode=1090831
0: :
0: system msg for write_line failure : Bad file descriptor
0:
0: Program received signal SIGSEGV, Segmentation fault.
0: MPIR_Err_return_comm (comm_ptr=0x7ffff61f3a60 <_IO_stdfile_2_lock>, fcname=0x7fffffff3fd0 "system msg for write_line failure : Bad file descriptor\n", errcode=1090831) at ../../src/mpi/errhan/errutil.c:309
0: 309 ../../src/mpi/errhan/errutil.c: No such file or directory.
************************************************************************
1: [cli_1]: write_line error; fd=10 buf=:cmd=init pmi_version=1 pmi_subversion=1
1: :
1: system msg for write_line failure : Bad file descriptor
1: [cli_1]: Unable to write to PMI_fd
1: [cli_1]: write_line error; fd=10 buf=:cmd=get_appnum
1: :
1: system msg for write_line failure : Bad file descriptor
1: Abort(1090831) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
1: MPIR_Init_thread(176):
1: MPID_Init(1439)......:
1: MPIR_pmi_init(131)...: PMI_Get_appnum returned -1
1: [cli_1]: write_line error; fd=10 buf=:cmd=abort exitcode=1090831
1: :
1: system msg for write_line failure : Bad file descriptor
1:
1: Program received signal SIGSEGV, Segmentation fault.
1: MPIR_Err_return_comm (comm_ptr=0x7ffff61f3a60 <_IO_stdfile_2_lock>, fcname=0x7fffffff3fd0 "system msg for write_line failure : Bad file descriptor\n", errcode=1090831) at ../../src/mpi/errhan/errutil.c:309
1: 309 ../../src/mpi/errhan/errutil.c: No such file or directory.
(mdb 0-1)
Please let me know if you have any suggestions. Thanks for taking the time to read my query, and let me know if I can provide any more information.
FWIW, I also tried running with lldb as the backend and it fails too.
To test lldb, use the following launch command instead of the one above:
mdb launch -n 2 -t simple-mpi-cpp.exe -b lldb
I get a similar error:
$ mdb attach -h 127.0.1.1 -p 2000
mdb - mpi debugger - built on various backends. Type ? for more info. To exit interactive mode type "q", "quit", "Ctrl+D" or "Ctrl+]".
(mdb 0-1) command n
0: Process 85331 stopped
0: * thread #1, name = 'simple-mpi-cpp.', stop reason = step over
0: frame #0: 0x0000555555555368 simple-mpi-cpp.exe`main at simple-mpi-cpp.cpp:26:11
0: 23
0: 24 var = 0.;
0: 25
0: -> 26 MPI_Init(NULL, NULL);
0: ^
0: 27 MPI_Comm_size(MPI_COMM_WORLD, &size_of_cluster);
0: 28 MPI_Comm_rank(MPI_COMM_WORLD, &process_rank);
0: 29
************************************************************************
1: Process 85332 stopped
1: * thread #1, name = 'simple-mpi-cpp.', stop reason = step over
1: frame #0: 0x0000555555555368 simple-mpi-cpp.exe`main at simple-mpi-cpp.cpp:26:11
1: 23
1: 24 var = 0.;
1: 25
1: -> 26 MPI_Init(NULL, NULL);
1: ^
1: 27 MPI_Comm_size(MPI_COMM_WORLD, &size_of_cluster);
1: 28 MPI_Comm_rank(MPI_COMM_WORLD, &process_rank);
1: 29
(mdb 0-1)
0: [cli_0]: write_line error; fd=9 buf=:cmd=init pmi_version=1 pmi_subversion=1
0: :
0: system msg for write_line failure : Bad file descriptor
0: [cli_0]: Unable to write to PMI_fd
0: [cli_0]: write_line error; fd=9 buf=:cmd=get_appnum
0: :
0: system msg for write_line failure : Bad file descriptor
0: Abort(1090831) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
0: MPIR_Init_thread(176):
0: MPID_Init(1439)......:
0: MPIR_pmi_init(131)...: PMI_Get_appnum returned -1
0: [cli_0]: write_line error; fd=9 buf=:cmd=abort exitcode=1090831
0: :
0: system msg for write_line failure : Bad file descriptor
0: Process 85331 stopped
0: * thread #1, name = 'simple-mpi-cpp.', stop reason = signal SIGSEGV: invalid address (fault address: 0x0)
0: frame #0: 0x00007ffff6612be1 libmpi.so.12`MPIR_Err_return_comm(comm_ptr=0x00007ffff61f3a60, fcname="system msg for write_line failure : Bad file descriptor\n", errcode=1090831) at errutil.c:309
************************************************************************
1: [cli_1]: write_line error; fd=10 buf=:cmd=init pmi_version=1 pmi_subversion=1
1: :
1: system msg for write_line failure : Bad file descriptor
1: [cli_1]: Unable to write to PMI_fd
1: [cli_1]: write_line error; fd=10 buf=:cmd=get_appnum
1: :
1: system msg for write_line failure : Bad file descriptor
1: Abort(1090831) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
1: MPIR_Init_thread(176):
1: MPID_Init(1439)......:
1: MPIR_pmi_init(131)...: PMI_Get_appnum returned -1
1: [cli_1]: write_line error; fd=10 buf=:cmd=abort exitcode=1090831
1: :
1: system msg for write_line failure : Bad file descriptor
1: Process 85332 stopped
1: * thread #1, name = 'simple-mpi-cpp.', stop reason = signal SIGSEGV: invalid address (fault address: 0x0)
1: frame #0: 0x00007ffff6612be1 libmpi.so.12`MPIR_Err_return_comm(comm_ptr=0x00007ffff61f3a60, fcname="system msg for write_line failure : Bad file descriptor\n", errcode=1090831) at errutil.c:309
FYI, I decided to check MPICH as well.
MPICH initially had the same problem, which appears to be related to https://github.com/pmodels/mpich/issues/2063.
I managed to get MPICH working by adding the flag "--pmi-port".
MPICH now works, but I cannot find a similar flag for Intel MPI. Do you have any ideas?
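To see which PMI transport a launcher is actually offering, a small diagnostic run from inside the wrapper (before the debugger starts) may help. The sketch below assumes the launcher exports PMI_FD (an inherited descriptor) or PMI_PORT (a host:port to connect to), which are the variable names used by MPICH's PMI-1 client as far as I know. Whether Intel MPI uses the same names is an assumption, and describe_pmi is a hypothetical helper, not part of mdb.

```python
import os


def describe_pmi(env=None):
    """Summarise how the launcher expects this process to reach PMI.

    PMI_FD and PMI_PORT are assumed variable names; adjust them if the
    launcher in use exports something different.
    """
    env = os.environ if env is None else env
    report = {}

    fd_text = env.get("PMI_FD")
    if fd_text is not None:
        try:
            # fstat succeeds only if the descriptor is open in this process.
            os.fstat(int(fd_text))
            report["PMI_FD"] = f"{fd_text} (open)"
        except (OSError, ValueError):
            report["PMI_FD"] = f"{fd_text} (not usable in this process)"

    port = env.get("PMI_PORT")
    if port is not None:
        report["PMI_PORT"] = port

    return report


if __name__ == "__main__":
    print(describe_pmi() or "no PMI_FD or PMI_PORT in the environment")
```

A PMI_FD entry reported as not usable would point at the descriptor being closed somewhere between the launcher and the debugged program, whereas a PMI_PORT entry would suggest the socket-based path that the MPICH flag above selects.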