I have an mpi4py Python code that spawns a Fortran executable.
The code proceeds happily enough, spawning and disconnecting from the Fortran child; however, occasionally the code eventually fails with the following error:
Abort(3188623) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)..........:
MPID_Init(958).................:
MPIDI_OFI_mpi_init_hook(1499)..:
MPID_Comm_connect(250).........:
MPIDI_OFI_mpi_comm_connect(655):
dynproc_exchange_map(534)......:
(unknown)(): Other MPI error
I'm unable to find out much about what this error means and why it happens, but it happens when attempting a spawn.
Has anyone seen this error before using mpi4py and know why it might happen?
I'm using the Intel MPI library and compilers (Parallel Studio XE Cluster: intel_2020/compilers_and_libraries_2020.0.166), Python 3.6.9 and mpi4py 3.0.3.
Thanks
Hi Conn,
There isn't much to infer from the debug info you have provided, except where the error originated.
I have tried spawning multiple Python executables but haven't faced any error.
Could you please provide us with a sample reproducer (Python and Fortran code), along with the command line? That would help us a lot.
Regards
Prasanth
Hi Prasanth,
Thanks for the reply.
I've attached a dummy code to reproduce the issue.
Compile test_executable.f90:
mpif90 test_executable.f90 -o test_executable
and run run_test.py in the same folder as:
mpirun -np 1 python3 run_test.py
Thanks,
Conn
Hi Prasanth,
Here is a simpler version, without any data being passed back and forth between the child and parent, that produces the error:
hello.f90:
program hello
  implicit none
  include 'mpif.h'
  integer :: rank, size, ierr
  integer :: mpi_comm_parent

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_GET_PARENT(mpi_comm_parent, ierr)
  print *, "hello from spawned child", rank
  if (mpi_comm_parent .ne. MPI_COMM_NULL) then
    call MPI_BARRIER(mpi_comm_parent, ierr)
    call MPI_COMM_DISCONNECT(mpi_comm_parent, ierr)
  end if
  call MPI_FINALIZE(ierr)
end program hello
and the python runner:
#!/usr/bin/env python3
from mpi4py import MPI
import sys
import numpy as np

my_comm = MPI.COMM_WORLD
my_rank = MPI.COMM_WORLD.Get_rank()
size = my_comm.Get_size()

if __name__ == "__main__":
    executable = "./hello"
    for i in range(2000):
        print("Spawning", i)
        # spawn 4 copies of the child, meet them at the barrier, then detach
        commspawn = MPI.COMM_SELF.Spawn(executable, args="", maxprocs=4)
        commspawn.Barrier()
        commspawn.Disconnect()
        sys.stdout.flush()
    MPI.COMM_WORLD.Barrier()
    MPI.Finalize()
The error doesn't seem to happen systematically at any particular iteration either. Error message:
Abort(3188623) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)..........:
MPID_Init(958).................:
MPIDI_OFI_mpi_init_hook(1499)..:
MPID_Comm_connect(250).........:
MPIDI_OFI_mpi_comm_connect(655):
dynproc_exchange_map(534)......:
(unknown)(): Other MPI error
[mpiexec@chsv-beryl] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydra_sock_intel.c:362): write error (Bad file descriptor)
[mpiexec@chsv-beryl] cmd_bcast_root (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:171): error sending cmd 15 to proxy
[mpiexec@chsv-beryl] send_abort_rank_downstream (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:551): unable to send response downstream
[mpiexec@chsv-beryl] control_cb (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1601): unable to send abort rank to downstreams
[mpiexec@chsv-beryl] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:80): callback returned error status
[mpiexec@chsv-beryl] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:2007): error waiting for event
Sorry, terrible formatting there:
program hello
  implicit none
  include 'mpif.h'
  integer :: rank, size, ierr
  integer :: mpi_comm_parent

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_GET_PARENT(mpi_comm_parent, ierr)
  print *, "hello from spawned child", rank
  if (mpi_comm_parent .ne. MPI_COMM_NULL) then
    call MPI_BARRIER(mpi_comm_parent, ierr)
    call MPI_COMM_DISCONNECT(mpi_comm_parent, ierr)
  end if
  call MPI_FINALIZE(ierr)
end program hello
#!/usr/bin/env python3
from mpi4py import MPI
import sys
import numpy as np

my_comm = MPI.COMM_WORLD
my_rank = MPI.COMM_WORLD.Get_rank()
size = my_comm.Get_size()

if __name__ == "__main__":
    executable = "./hello"
    for i in range(2000):
        print("Spawning", i)
        # spawn 4 copies of the child, meet them at the barrier, then detach
        commspawn = MPI.COMM_SELF.Spawn(executable, args="", maxprocs=4)  # , info=mpi_info
        commspawn.Barrier()
        commspawn.Disconnect()
        sys.stdout.flush()
    MPI.COMM_WORLD.Barrier()
    MPI.Finalize()
Hi Conn,
I think this error is due to the per-user process limits in Linux.
I have reproduced a similar error, and when I reduce the total number of spawned processes to 330 I am able to spawn without any errors on a single node.
Could you please check at what count you get this error? Also, could you share the output of ulimit -a and the MPI version you are using?
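For reference, the specific limit in question (max user processes) can also be queried on its own, assuming a bash-like shell:
$ ulimit -u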
Regards
Prasanth
Hi Conn,
We haven't heard back from you. Have you checked with a lower number of spawned processes?
Let us know the results.
Regards
Prasanth
Hi Prasanth,
Yes, it is possible to spawn several instances of the executable, but it eventually results in an error.
Is there a way to remove the limit?
The output from ulimit is:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 513221
max locked memory (kbytes, -l) 65536
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 65536
cpu time (seconds, -t) unlimited
max user processes (-u) 513221
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
and I am using Intel MPI:
Intel(R) MPI Library for Linux* OS, Version 2019 Update 6 Build 20191024 (id: 082ae5608)
Thanks,
Conn
Hi Again,
It's also worth noting that there is a barrier preventing multiple instances of the executable from spawning at the same time, so there shouldn't be an excessive number of processes running at once.
Thanks,
Conn
Hi Conn,
As you have said, there shouldn't be multiple instances running at the same time, and that is indeed the case. Since there is no dependency between the spawned processes, each terminates irrespective of the others.
In my previous replies I mentioned this might be due to too many open processes, but on checking, that does not appear to be the case.
I am now transferring this case to the concerned team, who can debug it and answer better. We will get back to you soon. Thanks for your patience.
Regards
Prasanth
Hi O_Rourke__Conn,
Thanks for sharing a reproducer. I experienced an error, though not exactly the same one that you see, at the 1008th iteration (this number changes from run to run but stays in the neighborhood of 1000). Here's mine, with the Intel Fortran Compiler 2021.2, Intel MPI Library 2021.2, and Intel Distribution for Python 3.7.9:
[mpiexec@s001-n061] enqueue_control_fd (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:72): assert (!closed) failed
[mpiexec@s001-n061] local_launch (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:546): error enqueuing control fd
[mpiexec@s001-n061] single_launch (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:635): local launch error
[mpiexec@s001-n061] launch_bstrap_proxies (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:825): single launch error
[mpiexec@s001-n061] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1019): unable to launch bstrap proxy
[mpiexec@s001-n061] do_spawn (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1138): error setting up the bootstrap proxies
It turns out that every spawn leaves behind a hydra_pmi_proxy process running on the machine. Eventually the number of PIDs on the machine exceeds the maximum number of processes allowed for the user, and when this happens new bootstrap proxies can no longer be launched, resulting in the above error.
On my machine,
$ ulimit -a | grep "max user processes"
max user processes (-u) 1024
And when the code reaches the 1008th iteration of the loop, the number of PIDs approaches 1024 and the above error comes up. I track the number of running PIDs using the following command:
$ watch -n 0.5 "ps -a | wc -l"
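A more targeted way to watch the leak is to count the leftover proxies themselves, for example (assuming pgrep is available and the proxies run under your own user):
$ watch -n 0.5 'pgrep -c -u "$USER" hydra_pmi_proxy'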
The error that you originally reported might also be related to the max user processes limit.
Best regards,
Amar
Hi O_Rourke__Conn,
Please let me know if you had further questions.
Best regards,
Amar
Hi Amar,
I expect you are right. I haven't looked at this in a while, as I refactored the code I was working on, but I do recall seeing hydra_pmi_proxy processes lying around.
For future reference is there a way to kill them off?
Thanks,
Conn
Hi O_Rourke__Conn,
I am checking why the runtime doesn't already do this. Allow me some time to get back to you.
Best regards,
Amar
Hi O_Rourke__Conn,
Could you please try launching your application with the following environment variables and report your findings?
I_MPI_SPAWN_EXPERIMENTAL=1
I_MPI_SPAWN=1
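For example, with the reproducer posted earlier in this thread, the launch would look something like this (assuming a bash-like shell):
I_MPI_SPAWN_EXPERIMENTAL=1 I_MPI_SPAWN=1 mpirun -np 1 python3 run_test.py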
Best regards,
Amar
Hi O_Rourke__Conn,
Just wanted to check if you have any updates.
Best regards,
Amar
Closing this thread due to inactivity. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.