I have an mpi4py Python code that spawns a Fortran executable.
The code proceeds happily enough, spawning and disconnecting from the Fortran child; however, occasionally the code eventually fails with the following error:
Abort(3188623) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)..........:
MPID_Init(958).................:
MPIDI_OFI_mpi_init_hook(1499)..:
MPID_Comm_connect(250).........:
MPIDI_OFI_mpi_comm_connect(655):
dynproc_exchange_map(534)......:
(unknown)(): Other MPI error
I'm unable to find out much about what this error means and why it happens, but it happens when attempting a spawn.
Has anyone seen this error before using mpi4py and know why it might happen?
I'm using the Intel MPI library and compilers (Parallel Studio XE Cluster: intel_2020/compilers_and_libraries_2020.0.166), Python 3.6.9 and mpi4py 3.0.3.
Thanks
Hi Conn,
There isn't much to infer from the debug info you have provided, except where the error originated.
I have tried spawning multiple Python executables but haven't faced any error.
Could you please provide us with a sample reproducer (Python and Fortran code), along with the command line? That would help us a lot.
Regards
Prasanth
Hi Prasanth,
Thanks for the reply.
I've attached a dummy code to reproduce the issue.
Compile test_executable.f90:
mpif90 test_executable.f90 -o test_executable
and run run_test.py in the same folder as:
mpirun -np 1 python3 run_test.py
Thanks,
Conn
Hi Prasanth,
Here is a simpler version, without any data being passed back and forth between the child and parent, that produces the error:
hello.f90:
program hello
  implicit none
  include 'mpif.h'
  integer :: rank, size, ierr
  integer :: mpi_comm_parent

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_GET_PARENT(mpi_comm_parent, ierr)
  print *, "hello from spawned child", rank
  if (mpi_comm_parent .ne. MPI_COMM_NULL) then
    call MPI_BARRIER(mpi_comm_parent, ierr)
    call MPI_COMM_DISCONNECT(mpi_comm_parent, ierr)
  end if
  call MPI_FINALIZE(ierr)
end program hello
and the python runner:
#!/usr/bin/env python3
from mpi4py import MPI
import sys
import numpy as np

my_comm = MPI.COMM_WORLD
my_rank = MPI.COMM_WORLD.Get_rank()
size = my_comm.Get_size()

if __name__ == "__main__":
    executable = "./hello"
    for i in range(2000):
        print("Spawning", i)
        # spawn 4 copies of the child, meet them at the barrier, then detach
        commspawn = MPI.COMM_SELF.Spawn(executable, args="", maxprocs=4)
        commspawn.Barrier()
        commspawn.Disconnect()
        sys.stdout.flush()
    MPI.COMM_WORLD.Barrier()
    MPI.Finalize()
The error doesn't seem to happen systematically at any particular iteration either. Error message:
Abort(3188623) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)..........:
MPID_Init(958).................:
MPIDI_OFI_mpi_init_hook(1499)..:
MPID_Comm_connect(250).........:
MPIDI_OFI_mpi_comm_connect(655):
dynproc_exchange_map(534)......:
(unknown)(): Other MPI error
[mpiexec@chsv-beryl] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydra_sock_intel.c:362): write error (Bad file descriptor)
[mpiexec@chsv-beryl] cmd_bcast_root (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:171): error sending cmd 15 to proxy
[mpiexec@chsv-beryl] send_abort_rank_downstream (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:551): unable to send response downstream
[mpiexec@chsv-beryl] control_cb (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1601): unable to send abort rank to downstreams
[mpiexec@chsv-beryl] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:80): callback returned error status
[mpiexec@chsv-beryl] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:2007): error waiting for event
Sorry, terrible formatting there:
program hello
  implicit none
  include 'mpif.h'
  integer :: rank, size, ierr
  integer :: mpi_comm_parent

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_GET_PARENT(mpi_comm_parent, ierr)
  print *, "hello from spawned child", rank
  if (mpi_comm_parent .ne. MPI_COMM_NULL) then
    call MPI_BARRIER(mpi_comm_parent, ierr)
    call MPI_COMM_DISCONNECT(mpi_comm_parent, ierr)
  end if
  call MPI_FINALIZE(ierr)
end program hello
#!/usr/bin/env python3
from mpi4py import MPI
import sys
import numpy as np

my_comm = MPI.COMM_WORLD
my_rank = MPI.COMM_WORLD.Get_rank()
size = my_comm.Get_size()

if __name__ == "__main__":
    executable = "./hello"
    for i in range(2000):
        print("Spawning", i)
        # spawn 4 copies of the child, meet them at the barrier, then detach
        commspawn = MPI.COMM_SELF.Spawn(executable, args="", maxprocs=4)  # , info=mpi_info
        commspawn.Barrier()
        commspawn.Disconnect()
        sys.stdout.flush()
    MPI.COMM_WORLD.Barrier()
    MPI.Finalize()
Hi Conn,
I think this error is due to the per-user process limits in Linux.
I have reproduced a similar error, and when I reduce the total number of spawned processes to 330 I am able to spawn without any errors on a single node.
Could you please check at what count you get this error? Also, could you share the output of ulimit -a and the MPI version you are using?
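For reference, the specific limit in question (max user processes) can also be queried on its own, assuming a bash-like shell:
$ ulimit -u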
Regards
Prasanth
Hi Conn,
We haven't heard back from you. Have you checked with a lower number of spawned processes?
Let us know the results.
Regards
Prasanth
Hi Prasanth,
Yes, it is possible to spawn several instances of the executable, but it eventually results in an error.
Is there a way to remove the limit?
The output from ulimit is:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 513221
max locked memory (kbytes, -l) 65536
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 65536
cpu time (seconds, -t) unlimited
max user processes (-u) 513221
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
and I am using Intel MPI:
Intel(R) MPI Library for Linux* OS, Version 2019 Update 6 Build 20191024 (id: 082ae5608)
Thanks,
Conn
Hi Again,
It's also worth noting that there is a barrier preventing multiple instances of the executable from spawning at the same time, so there shouldn't be an excessive number of processes running at once.
Thanks,
Conn
Hi Conn,
As you have said, there shouldn't be multiple instances running at the same time, and that is indeed the case. Since there is no dependency between the spawned processes, each terminates irrespective of the others.
In my previous replies I mentioned this might be due to too many open processes, but on checking, that does not appear to be the case.
I am now transferring this case to the concerned team, who can debug it and answer better. We will get back to you soon. Thanks for your patience.
Regards
Prasanth
Hi O_Rourke__Conn,
Thanks for sharing a reproducer. I experienced an error, though not exactly the same one that you see, at the 1008th iteration (this number changes from run to run but stays in the neighborhood of 1000). Here's mine, with the Intel Fortran Compiler 2021.2, Intel MPI Library 2021.2, and Intel Distribution for Python 3.7.9:
[mpiexec@s001-n061] enqueue_control_fd (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:72): assert (!closed) failed
[mpiexec@s001-n061] local_launch (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:546): error enqueuing control fd
[mpiexec@s001-n061] single_launch (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:635): local launch error
[mpiexec@s001-n061] launch_bstrap_proxies (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:825): single launch error
[mpiexec@s001-n061] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1019): unable to launch bstrap proxy
[mpiexec@s001-n061] do_spawn (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1138): error setting up the bootstrap proxies
It turns out that every spawn leaves behind a hydra_pmi_proxy process running on the machine. Eventually the number of PIDs on the machine exceeds the maximum number of processes allowed for the user, and when this happens new bootstrap proxies can no longer be launched, resulting in the above error.
On my machine,
$ ulimit -a | grep "max user processes"
max user processes (-u) 1024
And when the code reaches the 1008th iteration of the loop, the number of PIDs approaches 1024 and the above error comes up. I track the number of running PIDs using the following command:
$ watch -n 0.5 "ps -a | wc -l"
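A more targeted way to watch the leak is to count the leftover proxies themselves, for example (assuming pgrep is available and the proxies run under your own user):
$ watch -n 0.5 'pgrep -c -u "$USER" hydra_pmi_proxy'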
The error that you originally reported might also be related to the max user processes limit.
Best regards,
Amar
Hi O_Rourke__Conn,
Please let me know if you had further questions.
Best regards,
Amar
Hi Amar,
I expect you are right. I haven't looked at this in a while, as I refactored the code I was working on, but I do recall seeing hydra_pmi_proxy processes lying around.
For future reference is there a way to kill them off?
Thanks,
Conn
Hi O_Rourke__Conn,
I am checking why the runtime doesn't already do this. Allow me some time to get back to you.
Best regards,
Amar
Hi O_Rourke__Conn,
Could you please try launching your application with the following environment variables and report your findings?
I_MPI_SPAWN_EXPERIMENTAL=1
I_MPI_SPAWN=1
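For example, with the reproducer posted earlier in this thread, the launch would look something like this (assuming a bash-like shell):
I_MPI_SPAWN_EXPERIMENTAL=1 I_MPI_SPAWN=1 mpirun -np 1 python3 run_test.py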
Best regards,
Amar
Hi O_Rourke__Conn,
Just wanted to check if you have any updates.
Best regards,
Amar
Closing this thread due to inactivity. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.