I am using Intel(R) MPI Library for Linux* OS, Version 4.1 Update 1 Build 20130522 in a Linux cluster environment. Running the following script produces a race condition. All libraries in use are compiled against this MPI library.
[python]
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank builds a single-member group/communicator containing only itself ...
world_group = comm.Get_group()
my_group = world_group.Incl([rank])
my_comm = comm.Create(my_group)

# ... and spawns one child, so all ranks call MPI_Comm_spawn concurrently.
intercomm = my_comm.Spawn("./script.sh", [], 1, MPI.INFO_NULL, 0)
[/python]
Randomly occurring error:
[mpiexec@capp1] control_cb (./pm/pmiserv/pmiserv_cb.c:715): assert (!closed) failed
[mpiexec@capp1] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@capp1] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:430): error waiting for event
[mpiexec@capp1] main (./ui/mpich/mpiexec.c:847): process manager error waiting for completion
A possible workaround is to serialize the MPI_Comm_spawn calls:
[python]
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

world_group = comm.Get_group()
my_group = world_group.Incl([rank])
my_comm = comm.Create(my_group)

# Let only one rank at a time call MPI_Comm_spawn; the barrier enforces the ordering.
for r in range(0, size):
    if rank == r:
        intercomm = my_comm.Spawn("./script.sh", [], 1, MPI.INFO_NULL, 0)
    comm.Barrier()
[/python]
The error also disappears if comm is used to call MPI_Comm_spawn collectively, which likewise forces serialization. A fully parallel call is not reliably possible.
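For reference, a minimal sketch of the collective variant mentioned above (using maxprocs equal to size, i.e. one child per parent rank, which is my assumption):
[python]
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()

# All ranks enter the same collective MPI_Comm_spawn call on COMM_WORLD;
# the spawn arguments are only significant at the root rank (0 here).
intercomm = comm.Spawn("./script.sh", [], size, MPI.INFO_NULL, 0)
[/python]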
It seems to me that the process manager is not multi-process safe in this case. In my opinion and understanding, this usage scenario should be possible according to the MPI standard. I tested this behaviour with both the shm and rdma fabrics.
Thank you very much for your input!
Hi,
Thank you for the message.
Unfortunately, Python* is not supported by the Intel(R) MPI Library.
Also, I can't reproduce the issue:
-bash-4.1$ mpiexec.hydra -n 2 python /tmp/mpi4py-1.3.1/demo/helloworld.py
Hello, World! I am process 0 of 2 on host1.
Hello, World! I am process 1 of 2 on host1.
-bash-4.1$ mpiexec.hydra -n 2 python ./spawn.py
=====================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
-bash-4.1$ mpiexec.hydra -n 2 python ./spawn_serial.py
=====================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
The error message from your post indicates that the connection between pmi_proxy and mpiexec.hydra wasn't established. This can happen, for example, due to file descriptor exhaustion, as mentioned in your topic "Erroneus [pmi_proxy]".
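(Not part of the original reply, but as a quick way to check whether file descriptors are the limiting factor, one could inspect the per-process limit; a sketch using the Python standard library, since the reproducers in this thread are Python:)
[python]
import resource

# Report the soft/hard limits on open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("RLIMIT_NOFILE: soft=%d, hard=%d" % (soft, hard))

# Optionally raise the soft limit to the hard limit before spawning.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
[/python]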
Please provide a reproducer in a supported language: C, C++, Fortran 77, or Fortran 95 (see the Intel(R) MPI Library 4.1 Update 1 for Linux* OS Release Notes).
--
Dmitry
I'm experiencing the same problem with my Fortran MPI program (source code attached).
In the driver (driver.f90), every task calls MPI_COMM_SPAWN(..., MPI_COMM_SELF, ...) to launch a simple MPI hello-world program. The driver tasks then wait until the hello-world jobs finish before continuing.
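In outline, the driver does roughly the following (a hypothetical mpi4py equivalent of the attached Fortran source, not the actual code; "./hello.x" is a placeholder for the hello-world executable):
[python]
from mpi4py import MPI

# Each parent task spawns one child over its own MPI_COMM_SELF.
intercomm = MPI.COMM_SELF.Spawn("./hello.x", [], 1, MPI.INFO_NULL, 0)

# Wait for the spawned job before continuing; this assumes the child calls the
# matching MPI_Comm_disconnect on its parent communicator before exiting.
intercomm.Disconnect()
[/python]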
I'm using Intel MPI 4.1.3.049 and ifort 13.1.3, compiling with -O0 (for debugging purposes).
I execute using mpirun -np <X> ./driver.x (and I vary X between 2 and 8).
Sometimes it executes fine, sometimes I receive the following error message (it seems pretty random, although it happens more often when I set the number of tasks to 8):
================================================
mpirun -np 4 ./driver.x
Starting, task 2
ARGS(1): 2
Starting, task 0
Starting, task 1
Starting, task 3
ARGS(1): 0
ARGS(1): 1
ARGS(1): 3
[mpiexec@login4] control_cb (./pm/pmiserv/pmiserv_cb.c:717): assert (!closed) failed
[mpiexec@login4] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@login4] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:435): error waiting for event
[mpiexec@login4] main (./ui/mpich/mpiexec.c:901): process manager error waiting for completion
==================================================
Any suggestions as to what the problem might be?
Thanks!