Hello,
The following simple program invokes MPI_Comm_spawn with maxprocs=16 independently on each rank. I have tested it and it works with Intel MPI 18.0.5 with up to 64 ranks. However, with Intel MPI 19.0.9 and more than 16 ranks, the following error occurs:
[proxy:0:2@c161-004.frontera.tacc.utexas.edu] proxy_downstream_control_cb (../../../../../src/pm/i_hydra/proxy/proxy_cb.c:592): received unknown cmd 22
[proxy:0:2@c161-004.frontera.tacc.utexas.edu] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:80): callback returned error status
[proxy:0:2@c161-004.frontera.tacc.utexas.edu] main (../../../../../src/pm/i_hydra/proxy/proxy.c:978): error waiting for event
[proxy:0:6@c161-014.frontera.tacc.utexas.edu] proxy_downstream_control_cb (../../../../../src/pm/i_hydra/proxy/proxy_cb.c:592): received unknown cmd 22
[proxy:0:6@c161-014.frontera.tacc.utexas.edu] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:80): callback returned error status
[proxy:0:6@c161-014.frontera.tacc.utexas.edu] main (../../../../../src/pm/i_hydra/proxy/proxy.c:978): error waiting for event
Any help is much appreciated!
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int world_size, world_rank;
    MPI_Comm sub_comm; /* intercommunicator to the spawned children */
    static char worker_program[100] = "/bin/bash";
    static char *worker_arguments[] = { "-c", "echo spawned bash child", NULL };

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    printf("I am rank %d of %d\n", world_rank, world_size);
    fflush(stdout);

    /* Each rank spawns its own group of children over MPI_COMM_SELF. */
    int nprocs_per_worker = 16;
    int status;
    status = MPI_Comm_spawn(worker_program, worker_arguments,
                            nprocs_per_worker,
                            MPI_INFO_NULL, 0, MPI_COMM_SELF, &sub_comm,
                            MPI_ERRCODES_IGNORE);
    printf("status = %d\n", status);
    fflush(stdout);

    MPI_Finalize();
    return 0;
}
In case anyone is experiencing the same issue, I found that a possible workaround is to set I_MPI_HYDRA_BRANCH_COUNT=0. This increases the startup time when tens of ranks independently spawn large numbers of child processes, but it seems to eliminate the occurrence of the error above.
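A minimal sketch of applying this workaround in a job script, assuming a bash-like shell. The launch line and the binary name test_mpi_spawn are the ones quoted elsewhere in this thread; the ./ prefix and the guard that skips the launch when Intel MPI or the binary is missing are assumptions added here for illustration.

```shell
# Workaround reported in this thread: set the Hydra branch count to 0
# before launching. Startup may be slower when many ranks spawn children.
export I_MPI_HYDRA_BRANCH_COUNT=0

# Launch only if mpirun and the reproducer binary are actually available.
if command -v mpirun >/dev/null 2>&1 && [ -x ./test_mpi_spawn ]; then
    mpirun -rr -n 24 ./test_mpi_spawn
else
    echo "mpirun or ./test_mpi_spawn not found; skipping launch"
fi
```

Exporting the variable in the launching shell, rather than inside the program, matters here because the setting is meant for the Hydra launcher that mpirun starts.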
Hi Ivan,
I tried to reproduce the issue and did not see any error, but the program hangs after spawning some processes.
When I replaced the spawned program with a sample MPI test program, it ran without any errors.
I will contact the internal team regarding this issue and get back to you soon.
Meanwhile, could you please provide the logs from running your code with I_MPI_DEBUG=10 set?
For example: I_MPI_DEBUG=10 mpiexec.hydra -n 20 ./spawn, or run export I_MPI_DEBUG=10 before launching.
Regards
Prasanth
Hi @PrasanthD_intel, thank you very much for your response. Apologies for the hanging test; I was trying to come up with a narrow test for spawn only, and I simplified the reproducer too much. In any case, I realized that I get the error only when -rr is specified:
mpirun -rr -n 24 test_mpi_spawn
(The same error occurs with mpiexec.hydra.)
I use -rr because otherwise the spawned processes all end up on the same node: without -rr I do not get the spawn error, but the spawned processes are not distributed across hosts. I am attaching the log from a run with I_MPI_DEBUG=10. Thank you very much for your help!
Somehow the attachment did not make it, so here is the pasted log:
[0] MPI startup(): Intel(R) MPI Library, Version 2019 Update 9 Build 20200923 (id: abd58e492)
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.10.1-impi
[0] MPI startup(): libfabric provider: mlx
I am rank 17 of 24
I am rank 3 of 24
I am rank 9 of 24
I am rank 20 of 24
I am rank 11 of 24
I am rank 18 of 24
I am rank 15 of 24
I am rank 5 of 24
I am rank 12 of 24
I am rank 7 of 24
I am rank 23 of 24
I am rank 16 of 24
I am rank 22 of 24
I am rank 21 of 24
I am rank 19 of 24
I am rank 1 of 24
[proxy:0:0@c161-153.frontera.tacc.utexas.edu] proxy_downstream_control_cb (../../../../../src/pm/i_hydra/proxy/proxy_cb.c:592): received unknown cmd 22
[proxy:0:0@c161-153.frontera.tacc.utexas.edu] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:80): callback returned error status
[proxy:0:0@c161-153.frontera.tacc.utexas.edu] main (../../../../../src/pm/i_hydra/proxy/proxy.c:978): error waiting for event
[proxy:0:2@c161-161.frontera.tacc.utexas.edu] proxy_downstream_control_cb (../../../../../src/pm/i_hydra/proxy/proxy_cb.c:592): received unknown cmd 22
[proxy:0:2@c161-161.frontera.tacc.utexas.edu] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:80): callback returned error status
[proxy:0:2@c161-161.frontera.tacc.utexas.edu] main (../../../../../src/pm/i_hydra/proxy/proxy.c:978): error waiting for event
[proxy:0:8@c161-173.frontera.tacc.utexas.edu] proxy_downstream_control_cb (../../../../../src/pm/i_hydra/proxy/proxy_cb.c:592): received unknown cmd 22
[proxy:0:8@c161-173.frontera.tacc.utexas.edu] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:80): callback returned error status
[proxy:0:8@c161-173.frontera.tacc.utexas.edu] main (../../../../../src/pm/i_hydra/proxy/proxy.c:978): error waiting for event
[proxy:0:4@c161-163.frontera.tacc.utexas.edu] proxy_downstream_control_cb (../../../../../src/pm/i_hydra/proxy/proxy_cb.c:592): received unknown cmd 22
[proxy:0:4@c161-163.frontera.tacc.utexas.edu] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:80): callback returned error status
[proxy:0:4@c161-163.frontera.tacc.utexas.edu] main (../../../../../src/pm/i_hydra/proxy/proxy.c:978): error waiting for event
[proxy:0:10@c161-181.frontera.tacc.utexas.edu] proxy_downstream_control_cb (../../../../../src/pm/i_hydra/proxy/proxy_cb.c:592): received unknown cmd 22
[proxy:0:10@c161-181.frontera.tacc.utexas.edu] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:80): callback returned error status
[proxy:0:10@c161-181.frontera.tacc.utexas.edu] main (../../../../../src/pm/i_hydra/proxy/proxy.c:978): error waiting for event
[proxy:0:6@c161-171.frontera.tacc.utexas.edu] proxy_downstream_control_cb (../../../../../src/pm/i_hydra/proxy/proxy_cb.c:592): received unknown cmd 22
[proxy:0:6@c161-171.frontera.tacc.utexas.edu] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:80): callback returned error status
[proxy:0:6@c161-171.frontera.tacc.utexas.edu] main (../../../../../src/pm/i_hydra/proxy/proxy.c:978): error waiting for event
[proxy:0:14@c161-191.frontera.tacc.utexas.edu] proxy_downstream_control_cb (../../../../../src/pm/i_hydra/proxy/proxy_cb.c:592): received unknown cmd 22
[proxy:0:14@c161-191.frontera.tacc.utexas.edu] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:80): callback returned error status
[proxy:0:14@c161-191.frontera.tacc.utexas.edu] main (../../../../../src/pm/i_hydra/proxy/proxy.c:978): error waiting for event
[proxy:0:12@c161-183.frontera.tacc.utexas.edu] proxy_downstream_control_cb (../../../../../src/pm/i_hydra/proxy/proxy_cb.c:592): received unknown cmd 22
[proxy:0:12@c161-183.frontera.tacc.utexas.edu] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:80): callback returned error status
[proxy:0:12@c161-183.frontera.tacc.utexas.edu] main (../../../../../src/pm/i_hydra/proxy/proxy.c:978): error waiting for event
srun: error: c161-161: task 1: Exited with exit code 5
srun: error: c161-153: task 0: Exited with exit code 5
srun: error: c161-173: task 4: Exited with exit code 5
srun: error: c161-163: task 2: Exited with exit code 5
srun: error: c161-181: task 5: Exited with exit code 5
srun: error: c161-191: task 7: Exited with exit code 5
srun: error: c161-171: task 3: Exited with exit code 5
srun: error: c161-183: task 6: Exited with exit code 5
I am rank 2 of 24
Please set FI_MLX_ENABLE_SPAWN=yes and try again. This is needed to enable dynamic process management with the mlx provider; see https://software.intel.com/content/www/us/en/develop/articles/intel-mpi-library-2019-over-libfabric.html.
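As a sketch of how this suggestion might be tried, assuming a bash-like shell and the launch line quoted earlier in this thread: the snippet sets the variable, keeps I_MPI_DEBUG=10 so the startup lines report the libfabric provider, and saves the output. The log file name spawn_debug.log and the guard around the launch are hypothetical additions, not part of the original advice.

```shell
# Setting suggested in this thread for the mlx libfabric provider.
export FI_MLX_ENABLE_SPAWN=yes
# Debug level requested earlier in the thread; startup lines name the provider.
export I_MPI_DEBUG=10

# Launch only if mpirun and the reproducer binary are actually available.
if command -v mpirun >/dev/null 2>&1 && [ -x ./test_mpi_spawn ]; then
    # spawn_debug.log is a hypothetical file name chosen for this example.
    mpirun -rr -n 24 ./test_mpi_spawn 2>&1 | tee spawn_debug.log
    # Show which provider was selected, e.g. "libfabric provider: mlx".
    grep "libfabric provider" spawn_debug.log || echo "provider line not found"
else
    echo "mpirun or ./test_mpi_spawn not found; skipping launch"
fi
```

If the provider line shows something other than mlx, this setting is probably not the relevant one for that run.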
Due to lack of replies, this case is being closed for Intel support. Any further replies on this thread will be considered community only.