Ivan_Raikov
Beginner
192 Views

Error with concurrent invocation of MPI_Comm_spawn and Intel MPI 19.x

Hello,

The following simple program invokes MPI_Comm_spawn with maxprocs=16 independently on each rank. I have tested it and it works with Intel MPI 18.0.5 with up to 64 ranks. However, with Intel MPI 19.0.9 and more than 16 ranks, the following error occurs:

 

[proxy:0:2@c161-004.frontera.tacc.utexas.edu] proxy_downstream_control_cb (../../../../../src/pm/i_hydra/proxy/proxy_cb.c:592): received unknown cmd 22
[proxy:0:2@c161-004.frontera.tacc.utexas.edu] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:80): callback returned error status
[proxy:0:2@c161-004.frontera.tacc.utexas.edu] main (../../../../../src/pm/i_hydra/proxy/proxy.c:978): error waiting for event
[proxy:0:6@c161-014.frontera.tacc.utexas.edu] proxy_downstream_control_cb (../../../../../src/pm/i_hydra/proxy/proxy_cb.c:592): received unknown cmd 22
[proxy:0:6@c161-014.frontera.tacc.utexas.edu] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:80): callback returned error status
[proxy:0:6@c161-014.frontera.tacc.utexas.edu] main (../../../../../src/pm/i_hydra/proxy/proxy.c:978): error waiting for event

Any help is much appreciated!

#include <stdio.h>

#include <mpi.h>

int main(int argc, char *argv[])
{
  int world_size, world_rank;
  MPI_Comm sub_comm;           /* intercommunicator to the spawned children */
  static char worker_program[100] = "/bin/bash";
  static char *worker_arguments[] = { "-c", "echo spawned bash child", NULL };

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  printf("I am rank %d of %d\n", world_rank, world_size);
  fflush(stdout);

  /* number of children each rank spawns */
  int nprocs_per_worker = 16;

  /* MPI_COMM_SELF makes the spawn independent on every rank */
  int status;
  status = MPI_Comm_spawn(worker_program, worker_arguments,
                          nprocs_per_worker,
                          MPI_INFO_NULL, 0, MPI_COMM_SELF, &sub_comm,
                          MPI_ERRCODES_IGNORE);
  printf("status = %d\n", status);
  fflush(stdout);
  MPI_Finalize();
  return 0;
}

 

 

Ivan_Raikov
Beginner
170 Views

In case anyone is experiencing the same issue, I found that a possible workaround is to set I_MPI_HYDRA_BRANCH_COUNT=0. This increases the startup time when tens of ranks independently spawn large numbers of child processes, but it seems to eliminate the occurrence of the error above.

PrasanthD_intel
Moderator
159 Views

Hi Ivan,


I tried to reproduce the issue and did not see any error, but the program hangs after spawning some processes.

I replaced the spawned program with a sample MPI test program, and it ran without any errors.

I will contact the internal team regarding this issue and will get back to you soon.

Meanwhile, could you please provide the logs produced with I_MPI_DEBUG=10 set while running your code?

e.g. I_MPI_DEBUG=10 mpiexec.hydra -n 20 ./spawn, or run export I_MPI_DEBUG=10 before launching


Regards

Prasanth


Ivan_Raikov
Beginner
150 Views

Hi @PrasanthD_intel, thank you very much for your response. Apologies for the hanging test; I was trying to come up with a narrow test for spawn only, and I simplified the reproducer too much. Anyway, I realized that I get the error only when -rr is specified:

mpirun -rr -n 24 test_mpi_spawn

(The same error occurs with mpiexec.hydra).

The reason for -rr is that otherwise the spawned processes end up on the same node. Without -rr, I do not get the spawn error, but the spawned processes are not distributed across hosts. I am attaching the log when running with I_MPI_DEBUG=10. Thank you very much for your help!

 

Ivan_Raikov
Beginner
140 Views

Somehow the attachment did not make it, so here is the pasted log:

 

[0] MPI startup(): Intel(R) MPI Library, Version 2019 Update 9  Build 20200923 (id: abd58e492)
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.10.1-impi
[0] MPI startup(): libfabric provider: mlx
I am rank 17 of 24
I am rank 3 of 24
I am rank 9 of 24
I am rank 20 of 24
I am rank 11 of 24
I am rank 18 of 24
I am rank 15 of 24
I am rank 5 of 24
I am rank 12 of 24
I am rank 7 of 24
I am rank 23 of 24
I am rank 16 of 24
I am rank 22 of 24
I am rank 21 of 24
I am rank 19 of 24
I am rank 1 of 24
[proxy:0:0@c161-153.frontera.tacc.utexas.edu] proxy_downstream_control_cb (../../../../../src/pm/i_hydra/proxy/proxy_cb.c:592): received unknown cmd 22
[proxy:0:0@c161-153.frontera.tacc.utexas.edu] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:80): callback returned error status
[proxy:0:0@c161-153.frontera.tacc.utexas.edu] main (../../../../../src/pm/i_hydra/proxy/proxy.c:978): error waiting for event
[proxy:0:2@c161-161.frontera.tacc.utexas.edu] proxy_downstream_control_cb (../../../../../src/pm/i_hydra/proxy/proxy_cb.c:592): received unknown cmd 22
[proxy:0:2@c161-161.frontera.tacc.utexas.edu] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:80): callback returned error status
[proxy:0:2@c161-161.frontera.tacc.utexas.edu] main (../../../../../src/pm/i_hydra/proxy/proxy.c:978): error waiting for event
[proxy:0:8@c161-173.frontera.tacc.utexas.edu] proxy_downstream_control_cb (../../../../../src/pm/i_hydra/proxy/proxy_cb.c:592): received unknown cmd 22
[proxy:0:8@c161-173.frontera.tacc.utexas.edu] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:80): callback returned error status
[proxy:0:8@c161-173.frontera.tacc.utexas.edu] main (../../../../../src/pm/i_hydra/proxy/proxy.c:978): error waiting for event
[proxy:0:4@c161-163.frontera.tacc.utexas.edu] proxy_downstream_control_cb (../../../../../src/pm/i_hydra/proxy/proxy_cb.c:592): received unknown cmd 22
[proxy:0:4@c161-163.frontera.tacc.utexas.edu] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:80): callback returned error status
[proxy:0:4@c161-163.frontera.tacc.utexas.edu] main (../../../../../src/pm/i_hydra/proxy/proxy.c:978): error waiting for event
[proxy:0:10@c161-181.frontera.tacc.utexas.edu] proxy_downstream_control_cb (../../../../../src/pm/i_hydra/proxy/proxy_cb.c:592): received unknown cmd 22
[proxy:0:10@c161-181.frontera.tacc.utexas.edu] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:80): callback returned error status
[proxy:0:10@c161-181.frontera.tacc.utexas.edu] main (../../../../../src/pm/i_hydra/proxy/proxy.c:978): error waiting for event
[proxy:0:6@c161-171.frontera.tacc.utexas.edu] proxy_downstream_control_cb (../../../../../src/pm/i_hydra/proxy/proxy_cb.c:592): received unknown cmd 22
[proxy:0:6@c161-171.frontera.tacc.utexas.edu] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:80): callback returned error status
[proxy:0:6@c161-171.frontera.tacc.utexas.edu] main (../../../../../src/pm/i_hydra/proxy/proxy.c:978): error waiting for event
[proxy:0:14@c161-191.frontera.tacc.utexas.edu] proxy_downstream_control_cb (../../../../../src/pm/i_hydra/proxy/proxy_cb.c:592): received unknown cmd 22
[proxy:0:14@c161-191.frontera.tacc.utexas.edu] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:80): callback returned error status
[proxy:0:14@c161-191.frontera.tacc.utexas.edu] main (../../../../../src/pm/i_hydra/proxy/proxy.c:978): error waiting for event
[proxy:0:12@c161-183.frontera.tacc.utexas.edu] proxy_downstream_control_cb (../../../../../src/pm/i_hydra/proxy/proxy_cb.c:592): received unknown cmd 22
[proxy:0:12@c161-183.frontera.tacc.utexas.edu] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:80): callback returned error status
[proxy:0:12@c161-183.frontera.tacc.utexas.edu] main (../../../../../src/pm/i_hydra/proxy/proxy.c:978): error waiting for event
srun: error: c161-161: task 1: Exited with exit code 5
srun: error: c161-153: task 0: Exited with exit code 5
srun: error: c161-173: task 4: Exited with exit code 5
srun: error: c161-163: task 2: Exited with exit code 5
srun: error: c161-181: task 5: Exited with exit code 5
srun: error: c161-191: task 7: Exited with exit code 5
srun: error: c161-171: task 3: Exited with exit code 5
srun: error: c161-183: task 6: Exited with exit code 5
I am rank 2 of 24
James_T_Intel
Moderator
123 Views

Please set FI_MLX_ENABLE_SPAWN=yes and try again. This is needed to enable dynamic process management with the mlx provider; see https://software.intel.com/content/www/us/en/develop/articles/intel-mpi-library-2019-over-libfabric.....

