Pinning of processes spawned with MPI_Comm_spawn

John_D_6 · ‎10-31-2016

(somehow, my previous post 700216 was initially in a draft state, then got published, but didn't appear on the mailing list)

Hi,

The I_MPI_PIN_* variables can be used to set pretty much any CPU mask for the MPI ranks that mpirun starts. Unfortunately, the Intel MPI library doesn't seem to set the mask correctly for processes that are spawned dynamically.

Here's an example to show the problem:

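program mpispawn
  use mpi
  implicit none

  integer, parameter :: nspawn=4
  ! errcodes needs one entry per spawned process
  integer ierr,errcodes(nspawn),pcomm,mpisize,dumm,rank,root
  character(1000) cmd
  logical master

  call MPI_Init(ierr)
  ! Argument 0 is the name this program was started with
  call get_command_argument(0,cmd)
  print*,'cmd=',trim(cmd)
  ! pcomm is MPI_COMM_NULL unless this process was spawned
  call MPI_Comm_get_parent(pcomm,ierr)
  if (pcomm.eq.MPI_COMM_NULL) then
    print*,'I am the master. Clone myself!'
    master=.true.
    ! Collective over MPI_COMM_WORLD: spawns nspawn clones in total
    call MPI_Comm_spawn(cmd,MPI_ARGV_NULL,nspawn,MPI_INFO_NULL,0,MPI_COMM_WORLD,pcomm,errcodes,ierr)
    ! On an intercommunicator this returns the size of the local group
    call MPI_Comm_size(pcomm,mpisize,ierr)
    print*,'Processes in intercommunicator:',mpisize
    dumm=88
    ! Intercommunicator broadcast: one master passes MPI_ROOT, the others MPI_PROC_NULL
    call MPI_Comm_rank(pcomm,rank,ierr)
    root=MPI_PROC_NULL
    if (rank.eq.0) root=MPI_ROOT
    call MPI_Bcast(dumm,1,MPI_INTEGER,root,pcomm,ierr)
  else
    print*,'I am a clone. Use me'
    master=.false.
    ! Clones name the root by its rank in the remote (master) group
    call MPI_Bcast(dumm,1,MPI_INTEGER,0,pcomm,ierr)
  endif
  call MPI_Comm_rank(pcomm,rank,ierr)
  print*,'rank,master,dumm=',rank,master,dumm
  ! sleep is a non-standard extension; it keeps the processes alive for a while
  call sleep(300)
  call MPI_Barrier(pcomm,ierr)
  call MPI_Finalize(ierr)
end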

I run this example on two nodes, each with two 8-core CPUs. I request core binding and domains with scattered ordering and 3 processes per node (so ranks are bound round-robin to the sockets). mpirun starts 2 MPI processes, and these spawn 4 further MPI processes:

[donners@int1 mpispawn]$ I_MPI_DEBUG=4 mpirun -n 2 -hosts "int1,int2" -ppn 3 -binding "pin=yes;cell=core;domain=1;order=scatter" ./mpi.impi 
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank    Pid      Node name                   Pin cpu
[0] MPI startup(): 0       31036    int1.cartesius.surfsara.nl  {0}
[0] MPI startup(): 1       31037    int1.cartesius.surfsara.nl  {8}
 cmd=./mpi.impi
 I am the master. Clone myself!
 cmd=./mpi.impi
 I am the master. Clone myself!
[1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[1] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[1] MPI startup(): shm and dapl data transfer modes
[0] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[0] MPI startup(): shm and dapl data transfer modes
[0] MPI startup(): reinitialization: shm and dapl data transfer modes
[1] MPI startup(): reinitialization: shm and dapl data transfer modes
[0] MPI startup(): Multi-threaded optimized library
[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[3] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[2] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[1] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[1] MPI startup(): shm and dapl data transfer modes
[3] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[3] MPI startup(): shm and dapl data transfer modes
[0] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[0] MPI startup(): shm and dapl data transfer modes
[2] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[2] MPI startup(): shm and dapl data transfer modes
[0] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[0] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[1] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[1] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[2] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[2] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[3] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[3] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
 Processes in intercommunicator:           2
 rank,master,dumm=           1 T          88
 Processes in intercommunicator:           2
 rank,master,dumm=           0 T          88
[0] MPI startup(): Rank    Pid      Node name                   Pin cpu
[0] MPI startup(): 0       31045    int1.cartesius.surfsara.nl  {0}
[0] MPI startup(): 1       24519    int2.cartesius.surfsara.nl  {0}
[0] MPI startup(): 2       24520    int2.cartesius.surfsara.nl  {8}
[0] MPI startup(): 3       24521    int2.cartesius.surfsara.nl  {1}
 cmd=./mpi.impi
 I am a clone. Use me
 rank,master,dumm=           0 F          88
 cmd=./mpi.impi
 I am a clone. Use me
 cmd=./mpi.impi
 I am a clone. Use me
 cmd=./mpi.impi
 I am a clone. Use me
 rank,master,dumm=           1 F          88
 rank,master,dumm=           2 F          88
 rank,master,dumm=           3 F          88

The 2 initial processes are bound correctly, round-robin to the first core of each socket on the first node (cores 0 and 8). However, the first dynamically spawned rank (pid 31045 on int1) is also bound to core 0, where it competes with initial rank 0 (pid 31036); it seems that it should have been bound to the second core instead. Note that the dynamically spawned processes do get distributed correctly across the nodes, and the binding on the second node is also correct.
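
The pinning can also be cross-checked independently of the I_MPI_DEBUG table by asking the kernel while the ranks are still alive (they sleep for 300 seconds). What follows is only a rough sketch, assuming Linux with /proc and pgrep available; the helper names affinity_of and report_affinity are made up for illustration and are not part of Intel MPI.

```shell
# Sketch only: assumes Linux (/proc) and pgrep; helper names are hypothetical.

# affinity_of STATUS_FILE
# Print the value of the Cpus_allowed_list field from a
# /proc/<pid>/status-style file (e.g. "0", "8" or "0-15").
affinity_of() {
    awk '/^Cpus_allowed_list:/ { print $2 }' "$1"
}

# report_affinity [NAME]
# For every process on this node whose process name is exactly NAME
# (default: mpi.impi, the binary used above), print host, pid and the
# list of CPUs the kernel allows it to run on.
report_affinity() {
    name=${1:-mpi.impi}
    for pid in $(pgrep -x "$name"); do
        # The process may have exited between pgrep and the read.
        [ -r "/proc/$pid/status" ] || continue
        printf '%s %s %s\n' "$(hostname)" "$pid" \
            "$(affinity_of "/proc/$pid/status")"
    done
}
```

After sourcing these definitions on each node, calling report_affinity during the sleep should list one line per rank. Two pids on the same host reporting the same single CPU would confirm the overlap described above.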

What can be done to bind all dynamically spawned processes correctly?