Intel® MPI Library

Mpi_comm_spawn with large number of children hangs at mpi_init

Berger__Philippe

Hello,

I have a Fortran 90 MPI program running on a Linux cluster, built with the intel/2018.0.2 and intelmpi/2018.0.2 compilers, which uses MPI_COMM_SPAWN to spawn one child process of a C++ MPI program per parent process. The idea is that the parent processes are mapped evenly across the nodes; each parent spawns a child, waits for a blocking send/recv from it to signal completion, and then goes on to work with the output of the child.

Here is the call I use to spawn the children:

call MPI_COMM_SPAWN('MUSIC', argv, 1, info, 0, &
        MPI_COMM_SELF, MPI_COMM_CHILD, MPI_ERRCODES_IGNORE, ierr)

So maxprocs=1 process is spawned by each parent on its own communicator (MPI_COMM_SELF), and this happens concurrently across all the parent processes (or whenever each one reaches this call).
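
To make the pattern concrete, here is a stripped-down sketch of the parent side. It assumes the completion message is a single integer sent with tag 0; in the real code, argv carries the child's actual command-line arguments and info may carry spawn hints.

program parent_sketch
  use mpi
  implicit none

  integer :: ierr, info
  integer :: MPI_COMM_CHILD                 ! intercommunicator to the spawned child
  integer :: child_status
  integer :: status(MPI_STATUS_SIZE)
  character(len=32) :: argv(1)

  call MPI_INIT(ierr)

  argv(1) = ' '                             ! blank entry = empty argument list in this sketch
  call MPI_INFO_CREATE(info, ierr)

  ! Each parent spawns exactly one child over MPI_COMM_SELF,
  ! so every spawn is independent of the other parents.
  call MPI_COMM_SPAWN('MUSIC', argv, 1, info, 0, &
          MPI_COMM_SELF, MPI_COMM_CHILD, MPI_ERRCODES_IGNORE, ierr)

  ! Block until the child signals completion (assumed: one integer, tag 0).
  call MPI_RECV(child_status, 1, MPI_INTEGER, 0, 0, MPI_COMM_CHILD, status, ierr)

  ! ... work with the child's output here ...

  call MPI_COMM_DISCONNECT(MPI_COMM_CHILD, ierr)
  call MPI_INFO_FREE(info, ierr)
  call MPI_FINALIZE(ierr)
end program parent_sketch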

I have tested the code and it works for 8 processes (8 parents + 8 children = 16 total) spread over 2 nodes. I'm now trying to scale up to 128 processes spread over 32 nodes, but all of the child processes appear to be hanging at MPI_Init(). Running top on the nodes, I can see that they (the correct number of them) are running, so they have been spawned, but they are not progressing through the program.
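
For reference, the child side of the handshake is essentially the following (the actual child is a C++ program, but the MPI calls are equivalent; I've sketched it in Fortran to match the snippet above, with the message contents and tag again placeholders). The hang seems to be inside MPI_Init itself, before MPI_Comm_get_parent or any communication with the parent is reached.

program child_sketch
  use mpi
  implicit none

  integer :: ierr, parent_comm, done

  call MPI_INIT(ierr)                              ! <-- apparent hang point at 128 processes
  call MPI_COMM_GET_PARENT(parent_comm, ierr)

  ! ... the child's actual work would happen here ...

  done = 0
  call MPI_SEND(done, 1, MPI_INTEGER, 0, 0, parent_comm, ierr)  ! signal completion to the parent

  call MPI_COMM_DISCONNECT(parent_comm, ierr)
  call MPI_FINALIZE(ierr)
end program child_sketch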

Here is the tail of stdout with I_MPI_DEBUG=10:

[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx5_0-1u
[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx5_0-1u
[0] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[0] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[0] MPI startup(): DAPL provider ofa-v2-mlx5_0-1u
[0] MPI startup(): DAPL provider ofa-v2-mlx5_0-1u
[0] MPI startup(): shm and dapl data transfer modes
[0] MPI startup(): shm and dapl data transfer modes
[0] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[0] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[0] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[0] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[0] MPI startup(): DAPL provider ofa-v2-mlx5_0-1u
[0] MPI startup(): DAPL provider ofa-v2-mlx5_0-1u
[0] MPI startup(): shm and dapl data transfer modes
[0] MPI startup(): shm and dapl data transfer modes
[0] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[0] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[0] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[0] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000

 

This is what suggests to me that the children are hanging either at startup or in MPI_Init(): these are some, but not all, of the "MPI startup():" messages they should produce on a successful startup. Inspecting the successful startups, after the above there should be some messages about the cores on each node, and then:

[0] MPI startup(): I_MPI_INFO_CACHE3=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,\
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
[0] MPI startup(): I_MPI_INFO_CACHES=3
[0] MPI startup(): I_MPI_INFO_CACHE_SHARE=2,2,64
[0] MPI startup(): I_MPI_INFO_CACHE_SIZE=32768,1048576,28835840
[0] MPI startup(): I_MPI_INFO_CORE=0,1,2,3,4,8,9,10,11,12,16,17,18,19,20,24,25,26,27,28,0,1,2,3,4,8,9,10,11,12,16\
,17,18,19,20,24,25,26,27,28,0,1,2,3,4,8,9,10,11,12,16,17,18,19,20,24,25,26,27,28,0,1,2,3,4,8,9,10,11,12,16,17,18,\
19,20,24,25,26,27,28
[0] MPI startup(): I_MPI_INFO_C_NAME=Unknown
[0] MPI startup(): I_MPI_INFO_DESC=1342177280
[0] MPI startup(): I_MPI_INFO_FLGB=-744488965
[0] MPI startup(): I_MPI_INFO_FLGC=2147417079
[0] MPI startup(): I_MPI_INFO_FLGCEXT=8
[0] MPI startup(): I_MPI_INFO_FLGD=-1075053569
[0] MPI startup(): I_MPI_INFO_FLGDEXT=201326592
[0] MPI startup(): I_MPI_INFO_LCPU=80
[0] MPI startup(): I_MPI_INFO_MODE=775
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_MAP=mlx5_0:0
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
[0] MPI startup(): I_MPI_INFO_PACK=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,\
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
[0] MPI startup(): I_MPI_INFO_SIGN=329300
[0] MPI startup(): I_MPI_INFO_STATE=0
[0] MPI startup(): I_MPI_INFO_THREAD=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,\
0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
[0] MPI startup(): I_MPI_INFO_VEND=1
[0] MPI startup(): I_MPI_PIN_INFO=x0,1,2,3,4,5,6,7,8,9,40,41,42,43,44,45,46,47,48,49
[0] MPI startup(): I_MPI_PIN_MAPPING=4:0 0,1 10,2 20,3 30

 

which are the last messages produced by the successful startup of the parent processes (and similarly by the children in the 8-process case).

There is another thread, https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/699592, where a large number of child processes caused trouble on Windows, and they had some success after switching impi.dll to the debug version, although they were seeing an outright crash rather than a hang.

Any help or suggestions on how to debug this would be greatly appreciated.

 

 
