Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

mpi_spawn_multiple will only start 14 processes - invalid argument if more

Dave17
Beginner

A simple program will only spawn 14 processes. It fails with 15 or more, even though there are plenty of CPUs available.

 

[kellerd@ganon017 spawn_reproducer_23]$ env | grep MPI
I_MPI_SPAWN=on
VT_MPI=impi4
I_MPI_DEBUG=10
I_MPI_JOB_STARTUP_TIMEOUT=60
I_MPI_HYDRA_BOOTSTRAP=ssh
I_MPI_ROOT=/cm/shared/software/intel/2023.2/mpi/2021.10.0

 

cat /proc/cpuinfo

processor : 47
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
stepping : 7
microcode : 0x5003303
cpu MHz : 1855.334
cache size : 36608 KB
physical id : 1
siblings : 24
core id : 27
cpu cores : 24

...

[kellerd@ganon017 spawn_reproducer_23]$ mpiifort -v
mpiifort for the Intel(R) MPI Library 2021.10 for Linux*
Copyright Intel Corporation.
ifort version 2021.10.0

 

export I_MPI_SPAWN=on
export I_MPI_DEBUG=10
mpiifort -g bug.f90

mpirun -np 1 ./a.out 14

[0] MPI startup(): Intel(R) MPI Library, Version 2021.10 Build 20230619 (id: c2e19c2f3e)
[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): shm segment size (405 MB per rank) * (14 local ranks) = 5681 MB total
[0] MPI startup(): libfabric loaded: libfabric.so.1
[0] MPI startup(): libfabric version: 1.18.0-impi
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): File "/cm/shared/software/intel/2023.2/mpi/2021.10.0/etc/tuning_clx-ap_shm-ofi_mlx_100.dat" not found
[0] MPI startup(): Load tuning file: "/cm/shared/software/intel/2023.2/mpi/2021.10.0/etc/tuning_clx-ap_shm-ofi.dat"
[0] MPI startup(): threading: mode: direct
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: app_threads: -1
[0] MPI startup(): threading: runtime: generic
[0] MPI startup(): threading: progress_threads: 0
[0] MPI startup(): threading: async_progress: 0
[0] MPI startup(): threading: lock_level: global
[0] MPI startup(): tag bits available: 20 (TAG_UB value: 1048575)
[0] MPI startup(): source bits available: 21 (Maximal number of rank: 2097151)
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 43094 ganon017 {2,6,10}
[0] MPI startup(): 1 43095 ganon017 {14,18,22}
[0] MPI startup(): 2 43096 ganon017 {26,30,34}
[0] MPI startup(): 3 43097 ganon017 {38,42,46}
[0] MPI startup(): 4 43098 ganon017 {0,4,8}
[0] MPI startup(): 5 43099 ganon017 {12,16,20}
[0] MPI startup(): 6 43100 ganon017 {24,28,32}
[0] MPI startup(): 7 43101 ganon017 {36,40,44}
[0] MPI startup(): 8 43102 ganon017 {1,5,9}
[0] MPI startup(): 9 43103 ganon017 {13,17,21}
[0] MPI startup(): 10 43104 ganon017 {25,29,33}
[0] MPI startup(): 11 43105 ganon017 {3,7,11}
[0] MPI startup(): 12 43106 ganon017 {15,19,23}
[0] MPI startup(): 13 43107 ganon017 {27,31,35}
[0] MPI startup(): I_MPI_ROOT=/cm/shared/software/intel/2023.2/mpi/2021.10.0
[0] MPI startup(): ONEAPI_ROOT=/cm/shared/software/intel/2023.2
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_BIND_WIN_ALLOCATE=localalloc
[0] MPI startup(): I_MPI_JOB_STARTUP_TIMEOUT=60
[0] MPI startup(): I_MPI_HYDRA_RMK=pbs
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_HYDRA_BOOTSTRAP=ssh
[0] MPI startup(): I_MPI_RETURN_WIN_MEM_NUMA=2
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=10
[0] MPI startup(): I_MPI_SPAWN=on

nspawn= 14

errcodes= 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I'm a spawned process on ganon017 global_rank=005 node_rank=000
I'm a spawned process on ganon017 global_rank=009 node_rank=000
I'm a spawned process on ganon017 global_rank=000 node_rank=000
I'm a spawned process on ganon017 global_rank=002 node_rank=000
I'm a spawned process on ganon017 global_rank=008 node_rank=000
I'm a spawned process on ganon017 global_rank=010 node_rank=000
I'm a spawned process on ganon017 global_rank=001 node_rank=000
I'm a spawned process on ganon017 global_rank=004 node_rank=000
I'm a spawned process on ganon017 global_rank=011 node_rank=000
I'm a spawned process on ganon017 global_rank=003 node_rank=000
I'm a spawned process on ganon017 global_rank=006 node_rank=000
I'm a spawned process on ganon017 global_rank=007 node_rank=000
I'm a spawned process on ganon017 global_rank=012 node_rank=000
I'm a spawned process on ganon017 global_rank=013 node_rank=000

 

[kellerd@ganon017 spawn_reproducer_23]$ mpirun -np 1 ./a.out 15
[0] MPI startup(): Intel(R) MPI Library, Version 2021.10 Build 20230619 (id: c2e19c2f3e)
[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric loaded: libfabric.so.1
[0] MPI startup(): libfabric version: 1.18.0-impi
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): File "/cm/shared/software/intel/2023.2/mpi/2021.10.0/etc/tuning_clx-ap_shm-ofi_mlx_100.dat" not found
[0] MPI startup(): Load tuning file: "/cm/shared/software/intel/2023.2/mpi/2021.10.0/etc/tuning_clx-ap_shm-ofi.dat"
[0] MPI startup(): threading: mode: direct
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: app_threads: -1
[0] MPI startup(): threading: runtime: generic
[0] MPI startup(): threading: progress_threads: 0
[0] MPI startup(): threading: async_progress: 0
[0] MPI startup(): threading: lock_level: global
[0] MPI startup(): tag bits available: 20 (TAG_UB value: 1048575)
[0] MPI startup(): source bits available: 21 (Maximal number of rank: 2097151)
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 43336 ganon017 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47}
[0] MPI startup(): I_MPI_ROOT=/cm/shared/software/intel/2023.2/mpi/2021.10.0
[0] MPI startup(): ONEAPI_ROOT=/cm/shared/software/intel/2023.2
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_BIND_WIN_ALLOCATE=localalloc
[0] MPI startup(): I_MPI_JOB_STARTUP_TIMEOUT=60
[0] MPI startup(): I_MPI_HYDRA_RMK=pbs
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_HYDRA_BOOTSTRAP=ssh
[0] MPI startup(): I_MPI_RETURN_WIN_MEM_NUMA=0
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=10
[0] MPI startup(): I_MPI_SPAWN=on
nspawn= 15
Abort(537481996) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Comm_spawn_multiple: Invalid argument, error stack:
PMPI_Comm_spawn_multiple(158): MPI_Comm_spawn_multiple(count=15, cmds=0x7f2000ae4c40, argvs=(nil), maxprocs=0x498a40, infos=0x550bc0, root=0, MPI_COMM_WORLD, intercomm=0x7fffffff7d24, errors=0x550de0) failed
PMPI_Comm_spawn_multiple(115): Invalid value for array_of_maxprocs[i], must be non-negative but is -1

 

 

 

1 Solution
Rafael_L_Intel
Employee

Hello Dave17,

 

There seems to be a misunderstanding about how MPI_Comm_spawn_multiple is used.

It spawns array_of_maxprocs[i] ranks for the command in array_of_commands[i], so these two arrays must have the same length. In your example, you should allocate(np(nspawn)) and then set np(:)=1 to get the behaviour you seem to be expecting. Alternatively, you could work with nspawn=1 and np(1)=15, or simply use MPI_Comm_spawn instead.

 

Let me know if that answers your question!

Cheers,

Rafael

 

 

7 Replies
ShivaniK_Intel
Moderator

Hi,


Thanks for posting in the Intel forums.


We are able to reproduce the issue at our end. We are working on it and will get back to you soon.


Thanks & Regards

Shivani


Rafael_L_Intel
Employee

Hello Dave17,

 

There seems to be a misunderstanding about how MPI_Comm_spawn_multiple is used.

It spawns array_of_maxprocs[i] ranks for the command in array_of_commands[i], so these two arrays must have the same length. In your example, you should allocate(np(nspawn)) and then set np(:)=1 to get the behaviour you seem to be expecting. Alternatively, you could work with nspawn=1 and np(1)=15, or simply use MPI_Comm_spawn instead.
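A minimal sketch of that fix (assuming a hypothetical worker binary ./worker; the original bug.f90 is not shown, so the surrounding program structure is illustrative only):

```fortran
program spawn_many
   use mpi
   implicit none
   integer, parameter :: nspawn = 15
   character(len=32) :: cmds(nspawn)
   integer :: np(nspawn), infos(nspawn), errcodes(nspawn)
   integer :: intercomm, ierr

   call MPI_Init(ierr)

   ! One command slot and one maxprocs slot per spawn entry:
   ! cmds, np, and infos must all have length count (= nspawn).
   cmds(:)  = './worker'        ! hypothetical worker binary
   np(:)    = 1                 ! spawn exactly one rank per entry
   infos(:) = MPI_INFO_NULL

   call MPI_Comm_spawn_multiple(nspawn, cmds, MPI_ARGVS_NULL, np, infos, &
                                0, MPI_COMM_WORLD, intercomm, errcodes, ierr)

   call MPI_Finalize(ierr)
end program spawn_many
```

Allocating np with fewer than nspawn elements, as in the failing run, makes the library read past the end of the array, which is where the reported maxprocs value of -1 comes from.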

 

Let me know if that answers your question!

Cheers,

Rafael

 

 

Dave17
Beginner

Hi Rafael,

 

As a follow-up to this thread - 

 

If I have two nodes, and mpirun assigns one 'mother' process to each node,

and I want 10 spawned processes running on each node: I find that each 'mother' process must call MPI_Comm_spawn_multiple with nspawn=20. Is there a chance I could end up with one node running 15 spawned processes and the other only 5?

 

Dave

 

 

Rafael_L_Intel
Employee

Hi Dave17,

 

You can create an MPI_Info object and set the key "hosts" or "hostfile", just as you would pass it to mpirun. That tells MPI_Comm_spawn where the processes will be spawned.
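A sketch of that approach (the node names are placeholders, and the exact value syntax for the "hosts" key is assumed to mirror mpirun's host-list option; check the Intel MPI Developer Reference for the supported format):

```fortran
program spawn_on_hosts
   use mpi
   implicit none
   integer :: info, intercomm, ierr
   integer :: errcodes(20)

   call MPI_Init(ierr)

   ! Tell the library where the spawned ranks should land,
   ! just as the equivalent host list passed to mpirun would.
   call MPI_Info_create(info, ierr)
   call MPI_Info_set(info, 'hosts', 'node01,node02', ierr)   ! hypothetical node names

   call MPI_Comm_spawn('./worker', MPI_ARGV_NULL, 20, info, 0, &
                       MPI_COMM_WORLD, intercomm, errcodes, ierr)

   call MPI_Info_free(info, ierr)
   call MPI_Finalize(ierr)
end program spawn_on_hosts
```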

 

Cheers!

Rafael

Dave17
Beginner

What I found confusing, and what perhaps led to my misunderstanding of the usage, is that the parameter 'maxprocs' defines the total number of spawned processes, as opposed to the number to be spawned by each individual calling process.

If, for instance, our application requires each process initiated on two separate nodes to spawn 3 worker processes, then the call to spawn must have maxprocs=6 for each of them.

I tested your suggestion for 'spawn' and 'spawn_multiple' and both worked. Since we only spawn identical processes we will be using 'spawn'.
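That total-count semantics can be sketched as follows (worker binary name is a placeholder):

```fortran
program spawn_total
   use mpi
   implicit none
   integer, parameter :: total_workers = 6   ! 3 workers per node x 2 nodes
   integer :: intercomm, ierr
   integer :: errcodes(total_workers)

   call MPI_Init(ierr)

   ! MPI_Comm_spawn is collective over the parent communicator:
   ! every 'mother' rank passes the same maxprocs, and that value
   ! is the TOTAL number of workers spawned, not a per-caller count.
   call MPI_Comm_spawn('./worker', MPI_ARGV_NULL, total_workers, MPI_INFO_NULL, &
                       0, MPI_COMM_WORLD, intercomm, errcodes, ierr)

   call MPI_Finalize(ierr)
end program spawn_total
```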

Thanks for the help and quick response.

Rafael_L_Intel
Employee

I'm glad to help!

 

Cheers,

Rafael

ShivaniK_Intel
Moderator

Hi,


Thanks for accepting our solution. If you need any further information please post a new question as this thread will no longer be monitored by Intel.


Thanks & Regards

Shivani

