MPI_Comm_spawn_multiple works fine when two main processes each spawn 5 child processes, but fails with a memory error when 6 or more are spawned. A similar issue in a separate post fails when more than 14 processes are started by a single MPI main process. In this case, the memory corruption occurs when two MPI main processes each attempt to spawn more than 5 processes.
[kellerd@ganon017 spawn_reproducer_23]$ env|grep MPI
I_MPI_SPAWN=on
VT_MPI=impi4
I_MPI_DEBUG=10
I_MPI_JOB_STARTUP_TIMEOUT=60
I_MPI_HYDRA_BOOTSTRAP=ssh
I_MPI_ROOT=/cm/shared/software/intel/2023.2/mpi/2021.10.0
[kellerd@ganon017 spawn_reproducer_23]$ mpiifort -v
mpiifort for the Intel(R) MPI Library 2021.10 for Linux*
Copyright Intel Corporation.
ifort version 2021.10.0
Good run:
mpiifort -g -O0 spawnany.f90
mpirun -np 2 ./a.out 5
nspawn= 5
nspawn= 5
[0] MPI startup(): Intel(R) MPI Library, Version 2021.10 Build 20230619 (id: c2e19c2f3e)
[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): shm segment size (522 MB per rank) * (10 local ranks) = 5229 MB total
[0] MPI startup(): libfabric loaded: libfabric.so.1
[0] MPI startup(): libfabric version: 1.18.0-impi
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): File "/cm/shared/software/intel/2023.2/mpi/2021.10.0/etc/tuning_clx-ap_shm-ofi_mlx_100.dat" not found
[0] MPI startup(): Load tuning file: "/cm/shared/software/intel/2023.2/mpi/2021.10.0/etc/tuning_clx-ap_shm-ofi.dat"
[0] MPI startup(): threading: mode: direct
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: app_threads: -1
[0] MPI startup(): threading: runtime: generic
[0] MPI startup(): threading: progress_threads: 0
[0] MPI startup(): threading: async_progress: 0
[0] MPI startup(): threading: lock_level: global
[0] MPI startup(): tag bits available: 20 (TAG_UB value: 1048575)
[0] MPI startup(): source bits available: 21 (Maximal number of rank: 2097151)
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 45776 ganon017 {2,6,10,14}
[0] MPI startup(): 1 45777 ganon017 {18,22,26,30}
[0] MPI startup(): 2 45778 ganon017 {34,38,42,46}
[0] MPI startup(): 3 45779 ganon017 {0,4,8,12}
[0] MPI startup(): 4 45780 ganon017 {16,20,24,28}
[0] MPI startup(): 5 45781 ganon017 {32,36,40,44}
[0] MPI startup(): 6 45782 ganon017 {1,5,9,13}
[0] MPI startup(): 7 45783 ganon017 {17,21,25,29}
[0] MPI startup(): 8 45784 ganon017 {3,7,11,15}
[0] MPI startup(): 9 45785 ganon017 {19,23,27,31}
[0] MPI startup(): I_MPI_ROOT=/cm/shared/software/intel/2023.2/mpi/2021.10.0
[0] MPI startup(): ONEAPI_ROOT=/cm/shared/software/intel/2023.2
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_BIND_WIN_ALLOCATE=localalloc
[0] MPI startup(): I_MPI_JOB_STARTUP_TIMEOUT=60
[0] MPI startup(): I_MPI_HYDRA_RMK=pbs
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_HYDRA_BOOTSTRAP=ssh
[0] MPI startup(): I_MPI_RETURN_WIN_MEM_NUMA=2
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=10
[0] MPI startup(): I_MPI_SPAWN=on
errcodes= 0 0 0 0 0
errcodes= 0 0 0 0 0
I'm a spawned process on ganon017 global_rank=000 node_rank=000
I'm a spawned process on ganon017 global_rank=001 node_rank=001
I'm a spawned process on ganon017 global_rank=002 node_rank=002
I'm a spawned process on ganon017 global_rank=003 node_rank=003
I'm a spawned process on ganon017 global_rank=004 node_rank=004
I'm a spawned process on ganon017 global_rank=005 node_rank=005
I'm a spawned process on ganon017 global_rank=006 node_rank=006
I'm a spawned process on ganon017 global_rank=007 node_rank=007
I'm a spawned process on ganon017 global_rank=008 node_rank=008
I'm a spawned process on ganon017 global_rank=009 node_rank=009
Bad run:
[kellerd@ganon017 spawn_reproducer_23]$ mpirun -np 2 ./a.out 6
[0] MPI startup(): Intel(R) MPI Library, Version 2021.10 Build 20230619 (id: c2e19c2f3e)
[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): shm segment size (1211 MB per rank) * (2 local ranks) = 2422 MB total
[0] MPI startup(): libfabric loaded: libfabric.so.1
[0] MPI startup(): libfabric version: 1.18.0-impi
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): File "/cm/shared/software/intel/2023.2/mpi/2021.10.0/etc/tuning_clx-ap_shm-ofi_mlx_100.dat" not found
[0] MPI startup(): Load tuning file: "/cm/shared/software/intel/2023.2/mpi/2021.10.0/etc/tuning_clx-ap_shm-ofi.dat"
[0] MPI startup(): threading: mode: direct
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: app_threads: -1
[0] MPI startup(): threading: runtime: generic
[0] MPI startup(): threading: progress_threads: 0
[0] MPI startup(): threading: async_progress: 0
[0] MPI startup(): threading: lock_level: global
[0] MPI startup(): tag bits available: 20 (TAG_UB value: 1048575)
[0] MPI startup(): source bits available: 21 (Maximal number of rank: 2097151)
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 45892 ganon017 {0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46}
[0] MPI startup(): 1 45893 ganon017 {1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47}
[0] MPI startup(): I_MPI_ROOT=/cm/shared/software/intel/2023.2/mpi/2021.10.0
[0] MPI startup(): ONEAPI_ROOT=/cm/shared/software/intel/2023.2
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_BIND_WIN_ALLOCATE=localalloc
[0] MPI startup(): I_MPI_JOB_STARTUP_TIMEOUT=60
[0] MPI startup(): I_MPI_HYDRA_RMK=pbs
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_HYDRA_BOOTSTRAP=ssh
[0] MPI startup(): I_MPI_RETURN_WIN_MEM_NUMA=0
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=10
[0] MPI startup(): I_MPI_SPAWN=on
nspawn= 6
nspawn= 6
*** Error in `./a.out': malloc(): memory corruption: 0x0000000000550fd0 ***
Hi,
Thanks for posting in the Intel forums.
Could you please provide us with the sample reproducer code to investigate the issue at our end?
Could you also please provide us with the OS details?
Thanks & Regards
Shivani
[kellerd@ganon017 spawn_reproducer_23]$ uname -a
Linux ganon017 3.10.0-1160.88.1.el7.x86_64 #1 SMP Sat Feb 18 13:27:00 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
[kellerd@ganon017 spawn_reproducer_23]$ cat spawnany.f90
program test
  implicit none
  include "mpif.h"
  character(len=MPI_MAX_PROCESSOR_NAME) :: node_name
  integer :: node_name_len, global_rank, node_rank
  integer :: parent_comm, node_comm, spawn_comm
  character*25, allocatable, dimension(:) :: cmds
  integer, allocatable, dimension(:) :: errcodes, infos
  integer :: ierr, nspawn, np(2)
  character(len=255) :: cmd, nspawnstr

  call MPI_Init(ierr)
  !cmds(:)="a.out"
  !np(:)=3
  !infos(:)=MPI_INFO_NULL
  call MPI_Comm_rank(MPI_COMM_WORLD, global_rank, ierr)
  call MPI_Get_processor_name(node_name, node_name_len, ierr)
  call MPI_Comm_get_parent(parent_comm, ierr)
  if (parent_comm == MPI_COMM_NULL) then
    ! Set up for spawn_multiple
    call getarg(0, cmd)
    call getarg(1, nspawnstr)
    read(nspawnstr,*) nspawn
    print*, "nspawn=", nspawn
    allocate(infos(nspawn))
    allocate(cmds(nspawn))
    allocate(errcodes(nspawn))
    infos(:) = MPI_INFO_NULL
    np(:) = nspawn
    cmds(:) = cmd
    call MPI_Comm_spawn_multiple(nspawn, cmds, MPI_ARGVS_NULL, np, infos, 0, MPI_COMM_WORLD, &
                                 spawn_comm, errcodes, ierr)
    print*, "errcodes=", errcodes
    call check_err(ierr, 'mpi_comm_rank')
    ! call sleep(50)
  else !/*spawned*/
    call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, node_comm, ierr)
    call MPI_Comm_rank(node_comm, node_rank, ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, global_rank, ierr)
    call MPI_Get_processor_name(node_name, node_name_len, ierr)
    write(*,'(a,a,a,i3.3,a,i3.3)') "I'm a spawned process on ", trim(node_name), &
      " global_rank=", global_rank, " node_rank=", node_rank
  endif
  call MPI_Finalize(ierr)
contains
  subroutine check_err(ierr, str)
    integer :: ierr
    character(LEN=*) :: str
    if (ierr /= 0) then
      write(*,*) 'Error ', ierr, str, ' on ', global_rank
      stop
    end if
  end subroutine check_err
end program test
Hi,
We are able to reproduce the issue at our end. We are working on it and will get back to you soon.
Thanks & Regards
Shivani
Hi,
There seems to be a misunderstanding concerning the usage of MPI_Comm_spawn_multiple.
The cmds, info, and np arrays must all have the same size as the count argument.
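For illustration, a sketch of the corrected allocations in the reproducer (untested against your exact setup): np is made allocatable and sized to nspawn so it matches cmds and infos, and, per the MPI standard, errcodes is sized to the total number of spawned processes, i.e. the sum of the np entries, rather than to nspawn.

```fortran
! Sketch: np must have one entry per command (nspawn entries, matching
! cmds and infos), and errcodes needs one slot per spawned process.
integer, allocatable :: np(:), errcodes(:), infos(:)
character(len=25), allocatable :: cmds(:)

allocate(np(nspawn), cmds(nspawn), infos(nspawn))
np(:)    = nspawn          ! each command spawns nspawn copies
cmds(:)  = cmd
infos(:) = MPI_INFO_NULL
allocate(errcodes(sum(np)))  ! total spawned processes, not nspawn

call MPI_Comm_spawn_multiple(nspawn, cmds, MPI_ARGVS_NULL, np, infos, 0, &
                             MPI_COMM_WORLD, spawn_comm, errcodes, ierr)
```

In the original code np was a fixed integer array of size 2, so any nspawn greater than 2 makes MPI_Comm_spawn_multiple read past the end of np, which is consistent with the malloc corruption you observed.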
If this helps to resolve your issue, please accept it as a solution.
Thanks & Regards
Shivani
Hi,
As we did not hear back from you, could you please let us know whether your issue is resolved?
Thanks & Regards
Shivani
Hi,
Since we didn't hear back from you, we assume that your issue is resolved. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.
Thanks & Regards
Shivani