Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPI_Comm_spawn_multiple malloc(): memory corruption with more than 5 spawned processes

Dave17
Beginner

MPI_Comm_spawn_multiple works fine when two main processes spawn 5 processes, but it fails with a memory error when 6 or more are spawned. A similar issue described in a separate post fails when more than 14 processes are started by a single MPI main process. In this case, the memory corruption occurs when two MPI main processes attempt to spawn more than 5 processes.

 

[kellerd@ganon017 spawn_reproducer_23]$ env|grep MPI
I_MPI_SPAWN=on
VT_MPI=impi4
I_MPI_DEBUG=10
I_MPI_JOB_STARTUP_TIMEOUT=60
I_MPI_HYDRA_BOOTSTRAP=ssh
I_MPI_ROOT=/cm/shared/software/intel/2023.2/mpi/2021.10.0

[kellerd@ganon017 spawn_reproducer_23]$ mpiifort -v
mpiifort for the Intel(R) MPI Library 2021.10 for Linux*
Copyright Intel Corporation.
ifort version 2021.10.0

 

Good run:

 

mpiifort -g -O0 spawnany.f90

mpirun -np 2 ./a.out 5

nspawn= 5
nspawn= 5
[0] MPI startup(): Intel(R) MPI Library, Version 2021.10 Build 20230619 (id: c2e19c2f3e)
[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): shm segment size (522 MB per rank) * (10 local ranks) = 5229 MB total
[0] MPI startup(): libfabric loaded: libfabric.so.1
[0] MPI startup(): libfabric version: 1.18.0-impi
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): File "/cm/shared/software/intel/2023.2/mpi/2021.10.0/etc/tuning_clx-ap_shm-ofi_mlx_100.dat" not found
[0] MPI startup(): Load tuning file: "/cm/shared/software/intel/2023.2/mpi/2021.10.0/etc/tuning_clx-ap_shm-ofi.dat"
[0] MPI startup(): threading: mode: direct
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: app_threads: -1
[0] MPI startup(): threading: runtime: generic
[0] MPI startup(): threading: progress_threads: 0
[0] MPI startup(): threading: async_progress: 0
[0] MPI startup(): threading: lock_level: global
[0] MPI startup(): tag bits available: 20 (TAG_UB value: 1048575)
[0] MPI startup(): source bits available: 21 (Maximal number of rank: 2097151)
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 45776 ganon017 {2,6,10,14}
[0] MPI startup(): 1 45777 ganon017 {18,22,26,30}
[0] MPI startup(): 2 45778 ganon017 {34,38,42,46}
[0] MPI startup(): 3 45779 ganon017 {0,4,8,12}
[0] MPI startup(): 4 45780 ganon017 {16,20,24,28}
[0] MPI startup(): 5 45781 ganon017 {32,36,40,44}
[0] MPI startup(): 6 45782 ganon017 {1,5,9,13}
[0] MPI startup(): 7 45783 ganon017 {17,21,25,29}
[0] MPI startup(): 8 45784 ganon017 {3,7,11,15}
[0] MPI startup(): 9 45785 ganon017 {19,23,27,31}
[0] MPI startup(): I_MPI_ROOT=/cm/shared/software/intel/2023.2/mpi/2021.10.0
[0] MPI startup(): ONEAPI_ROOT=/cm/shared/software/intel/2023.2
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_BIND_WIN_ALLOCATE=localalloc
[0] MPI startup(): I_MPI_JOB_STARTUP_TIMEOUT=60
[0] MPI startup(): I_MPI_HYDRA_RMK=pbs
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_HYDRA_BOOTSTRAP=ssh
[0] MPI startup(): I_MPI_RETURN_WIN_MEM_NUMA=2
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=10
[0] MPI startup(): I_MPI_SPAWN=on
errcodes= 0 0 0 0 0
errcodes= 0 0 0 0 0

I'm a spawned process on ganon017 global_rank=000 node_rank=000
I'm a spawned process on ganon017 global_rank=001 node_rank=001
I'm a spawned process on ganon017 global_rank=002 node_rank=002
I'm a spawned process on ganon017 global_rank=003 node_rank=003
I'm a spawned process on ganon017 global_rank=004 node_rank=004
I'm a spawned process on ganon017 global_rank=005 node_rank=005
I'm a spawned process on ganon017 global_rank=006 node_rank=006
I'm a spawned process on ganon017 global_rank=007 node_rank=007
I'm a spawned process on ganon017 global_rank=008 node_rank=008
I'm a spawned process on ganon017 global_rank=009 node_rank=009

 

Bad run:

 

[kellerd@ganon017 spawn_reproducer_23]$ mpirun -np 2 ./a.out 6
[0] MPI startup(): Intel(R) MPI Library, Version 2021.10 Build 20230619 (id: c2e19c2f3e)
[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): shm segment size (1211 MB per rank) * (2 local ranks) = 2422 MB total
[0] MPI startup(): libfabric loaded: libfabric.so.1
[0] MPI startup(): libfabric version: 1.18.0-impi
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): File "/cm/shared/software/intel/2023.2/mpi/2021.10.0/etc/tuning_clx-ap_shm-ofi_mlx_100.dat" not found
[0] MPI startup(): Load tuning file: "/cm/shared/software/intel/2023.2/mpi/2021.10.0/etc/tuning_clx-ap_shm-ofi.dat"
[0] MPI startup(): threading: mode: direct
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: app_threads: -1
[0] MPI startup(): threading: runtime: generic
[0] MPI startup(): threading: progress_threads: 0
[0] MPI startup(): threading: async_progress: 0
[0] MPI startup(): threading: lock_level: global
[0] MPI startup(): tag bits available: 20 (TAG_UB value: 1048575)
[0] MPI startup(): source bits available: 21 (Maximal number of rank: 2097151)
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 45892 ganon017 {0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46}
[0] MPI startup(): 1 45893 ganon017 {1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47}
[0] MPI startup(): I_MPI_ROOT=/cm/shared/software/intel/2023.2/mpi/2021.10.0
[0] MPI startup(): ONEAPI_ROOT=/cm/shared/software/intel/2023.2
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_BIND_WIN_ALLOCATE=localalloc
[0] MPI startup(): I_MPI_JOB_STARTUP_TIMEOUT=60
[0] MPI startup(): I_MPI_HYDRA_RMK=pbs
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_HYDRA_BOOTSTRAP=ssh
[0] MPI startup(): I_MPI_RETURN_WIN_MEM_NUMA=0
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=10
[0] MPI startup(): I_MPI_SPAWN=on
nspawn= 6
nspawn= 6
*** Error in `./a.out': malloc(): memory corruption: 0x0000000000550fd0 ***
Dave17
Beginner
The source code was provided.
Dave17
Beginner
Can others see the file I sent when I set up the original post?
ShivaniK_Intel
Moderator

Hi,


Thanks for posting in the Intel forums.


Could you please provide us with a sample reproducer code so that we can investigate the issue at our end?


Could you also please provide us with the OS details?


Thanks & Regards

Shivani


Dave17
Beginner

[kellerd@ganon017 spawn_reproducer_23]$ uname -a
Linux ganon017 3.10.0-1160.88.1.el7.x86_64 #1 SMP Sat Feb 18 13:27:00 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

[kellerd@ganon017 spawn_reproducer_23]$ cat spawnany.f90
program test
  implicit none
  include "mpif.h"

  character(len=MPI_MAX_PROCESSOR_NAME) :: node_name
  integer :: node_name_len, global_rank, node_rank
  integer :: parent_comm, node_comm, spawn_comm
  character*25, allocatable, dimension(:) :: cmds
  integer, allocatable, dimension(:) :: errcodes, infos
  integer :: ierr, nspawn, np(2)
  character(len=255) :: cmd, nspawnstr

  call MPI_Init(ierr)

  !cmds(:)="a.out"
  !np(:)=3
  !infos(:)=MPI_INFO_NULL

  call MPI_Comm_rank(MPI_COMM_WORLD, global_rank, ierr)
  call MPI_Get_processor_name(node_name, node_name_len, ierr)
  call MPI_Comm_get_parent(parent_comm, ierr)

  if (parent_comm == MPI_COMM_NULL) then
    ! Set up for spawn_multiple
    call getarg(0, cmd)
    call getarg(1, nspawnstr)
    read(nspawnstr,*) nspawn
    print *, "nspawn=", nspawn
    allocate(infos(nspawn))
    allocate(cmds(nspawn))
    allocate(errcodes(nspawn))
    infos(:) = MPI_INFO_NULL
    np(:) = nspawn
    cmds(:) = cmd
    call MPI_Comm_spawn_multiple(nspawn, cmds, MPI_ARGVS_NULL, np, infos, 0, MPI_COMM_WORLD, &
                                 spawn_comm, errcodes, ierr)
    print *, "errcodes=", errcodes
    call check_err(ierr, 'mpi_comm_rank')
    ! call sleep(50)
  else ! spawned
    call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, node_comm, ierr)
    call MPI_Comm_rank(node_comm, node_rank, ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, global_rank, ierr)
    call MPI_Get_processor_name(node_name, node_name_len, ierr)

    write(*,'(a,a,a,i3.3,a,i3.3)') "I'm a spawned process on ", trim(node_name), &
      " global_rank=", global_rank, " node_rank=", node_rank
  endif
  call MPI_Finalize(ierr)

contains
  subroutine check_err(ierr, str)
    integer :: ierr
    character(LEN=*) :: str
    if (ierr /= 0) then
      write(*,*) 'Error ', ierr, str, ' on ', global_rank
      stop
    end if
  end subroutine check_err

end program test

 

ShivaniK_Intel
Moderator

Hi,

 

We are able to reproduce the issue at our end. We are working on it and will get back to you soon.

 

Thanks & Regards

Shivani

 

ShivaniK_Intel
Moderator

Hi,

 

There seems to be a misunderstanding concerning the usage of MPI_Comm_spawn_multiple: the cmds, info, and np arrays must all have the same length, equal to the count argument. In your reproducer, cmds, infos, and errcodes are allocated with nspawn elements, but np is declared with a fixed size of 2, so MPI_Comm_spawn_multiple reads past the end of np whenever nspawn is greater than 2.
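
As a minimal sketch of a consistent call (illustrative only, not taken from the reproducer; the hard-coded nspawn, the choice of np(:) = 1, and the overall program structure are assumptions):

! Sketch: every array passed to MPI_Comm_spawn_multiple has count entries,
! and errcodes has one entry per spawned process (sum of np).
program spawn_sketch
  implicit none
  include "mpif.h"
  integer :: ierr, nspawn, spawn_comm, parent_comm
  character(len=255) :: cmd
  character(len=25), allocatable :: cmds(:)
  integer, allocatable :: np(:), infos(:), errcodes(:)

  call MPI_Init(ierr)
  call MPI_Comm_get_parent(parent_comm, ierr)
  if (parent_comm == MPI_COMM_NULL) then
    nspawn = 6                        ! illustrative count
    call getarg(0, cmd)               ! spawn copies of this executable
    allocate(cmds(nspawn), np(nspawn), infos(nspawn))   ! all of length count
    cmds(:)  = cmd
    np(:)    = 1                      ! one process per command; adjust to your intent
    infos(:) = MPI_INFO_NULL
    allocate(errcodes(sum(np)))       ! one error code per spawned process
    call MPI_Comm_spawn_multiple(nspawn, cmds, MPI_ARGVS_NULL, np, infos, 0, &
                                 MPI_COMM_WORLD, spawn_comm, errcodes, ierr)
    print *, "errcodes =", errcodes
  end if
  call MPI_Finalize(ierr)
end program spawn_sketch

In the reproducer above, the equivalent change would be to allocate np(nspawn) instead of declaring np(2), and to size errcodes to the total number of processes being spawned.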


If this helps to resolve your issue, please accept it as a solution.


Thanks & Regards

Shivani


ShivaniK_Intel
Moderator

Hi,


As we did not hear back from you, could you please let us know whether your issue has been resolved?


Thanks & Regards

Shivani


ShivaniK_Intel
Moderator

Hi,


Since we didn't hear back from you, we assume that your issue is resolved. If you need any additional information, please post a new question, as this thread will no longer be monitored by Intel.


Thanks & Regards

Shivani

