Intel® Fortran Compiler

MPI Fortran program hangs above a certain number of MPI processes

thierrybraconnier

Hello,

I am working on an MPI application that hangs when it is launched with more than 2071 MPI processes. I have managed to reduce it to a small reproducer:

program main

   use mpi
   implicit none
   integer :: ierr, rank

   call mpi_init(ierr)
   call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
   if (rank.eq.0) print *, 'Start'
   call test_func(ierr)
   if (ierr.ne.0) call exit(ierr)
   call mpi_finalize(ierr)
   if (rank.eq.0) print *, 'Stop'

contains

   subroutine test_func(ierr)
      integer, intent(out) :: ierr
      real :: send, recv
      integer :: i, j, status(MPI_STATUS_SIZE), mpi_rank, mpi_size, ires
      ! len=MPI_MAX_PROCESSOR_NAME instead of len=10: mpi_get_processor_name
      ! requires a buffer of at least that length.
      character(len=MPI_MAX_PROCESSOR_NAME) :: procname
      real(kind=8) :: t1, t2

      ierr = 0
      call mpi_comm_size(MPI_COMM_WORLD, mpi_size, ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, mpi_rank, ierr)
      call mpi_get_processor_name(procname, ires, ierr)
      call mpi_barrier(MPI_COMM_WORLD, ierr)
      t1 = mpi_wtime()
      ! Each rank j in turn acts as a root that receives one real from
      ! every other rank; all the other ranks send to the current root.
      do j = 0, mpi_size-1
         if (mpi_rank.eq.j) then
            do i = 0, mpi_size-1
               if (i.eq.j) cycle
               call MPI_RECV(recv, 1, MPI_REAL, i, 0, MPI_COMM_WORLD, status, ierr)
               if (ierr.ne.0) return
               if (i.eq.mpi_size-1) print *, 'Rank ', j, trim(procname), ' done'
            enddo
         else
            call MPI_SEND(send, 1, MPI_REAL, j, 0, MPI_COMM_WORLD, ierr)
            if (ierr.ne.0) return
         endif
      enddo
      call mpi_barrier(MPI_COMM_WORLD, ierr)
      t2 = mpi_wtime()
      if (mpi_rank.eq.0) print *, "time send/recv = ", t2 - t1
   end subroutine test_func

end program main

When I run this program with up to 2071 MPI processes it works, but with 2072 or more it hangs, as if there were a deadlock in the send/recv.
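For reference, the same rotating pattern can also be written with non-blocking receives, so that each root matches incoming messages in arrival order rather than in strict rank order. The subroutine below is only a sketch (the name test_func_nb is hypothetical, and it is meant to sit in the same contains section as test_func); I have not run it at these scales:

   ! Sketch of a non-blocking variant of test_func: each root posts all of
   ! its receives up front and waits for them together. Untested at scale.
   subroutine test_func_nb(ierr)
      integer, intent(out) :: ierr
      real :: send
      real, allocatable :: recv(:)
      integer, allocatable :: req(:)
      integer :: i, j, k, mpi_rank, mpi_size

      ierr = 0
      call mpi_comm_size(MPI_COMM_WORLD, mpi_size, ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, mpi_rank, ierr)
      allocate(recv(mpi_size-1), req(mpi_size-1))
      do j = 0, mpi_size-1
         if (mpi_rank.eq.j) then
            ! Post a receive for every peer, then wait for all of them.
            k = 0
            do i = 0, mpi_size-1
               if (i.eq.j) cycle
               k = k + 1
               call MPI_IRECV(recv(k), 1, MPI_REAL, i, 0, MPI_COMM_WORLD, req(k), ierr)
               if (ierr.ne.0) return
            enddo
            call MPI_WAITALL(mpi_size-1, req, MPI_STATUSES_IGNORE, ierr)
            if (ierr.ne.0) return
         else
            call MPI_SEND(send, 1, MPI_REAL, j, 0, MPI_COMM_WORLD, ierr)
            if (ierr.ne.0) return
         endif
      enddo
      deallocate(recv, req)
   end subroutine test_func_nb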

The output from running the reproducer with I_MPI_DEBUG=5 is:

[0] MPI startup(): Intel(R) MPI Library, Version 2019 Update 9 Build 20200923 (id: abd58e492)
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.10.1-impi
[0] MPI startup(): libfabric provider: verbs;ofi_rxm
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 48487 r30i0n0 {0,24}
...
[0] MPI startup(): 2070 34737 r30i4n14 {18,19,20,42,43,44}
[0] MPI startup(): I_MPI_CC=icc
[0] MPI startup(): I_MPI_CXX=icpc
[0] MPI startup(): I_MPI_FC=ifort
[0] MPI startup(): I_MPI_F90=ifort
[0] MPI startup(): I_MPI_F77=ifort
[0] MPI startup(): I_MPI_ROOT=/data_local/sw/intel/RHEL7/compilers_and_libraries_2020.4.304/linux/mpi
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_HYDRA_RMK=lsf
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_EXTRA_FILESYSTEM=1
[0] MPI startup(): I_MPI_EXTRA_FILESYSTEM_FORCE=lustre
[0] MPI startup(): I_MPI_DEBUG=5

Question 1: Is there an explanation for this behavior?

Note that if I replace the send/recv communication pattern with either a bcast pattern

do j = 0, mpi_size-1
   if (mpi_rank.eq.j) then
      call MPI_BCAST(send, 1, MPI_REAL, j, MPI_COMM_WORLD, ierr)
   else
      call MPI_BCAST(recv, 1, MPI_REAL, j, MPI_COMM_WORLD, ierr)
   endif
   if (ierr.ne.0) return
   print *, 'Rank ', j, trim(procname), ' done'
enddo

or an allgather one 

call MPI_ALLGATHER(MPI_IN_PLACE,0,MPI_DATATYPE_NULL,recv,1,MPI_REAL,MPI_COMM_WORLD,ierr)
print *,'Rank ',mpi_rank,procname,' done '

 

then the program runs (faster, of course), and does so with up to 4000 MPI processes (I did not try more). However, I cannot replace the send/recv pattern in the original application with the bcast or allgather ones.
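For completeness, each inner round is essentially a gather of one real to root j, so a per-root MPI_GATHER would be a third collective formulation. This is only a sketch that I did not benchmark; the gathered buffer is hypothetical and assumed declared as real :: gathered(mpi_size) (its contents are only significant at the root):

do j = 0, mpi_size-1
   ! Every rank, including j itself, contributes one real to root j.
   call MPI_GATHER(send, 1, MPI_REAL, gathered, 1, MPI_REAL, j, MPI_COMM_WORLD, ierr)
   if (ierr.ne.0) return
   if (mpi_rank.eq.j) print *, 'Rank ', j, trim(procname), ' done'
enddo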

Question 2: When I run the original application with 2064 MPI processes (86 nodes with 24 cores each), the memory consumed by MPI buffers is around 60 GB per node, and with 1032 MPI processes (43 nodes with 24 cores each) it is around 30 GB per node. Is there a way (environment variables, ...) to reduce this memory consumption?
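For scale, 60 GB per node spread over 24 local ranks and roughly 2064 peers comes to about 60e9 / (24 x 2064) ≈ 1.2 MB per rank-to-peer connection, consistent with the near-linear growth from 30 GB at 1032 processes to 60 GB at 2064. That points at per-connection buffers in the verbs;ofi_rxm path. The variables below are documented libfabric ofi_rxm knobs (see fi_rxm(7)); the example values are guesses of mine, and their effect on this application is unverified:

# libfabric ofi_rxm knobs; the values are illustrative, not recommendations
export FI_OFI_RXM_USE_SRX=1          # share one receive context across connections
export FI_OFI_RXM_BUFFER_SIZE=16384  # transmit buffer size (eager threshold)
export FI_OFI_RXM_RX_SIZE=512        # receive context size
export FI_OFI_RXM_TX_SIZE=512        # transmit context size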

 

Many thanks in advance for your help

Thierry 

1 Solution
Steve_Lionel
Honored Contributor III

I know it's far from obvious, but Intel MPI is discussed in Intel® oneAPI HPC Toolkit - Intel Community
