Intel® oneAPI HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPI application hangs with a limited number of processes

thierrybraconnier
4,907 Views


Hello,

I am working on an MPI application that hangs when it is launched with more than 2071 MPI processes. I have managed to write a small reproducer:

program main
  use mpi
  integer :: ierr, rank
  call mpi_init(ierr)
  call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
  if (rank.eq.0) print *, 'Start'
  call test_func(ierr)
  if (ierr.ne.0) call exit(ierr)
  call mpi_finalize(ierr)
  if (rank.eq.0) print *, 'Stop'
contains

  subroutine test_func(ierr)
    integer, intent(out) :: ierr
    real :: send, recv
    integer :: i, j, status(MPI_STATUS_SIZE), mpi_rank, mpi_size, ires
    character(len=10) :: procname
    real(kind=8) :: t1, t2

    ierr = 0
    call mpi_comm_size(MPI_COMM_WORLD, mpi_size, ierr)
    call mpi_comm_rank(MPI_COMM_WORLD, mpi_rank, ierr)
    call mpi_get_processor_name(procname, ires, ierr)
    call mpi_barrier(MPI_COMM_WORLD, ierr)
    t1 = mpi_wtime()
    do j = 0, mpi_size-1
      if (mpi_rank.eq.j) then
        do i = 0, mpi_size-1
          if (i.eq.j) cycle
          call MPI_RECV(recv, 1, MPI_REAL, i, 0, MPI_COMM_WORLD, status, ierr)
          if (ierr.ne.0) return
          if (i.eq.mpi_size-1) print *, 'Rank ', j, procname, ' done'
        enddo
      else
        call MPI_SEND(send, 1, MPI_REAL, j, 0, MPI_COMM_WORLD, ierr)
        if (ierr.ne.0) return
      endif
    enddo
    call mpi_barrier(MPI_COMM_WORLD, ierr)
    t2 = mpi_wtime()
    if (mpi_rank.eq.0) print *, "time send/recv = ", t2-t1
  end subroutine test_func
end program main

When I run this program with up to 2071 MPI processes it works, but with 2072 or more processes it hangs as if there were a deadlock in the send/recv.
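
A rough way to quantify the pressure this pattern creates (my own illustration, assuming the 4-byte messages are sent eagerly, which is typical for small messages but not confirmed in this thread): rank k fires off eager sends to ranks 0..k-1 before blocking in its own receive step at j = k, so on the order of N*(N-1)/2 messages can be in flight right after the first barrier. A small Python model of that count:

```python
def in_flight_at_startup(n):
    # In the reproducer, rank k sends eagerly to ranks 0..k-1 and only
    # then blocks in MPI_RECV at its own step j == k, so rank k can have
    # k messages outstanding before any receiver drains them.
    return sum(range(n))  # = n*(n-1)//2

for n in (24, 2071, 2072):
    print(n, in_flight_at_startup(n))
```

With 2072 ranks this is over two million in-flight messages, the kind of load that can exhaust provider resources (here verbs;ofi_rxm) even though each message is only 4 bytes.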

The output from running the program with I_MPI_DEBUG=5 is:
[0] MPI startup(): Intel(R) MPI Library, Version 2019 Update 9 Build 20200923 (id: abd58e492)
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.10.1-impi
[0] MPI startup(): libfabric provider: verbs;ofi_rxm
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 48487 r30i0n0 {0,24}
...
[0] MPI startup(): 2070 34737 r30i4n14 {18,19,20,42,43,44}
[0] MPI startup(): I_MPI_CC=icc
[0] MPI startup(): I_MPI_CXX=icpc
[0] MPI startup(): I_MPI_FC=ifort
[0] MPI startup(): I_MPI_F90=ifort
[0] MPI startup(): I_MPI_F77=ifort
[0] MPI startup(): I_MPI_ROOT=/data_local/sw/intel/RHEL7/compilers_and_libraries_2020.4.304/linux/mpi
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_HYDRA_RMK=lsf
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_EXTRA_FILESYSTEM=1
[0] MPI startup(): I_MPI_EXTRA_FILESYSTEM_FORCE=lustre
[0] MPI startup(): I_MPI_DEBUG=5

Question 1: Is there an explanation for this behavior?

Note that if I replace the send/recv communication pattern with either a bcast pattern
do j=0,mpi_size-1
  if (mpi_rank.eq.j) then
    call MPI_BCAST(send,1,MPI_REAL,j,MPI_COMM_WORLD,ierr)
  else
    call MPI_BCAST(recv,1,MPI_REAL,j,MPI_COMM_WORLD,ierr)  
  endif
  if (ierr.ne.0) return
  print *,'Rank ',j,procname,' done'
enddo

or an allgather one

call MPI_ALLGATHER(MPI_IN_PLACE,0,MPI_DATATYPE_NULL,recv,1,MPI_REAL,MPI_COMM_WORLD,ierr)
print *,'Rank ',mpi_rank,procname,' done '

then the program runs (faster, of course) with up to 4000 MPI processes (I did not try more). Unfortunately, I cannot replace the send/recv communication pattern in the original application with either the bcast or the allgather one.
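
For what it is worth, the collectives likely survive because MPI libraries typically use tree-based algorithms at this scale (typical behavior, though the thread does not confirm which algorithm Intel MPI selects here): a binomial tree bounds each rank to at most one incoming message per round over ceil(log2 N) rounds, whereas in the send/recv loop all N-1 ranks target rank j at once. A quick Python comparison:

```python
import math

def max_concurrent_senders_flat(n):
    # Flat pattern from the reproducer: at step j, all n - 1 other ranks
    # target rank j at the same time.
    return n - 1

def binomial_tree_rounds(n):
    # A binomial-tree broadcast reaches n ranks in ceil(log2(n)) rounds,
    # and each rank receives from at most one peer per round.
    return math.ceil(math.log2(n))

for n in (24, 2072, 4000):
    print(n, max_concurrent_senders_flat(n), binomial_tree_rounds(n))
```

At 2072 ranks that is 2071 simultaneous senders per step for the flat pattern versus 12 rounds of one message each for a binomial tree.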

Question 2: When I run the original application with 2064 MPI processes (86 nodes with 24 cores each), the memory consumed by MPI buffers is around 60 GB per node, and with 1032 MPI processes (43 nodes with 24 cores each) it is around 30 GB per node. Is there a way (environment variables, ...) to reduce this memory consumption?
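
A back-of-the-envelope check on those numbers (my own estimate, under the assumption that buffer memory scales with the number of local-rank-to-remote-peer connections, which the 2x jump from 1032 to 2064 ranks suggests):

```python
def per_connection_bytes(mem_per_node_gb, local_ranks, total_ranks):
    # Hypothetical model: each local rank keeps a buffer per remote peer,
    # so per-node memory ~= local_ranks * (total_ranks - 1)
    # * bytes_per_connection.  Solve for bytes_per_connection.
    connections = local_ranks * (total_ranks - 1)
    return mem_per_node_gb * 2**30 / connections

# Figures from the post: 60 GB/node at 2064 ranks, 30 GB/node at 1032 ranks.
for gb, n in ((60, 2064), (30, 1032)):
    print(n, round(per_connection_bytes(gb, 24, n) / 1024), "KiB per connection")
```

Both cases come out to roughly the same per-connection figure (about 1.2 MiB), which is what you would expect if memory grows linearly with the number of peers. If that is the actual mechanism, the places to look would be provider-level buffer settings (the libfabric fi_rxm(7) man page documents variables such as FI_OFI_RXM_BUFFER_SIZE; I have not verified which ones apply to this 1.10.1-impi build) or the I_MPI_* tuning variables in the Intel MPI Developer Reference.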

Many thanks in advance for your help
Thierry

1 Solution
James_T_Intel
Moderator
3,886 Views

I apologize for the delayed response. I have tested this on 2019 Update 9 (the version of the Intel® MPI Library included in Intel® Parallel Studio XE 2020 Update 4) and on 2021.3. The error appears in 2019 Update 9 and is resolved in 2021.3, where the program works as expected. Please update your version of the Intel® MPI Library to resolve this issue.


If you are unsure of how to upgrade, we have transitioned from Intel® Parallel Studio XE to Intel® oneAPI. Specifically, the Intel® MPI Library is part of Intel® oneAPI HPC Toolkit. Please go to https://software.intel.com/content/www/us/en/develop/tools/oneapi.html#gs.c2n2h8 for overview information about Intel® oneAPI and https://software.intel.com/content/www/us/en/develop/tools/oneapi/hpc-toolkit.html#gs.c2n2xn for information on how to download Intel® oneAPI HPC Toolkit.


As this is resolved with the latest version, I am proceeding with closing this thread for Intel support. Any further posts on this thread will be considered community only.



13 Replies
SantoshY_Intel
Moderator
4,890 Views

Hi,

 

Thanks for reaching out to us.

 

Could you please add the -check_mpi flag and set I_MPI_DEBUG=10 in the command you used to run the code, and share the complete error log with us?

Also, can you provide the output of the command below?

ulimit -a

 

Thanks & Regards,

Santosh

 

thierrybraconnier
4,879 Views

Hello Santosh,

Here are the results of the ulimit -a command

Frontale node :
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1545489
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) 30000000
open files (-n) 32768
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

compute node :
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 513931
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 32768
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 300000
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

The log and err files with the -check_mpi flag and I_MPI_DEBUG=10 for np=2071 (which works) and np=2072 (which hangs) are attached.

Regards,

Thierry


SantoshY_Intel
Moderator
4,813 Views

Hi,

 

We see that the -check_mpi flag didn't work for you. Follow the steps below to make use of the -check_mpi flag.

source <installation path of Parallel Studio>/itac/20XX.XX/bin/itacvars.sh

 or

source <installation path of Parallel Studio>/psxevars.sh

Then check the Intel Cluster Checker version with the command below:

clck --version

If you see the version details, try running the program as in the example below:

I_MPI_DEBUG=10 mpirun -check_mpi -np <total no. of processes> -ppn <no. of processes per node> ./a.out

 

Also, share with us the complete log.

 

Thanks & Regards,

Santosh

 

thierrybraconnier
4,794 Views

Hello Santosh,

 

Thanks for your answer.

Parallel Studio is not installed on my cluster, but ITAC is. I have sourced the itacvars.sh script and launched the command

I_MPI_DEBUG=10 mpirun -check_mpi -np <total no. of processes> -ppn <no. of processes per node> ./a.out

I got 3 different behaviors depending on the number of MPI processes:

  1. everything runs fine (see the attached files test_np532.out and test_np532.err)
  2. the run hangs, but some of the send/recv exchanges complete (see test_np535.out and test_np535.err)
  3. the run hangs and no send/recv exchanges complete (see test_np540.out and test_np540.err)

Note that

  • test_npXXX.out and test_npXXX.err are the log and err files for np=XXX
  • the problem seems to occur with fewer MPI processes; I guess this is because ITAC slows down the runs

Best regards,

Thierry

SantoshY_Intel
Moderator
4,737 Views

Hi,


Thanks for providing the logs. We are working on your issue and we will get back to you soon.


Thanks & Regards,

Santosh


James_T_Intel
Moderator
4,730 Views

Do you see the same behavior on the current version, 2021.2? What hardware are you using?


thierrybraconnier
4,485 Views

Hello James,

The most recent compiler I have access to on the cluster is intel-compxe/19.1.3.304, and the most recent MPI library is intel-mpi/2020U4.

Here is some information about the hardware I am using:

uname -a
Linux r29i3n17 3.10.0-957.el7.x86_64 #1 SMP Thu Oct 4 20:48:51 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Stepping: 2
CPU MHz: 2900.238
CPU max MHz: 3300.0000
CPU min MHz: 1200.0000
BogoMIPS: 5000.18
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0-11,24-35
NUMA node1 CPU(s): 12-23,36-47
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts spec_ctrl intel_stibp flush_l1d

 

Regards,

Thierry


thierrybraconnier
4,259 Views

Hello James,

In addition to the information in my previous post, here is the output of the ibstat command:
CA 'mlx5_0'
CA type: MT4115
Number of ports: 1
Firmware version: 12.27.1016
Hardware version: 0
Node GUID: 0xb88303ffff8292fc
System image GUID: 0xb88303ffff8292fc
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 5281
LMC: 0
SM lid: 1
Capability mask: 0x2659e848
Port GUID: 0xb88303ffff8292fc
Link layer: InfiniBand

Do not hesitate to contact me if you need more information (just give me the commands to get it).

Best regards,

 

Thierry

thierrybraconnier
4,093 Views

Hello James,

Since I have not heard back from you in more than a month, I am following up.

If you need more information, please send me the commands for obtaining it and I will send the results to you.

 

Regards,

 

Thierry

James_T_Intel
Moderator
3,887 Views

I apologize for the delayed response. I have tested this on 2019 Update 9 (the version of the Intel® MPI Library included in Intel® Parallel Studio XE 2020 Update 4) and on 2021.3. The error appears in 2019 Update 9 and is resolved in 2021.3, where the program works as expected. Please update your version of the Intel® MPI Library to resolve this issue.


If you are unsure of how to upgrade, we have transitioned from Intel® Parallel Studio XE to Intel® oneAPI. Specifically, the Intel® MPI Library is part of Intel® oneAPI HPC Toolkit. Please go to https://software.intel.com/content/www/us/en/develop/tools/oneapi.html#gs.c2n2h8 for overview information about Intel® oneAPI and https://software.intel.com/content/www/us/en/develop/tools/oneapi/hpc-toolkit.html#gs.c2n2xn for information on how to download Intel® oneAPI HPC Toolkit.


As this is resolved with the latest version, I am proceeding with closing this thread for Intel support. Any further posts on this thread will be considered community only.


thierrybraconnier
3,601 Views

Hello James,

 

Many thanks for your answer, your help, and the time you spent on this issue.

 

Best regards,

 

Thierry
