Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Intel MPI Allreduce Scalability Problem

MetMan
New Contributor I
764 Views

Hi, I'm having an issue with MPI_Allreduce scalability using Intel MPI. On our cluster, each compute node has 64 cores, and I tested with the IMB benchmark. Runs with 48 cores per node and 60 cores per node (i.e., without using all cores of a node) give very different results: the 48-cores-per-node case performs significantly better than the 60-cores-per-node case. What could be the cause?

 

FYI, I submit the job using sbatch, and Intel MPI uses its default settings, with a command similar to:

mpirun -np ${mpirun_np} -f hosts -perhost ${mpirun_perhost} IMB-MPI1 Allreduce -npmin 5400 -off_cache 60,64

 

I am using Intel MPI version 2021.6.0.
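
For reference, a minimal sketch of the kind of sbatch script used to launch these runs; the partition name, the environment-script path, and the host-file handling below are illustrative placeholders rather than the exact setup:

#!/bin/bash
#SBATCH --nodes=90
#SBATCH --ntasks-per-node=60          # 48 for the other case
#SBATCH --partition=compute           # placeholder partition name

# placeholder: load Intel MPI 2021.6.0 (prefix matches I_MPI_ROOT in the debug output below)
source /opt/hpc/software/mpi/intelmpi/2021.6.0/env/vars.sh

# build the host file from the Slurm allocation
scontrol show hostnames "$SLURM_JOB_NODELIST" > hosts

mpirun_perhost=${SLURM_NTASKS_PER_NODE}
mpirun_np=$(( SLURM_JOB_NUM_NODES * mpirun_perhost ))

mpirun -np ${mpirun_np} -f hosts -perhost ${mpirun_perhost} \
    IMB-MPI1 Allreduce -npmin 5400 -off_cache 60,64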

 

4320 MPI processes (48 cores/node, 90 nodes) result:

 

[attached screenshot: MetMan_0-1724815483936.png]

 

5400 MPI processes (60 cores/node, 90 nodes) result:

 

[attached screenshot: MetMan_1-1724815550552.png]

 

 

 

0 Kudos
11 Replies
TobiasK
Moderator
702 Views

@MetMan 
2021.6 is too old; please retry with the latest release, 2021.13.1 (oneAPI 2024.2.1).

 

MPI performance can depend on a wide range of variables. Please post your full HW and SW environment; otherwise it's just guesswork.

0 Kudos
MetMan
New Contributor I
688 Views

I didn't explicitly set any MPI environment variables; I used the default settings.

 

HW configuration:

 

$ lscpu

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 2
NUMA node(s): 4
Vendor ID: GenuineIntel
CPU family: 6
Model: 143
Model name: Intel(R) Xeon(R) Gold 6458Q
Stepping: 8
CPU MHz: 3999.638
CPU max MHz: 3101.0000
CPU min MHz: 800.0000
BogoMIPS: 6200.00
Virtualization: VT-x
L1d cache: 48K
L1i cache: 32K
L2 cache: 2048K
L3 cache: 61440K
NUMA node0 CPU(s): 0-15
NUMA node1 CPU(s): 16-31
NUMA node2 CPU(s): 32-47
NUMA node3 CPU(s): 48-63

 

Do you mean using the impi_info tool to get SW environment variables?

 

0 Kudos
TobiasK
Moderator
621 Views

By SW environment, we are referring to the OS and the SW stack for the NIC.
What do you use as the interconnect?

 

Please add the output of:
I_MPI_DEBUG=10 mpirun -np ${mpirun_np} -f hosts -perhost ${mpirun_perhost} IMB-MPI1 Allreduce -npmin 5400 -off_cache 60,64
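
For example, something like the following collects the OS details and keeps the full debug log in a file; this is only a sketch, and the exact commands available depend on your cluster:

# OS and kernel
cat /etc/os-release
uname -r

# libfabric providers / NIC interfaces visible on the node
fi_info

# capture the debug output together with the benchmark results
I_MPI_DEBUG=10 mpirun -np ${mpirun_np} -f hosts -perhost ${mpirun_perhost} \
    IMB-MPI1 Allreduce -npmin 5400 -off_cache 60,64 2>&1 | tee allreduce_debug.log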

0 Kudos
MetMan
New Contributor I
439 Views

Hi, TobiasK.

 

OS: CentOS 8.4.2105

Interconnect: InfiniBand

 

The full I_MPI_DEBUG=10 output is very long, so I am pasting only the relevant parts.

 

[0] MPI startup(): Intel(R) MPI Library, Version 2021.6 Build 20220227 (id: 28877f3f32)
[0] MPI startup(): Copyright (C) 2003-2022 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): shm segment size (137 MB per rank) * (60 local ranks) = 8277 MB total
[1500] MPI startup(): shm segment size (137 MB per rank) * (60 local ranks) = 8277 MB total

...

[0] MPI startup(): libfabric version: 1.13.2rc1-impi
[240] MPI startup(): shm segment size (137 MB per rank) * (60 local ranks) = 8277 MB total
[4140] MPI startup(): shm segment size (137 MB per rank) * (60 local ranks) = 8277 MB total
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): File "/opt/hpc/software/mpi/intelmpi/2021.6.0/etc/tuning_icx_shm-ofi_mlx_400.dat" not found
[0] MPI startup(): Load tuning file: "/opt/hpc/software/mpi/intelmpi/2021.6.0/etc/tuning_icx_shm-ofi_mlx.dat"
[0] MPI startup(): threading: mode: direct
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: app_threads: -1
[0] MPI startup(): threading: runtime: generic
[0] MPI startup(): threading: progress_threads: 0
[0] MPI startup(): threading: async_progress: 0
[0] MPI startup(): threading: lock_level: global
[0] MPI startup(): tag bits available: 20 (TAG_UB value: 1048575)
[0] MPI startup(): source bits available: 21 (Maximal number of rank: 2097151)
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 1788071 cmbc1653 32
[0] MPI startup(): 1 1788072 cmbc1653 33
[0] MPI startup(): 2 1788073 cmbc1653 34
[0] MPI startup(): 3 1788074 cmbc1653 35
[0] MPI startup(): 4 1788075 cmbc1653 36
[0] MPI startup(): 5 1788076 cmbc1653 37
[0] MPI startup(): 6 1788077 cmbc1653 38
[0] MPI startup(): 7 1788078 cmbc1653 39
[0] MPI startup(): 8 1788079 cmbc1653 40
[0] MPI startup(): 9 1788080 cmbc1653 41
[0] MPI startup(): 10 1788081 cmbc1653 42
[0] MPI startup(): 11 1788082 cmbc1653 43
[0] MPI startup(): 12 1788083 cmbc1653 44
[0] MPI startup(): 13 1788084 cmbc1653 45
[0] MPI startup(): 14 1788085 cmbc1653 46
[0] MPI startup(): 15 1788086 cmbc1653 0
[0] MPI startup(): 16 1788087 cmbc1653 1
[0] MPI startup(): 17 1788088 cmbc1653 2
[0] MPI startup(): 18 1788089 cmbc1653 3
[0] MPI startup(): 19 1788090 cmbc1653 4
[0] MPI startup(): 20 1788091 cmbc1653 5
[0] MPI startup(): 21 1788092 cmbc1653 6
[0] MPI startup(): 22 1788093 cmbc1653 7
[0] MPI startup(): 23 1788094 cmbc1653 8
[0] MPI startup(): 24 1788095 cmbc1653 9
[0] MPI startup(): 25 1788096 cmbc1653 10
[0] MPI startup(): 26 1788097 cmbc1653 11
[0] MPI startup(): 27 1788098 cmbc1653 12
[0] MPI startup(): 28 1788099 cmbc1653 13
[0] MPI startup(): 29 1788100 cmbc1653 14
[0] MPI startup(): 30 1788101 cmbc1653 16
[0] MPI startup(): 31 1788102 cmbc1653 17
[0] MPI startup(): 32 1788103 cmbc1653 18
[0] MPI startup(): 33 1788104 cmbc1653 19
[0] MPI startup(): 34 1788105 cmbc1653 20
[0] MPI startup(): 35 1788106 cmbc1653 21
[0] MPI startup(): 36 1788107 cmbc1653 22
[0] MPI startup(): 37 1788108 cmbc1653 23
[0] MPI startup(): 38 1788109 cmbc1653 24
[0] MPI startup(): 39 1788110 cmbc1653 25
[0] MPI startup(): 40 1788111 cmbc1653 26
[0] MPI startup(): 41 1788112 cmbc1653 27
[0] MPI startup(): 42 1788113 cmbc1653 28
[0] MPI startup(): 43 1788114 cmbc1653 29
[0] MPI startup(): 44 1788115 cmbc1653 30
[0] MPI startup(): 45 1788116 cmbc1653 48
[0] MPI startup(): 46 1788117 cmbc1653 49
[0] MPI startup(): 47 1788118 cmbc1653 50
[0] MPI startup(): 48 1788119 cmbc1653 51
[0] MPI startup(): 49 1788120 cmbc1653 52
[0] MPI startup(): 50 1788121 cmbc1653 53
[0] MPI startup(): 51 1788122 cmbc1653 54
[0] MPI startup(): 52 1788123 cmbc1653 55
[0] MPI startup(): 53 1788124 cmbc1653 56
[0] MPI startup(): 54 1788125 cmbc1653 57
[0] MPI startup(): 55 1788126 cmbc1653 58
[0] MPI startup(): 56 1788127 cmbc1653 59
[0] MPI startup(): 57 1788128 cmbc1653 60
[0] MPI startup(): 58 1788129 cmbc1653 61
[0] MPI startup(): 59 1788130 cmbc1653 62
[0] MPI startup(): 60 661144 cmbc1654 32

...

[0] MPI startup(): I_MPI_ROOT=/opt/hpc/software/mpi/intelmpi/2021.6.0
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_HYDRA_BOOTSTRAP=slurm
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=10
#----------------------------------------------------------------
# Intel(R) MPI Benchmarks 2021.4, MPI-1 part
#----------------------------------------------------------------
# Date : Fri Aug 30 12:42:54 2024
# Machine : x86_64
# System : Linux
# Release : 4.18.0-305.3.1.el8.x86_64
# Version : #1 SMP Tue Jun 1 16:14:33 UTC 2021
# MPI Version : 3.1
# MPI Thread Environment:


# Calling sequence was:

# imb/IMB-MPI1 Allreduce -npmin 5400 -off_cache 60,64

# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#

# List of Benchmarks to run:

# Allreduce

#----------------------------------------------------------------
# Benchmarking Allreduce
# #processes = 5400
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.02 0.09 0.03
4 1000 750.44 773.20 761.29
8 1000 797.67 843.62 826.09
16 1000 828.77 858.66 846.15
32 1000 814.64 858.83 841.92
64 1000 823.91 914.41 838.27
128 1000 850.97 944.32 871.04
256 1000 843.54 933.99 858.84
512 1000 844.91 945.35 867.06
1024 1000 845.97 938.68 860.53
2048 1000 855.51 956.05 869.80
4096 1000 862.21 954.42 874.54
8192 1000 855.78 951.25 874.08
16384 1000 994.88 1098.12 1022.46
32768 1000 1036.75 1149.22 1055.48
65536 640 1072.59 1180.20 1102.77
131072 320 1855.89 1943.43 1904.21
262144 160 1910.85 2016.92 1956.57
524288 80 2612.14 2770.10 2683.22
1048576 40 2359.23 2660.41 2484.20
2097152 20 3038.83 3444.00 3173.75
4194304 10 4575.82 5294.64 4807.26


# All processes entering MPI_Finalize

0 Kudos
TobiasK
Moderator
569 Views

As I mentioned above, please try the latest release version, which is 2021.13.1 (oneAPI 2024.2.1). Also make sure that your MLX stack is up to date, with the latest LTS version installed.
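
For reference, a rough sketch of how the installed stack versions can be queried and the newer Intel MPI picked up; the oneAPI installation path below is a placeholder, and ofed_info/ucx_info may or may not be present on your nodes:

# query the currently installed MLNX_OFED and UCX versions (read-only, no root needed)
ofed_info -s
ucx_info -v

# switch to Intel MPI 2021.13 from a oneAPI 2024.2 installation (placeholder path)
source /opt/intel/oneapi/mpi/2021.13/env/vars.sh
mpirun -V    # should now report Intel(R) MPI Library 2021.13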

0 Kudos
MetMan
New Contributor I
560 Views

Thank you, TobiasK.

I will try the latest Intel MPI version.

Installing the MLX stack may require root permission, which I don't have.

0 Kudos
MetMan
New Contributor I
239 Views

Hi TobiasK,

Sorry for taking so long to reply. I tried the latest Intel MPI version, 2021.13, and found that the difference in MPI_Allreduce time between 48 and 60 cores per node is still significant.

 

[attached screenshot: MetMan_0-1727319966743.png]

 

0 Kudos