MPI_Allreduce is too slow

seongyun_k_ · ‎02-19-2017

I used the Intel MPI Benchmarks (IMB) to measure the performance of MPI_Allreduce across a rack of 25 machines connected by a 40 Gbit/s InfiniBand switch. (I am using the latest version of Parallel Studio 2017 on CentOS 7 with Linux kernel 3.10.)

mpiexec.hydra -genvall -n 25 -machinefile ./machines ~/bin/IMB-MPI1 Allreduce -npmin 25 -msglog 26:29 -iter 1000,128

#------------------------------------------------------------
# Intel (R) MPI Benchmarks 4.1 Update 1, MPI-1 part
#------------------------------------------------------------
# Date : Mon Feb 20 16:40:26 2017
# Machine : x86_64
# System : Linux
# Release : 3.10.0-327.el7.x86_64
# Version : #1 SMP Thu Nov 19 22:10:57 UTC 2015
# MPI Version : 3.0

...

# /home/syko/Turbograph-DIST/linux_ver/bin//IMB-MPI1 Allreduce -npmin 25 -msglog 26:29 -iter 1000
#

# Minimum message length in bytes: 0
# Maximum message length in bytes: 536870912
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#

# List of Benchmarks to run:

# Allreduce

#----------------------------------------------------------------
# Benchmarking Allreduce
# #processes = 25
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.12         0.18         0.15
     67108864            2    298859.48    340774.54    329296.10
    134217728            1    619451.05    727700.95    687140.46
    268435456            1   1104426.86   1215415.00   1177512.81
    536870912            1   2217355.97   2396162.03   2331228.14

# All processes entering MPI_Finalize

So I conclude that the performance of MPI_Allreduce is about (number of bytes / sizeof(float) / elapsed time) ~= 57 mega-elements/sec.
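The estimate above can be rechecked with a few lines of Python. This is only a sketch: it assumes the elapsed time is the t_avg column and that an MPI_FLOAT element is 4 bytes; the function name is made up for illustration.

```python
# Rows copied from the IMB output above: (message size in bytes, t_avg in usec).
rows = [
    (67108864, 329296.10),
    (134217728, 687140.46),
    (268435456, 1177512.81),
    (536870912, 2331228.14),
]

SIZEOF_FLOAT = 4  # bytes per MPI_FLOAT element (assumed 32-bit float)


def mega_elements_per_sec(nbytes, t_usec):
    """Elements reduced per second, in millions (hypothetical helper)."""
    elements = nbytes / SIZEOF_FLOAT
    seconds = t_usec * 1e-6
    return elements / seconds / 1e6


rates = [mega_elements_per_sec(b, t) for b, t in rows]
for (nbytes, _), rate in zip(rows, rates):
    print(f"{nbytes:>10d} bytes: {rate:5.1f} Melem/s")
```

With these inputs the two largest message sizes come out near the quoted 57 mega-elements/sec, and the two smaller sizes land somewhat lower.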

This throughput is far below what I expected. The network bandwidth usage is also much lower than the maximum bandwidth of InfiniBand.
Is this performance acceptable for Intel MPI? If not, is there something I can do to improve it?
(I tried varying 'I_MPI_ADJUST_ALLREDUCE', but the results were not satisfactory.)

McCalpinJohn · ‎02-20-2017

Do you have an analytical model of what you think the performance should be?

I don't typically use MPI_Allreduce on large vectors, and I have never used it on MPI_BYTE data types, but it seems to me that a huge amount of data transfer is going to be required here.

One implementation (analogous to a common MPI_Alltoall algorithm) for the 64MiB size might be:

Each task copies its 64 MiB array to a local buffer.
For Step in 1..24
- Each rank "P" sends its 64 MiB of data to task "(P+Step)%25" while receiving 64 MiB from task "(P-Step+25)%25".
- Each rank adds the 64 MiB that it receives to the contents of the local buffer.
After 24 steps, all 25 ranks have the same sum in their local buffers.
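The steps above can be checked with a small single-process simulation. This is a sketch in plain Python, not MPI code: the function name is made up, the vectors are tiny, and integers are used so that the order of additions does not affect the result.

```python
def shifted_exchange_allreduce(data):
    """Simulate the shifted-exchange summation (hypothetical helper).

    data[p] is the input vector of rank p. Returns the per-rank local
    buffers after nranks - 1 exchange steps.
    """
    nranks = len(data)
    # Each rank starts by copying its own array to a local buffer.
    local = [list(vec) for vec in data]
    for step in range(1, nranks):
        for p in range(nranks):
            # Rank p receives the original data of rank (p - step) % nranks,
            # because that rank sends to (sender + step) % nranks == p.
            src = (p - step) % nranks
            received = data[src]
            # Add the received vector to the local buffer.
            local[p] = [a + b for a, b in zip(local[p], received)]
    return local


# Toy input: 25 ranks with 4 integer elements each.
NRANKS = 25
inputs = [[rank + i for i in range(4)] for rank in range(NRANKS)]
result = shifted_exchange_allreduce(inputs)
expected = [sum(col) for col in zip(*inputs)]
print(all(buf == expected for buf in result))
```

Over the 24 steps every rank receives each other rank's data exactly once, which is why all local buffers end up holding the full sum.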

The total data traffic for this algorithm is 24*64 MiB (~1.61e9 bytes) sent by each task and 24*64 MiB received by each task. At the minimum timing of 0.298 seconds, this corresponds to a sustained bandwidth of 1.61e9/0.298 = 5.4 GB/s per direction. This is slightly higher than the 5 GB/s nominal peak throughput of a 40 Gbit/s network, so a more clever algorithm must be in use.
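The arithmetic in this paragraph, written out as a short sketch. The 40 Gbit/s figure is treated as a nominal line rate with no allowance for encoding or protocol overhead, which is an assumption; the variable names are illustrative.

```python
MIB = 1024 * 1024

nranks = 25
msg_bytes = 64 * MIB
t_min_s = 0.298  # rounded t_min for the 64 MiB row of the benchmark

# Each rank sends (and receives) one full message per step, for 24 steps.
steps = nranks - 1
bytes_sent_per_rank = steps * msg_bytes

# Sustained bandwidth per direction that this traffic would require.
bandwidth_gb_s = bytes_sent_per_rank / t_min_s / 1e9

# Nominal peak of a 40 Gbit/s link in GB/s (assumes no overhead).
link_peak_gb_s = 40e9 / 8 / 1e9

print(f"sent per rank: {bytes_sent_per_rank:.3e} bytes")
print(f"required:      {bandwidth_gb_s:.2f} GB/s")
print(f"link peak:     {link_peak_gb_s:.2f} GB/s")
```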

My guess would be that the actual algorithm implemented is a binary tree summation, with a binary tree broadcast of the results. Since 25 is not a power of 2, one would have to "pretend" that there were 32 ranks, giving five steps up and five steps down. This is 10/24 of the traffic of the naive algorithm I initially suggested, so the timings correspond to ~2.25 GB/s per link (unidirectional). This is a bit slower than one might like, but by less than a factor of 2.
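The step count and the scaled bandwidth in this guess can be reproduced as follows. This is a sketch of the estimate only; whether Intel MPI actually uses a binary tree here is not confirmed anywhere in this thread.

```python
import math

nranks = 25

# Round the rank count up to the next power of two: 25 -> 32, i.e. 5 levels.
levels = math.ceil(math.log2(nranks))

# Five steps up (reduction) plus five steps down (broadcast).
tree_steps = 2 * levels

# The shifted-exchange model needs one full-message step per other rank.
naive_steps = nranks - 1

# Bandwidth implied by the shifted-exchange model at t_min (from the text above).
naive_bw_gb_s = 5.4

# Fewer full-message steps in the same time means proportionally less
# bandwidth is needed per link.
tree_bw_gb_s = naive_bw_gb_s * tree_steps / naive_steps

print(f"{tree_steps} steps, about {tree_bw_gb_s:.2f} GB/s per link")
```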

If you know a better algorithm, then you can repeat the analysis above and compute the effective average link utilization. Alternatively, you can enable the Intel MPI statistics feature and learn exactly how much data has been moved between ranks.
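One way to repeat the analysis for other algorithms is to put the traffic model behind a small helper. The sketch below is illustrative: the helper name is made up, and the ring model (reduce-scatter followed by allgather, moving 2*(p-1)/p of one message per rank) is a textbook example added here as an assumption, not something measured or confirmed in this thread.

```python
MIB = 1024 * 1024


def effective_bandwidth_gb_s(critical_path_bytes, seconds):
    """Bandwidth implied by moving critical_path_bytes in the given time."""
    return critical_path_bytes / seconds / 1e9


nranks = 25
msg_bytes = 64 * MIB
t_min_s = 298859.48e-6  # t_min for the 64 MiB row of the benchmark

# Bytes that must cross one link in sequence under each model.
models = {
    "shifted exchange": (nranks - 1) * msg_bytes,
    "binary tree": 10 * msg_bytes,
    # Assumption, not from the thread: ring reduce-scatter + allgather.
    "ring": 2 * (nranks - 1) * msg_bytes / nranks,
}

results = {
    name: effective_bandwidth_gb_s(nbytes, t_min_s)
    for name, nbytes in models.items()
}
for name, gb_s in results.items():
    print(f"{name:>16s}: {gb_s:.2f} GB/s per link")
```

The lower the implied bandwidth for a given model, the more headroom that algorithm would leave on a 40 Gbit/s link if it were the one in use.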

Yuan_C_Intel · ‎02-24-2017

I have transferred this to the Intel® Clusters and HPC Technology forum.

Thanks