I have a 6-node cluster with 12 cores per node, for a total of 72 cores.
When running the HPCC benchmark on 6 cores (1 core per node across the 6 nodes), the HPL result is 1198.87 GFLOPS. However, when running HPCC on all 72 available cores of the cluster, the HPL result is 847.421 GFLOPS.
MPI Library Used: Intel(R) MPI Library for Linux* OS, Version 2018 Update 1 Build 20171011 (id: 17941)
Options to mpiexec.hydra:
-print-rank-map
-pmi-noaggregate
-nolocal
-genvall
-genv I_MPI_DEBUG 5
-genv I_MPI_HYDRA_IFACE ens2f0
-genv I_MPI_FABRICS shm:tcp
-n 72
-ppn 12
-ilp64
--hostname filename
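Assembled into a single command line, these options correspond to something like the following (the hpcc binary name is assumed, and "filename" is the placeholder from the option list above):

    mpiexec.hydra -print-rank-map -pmi-noaggregate -nolocal -genvall \
        -genv I_MPI_DEBUG 5 -genv I_MPI_HYDRA_IFACE ens2f0 -genv I_MPI_FABRICS shm:tcp \
        -n 72 -ppn 12 -ilp64 --hostname filename ./hpcc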
Thanks in advance.
I would assume that Intel's MKL library is using all of the cores on each node. That is the default unless you link against the sequential version of MKL or set MKL_NUM_THREADS=1. The lower performance of your second test is probably due to the overhead of using 72 MPI tasks (each running 1 thread) rather than 6 MPI tasks (each running 12 threads).
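One quick way to test this assumption (a sketch; MKL_NUM_THREADS is the standard MKL thread-count control, and the binary name is assumed):

    # if MKL threading explains the 1199 GFLOPS, forcing 1 thread per rank
    # should drop the 6-rank result to roughly 6 cores' worth of FLOPS
    mpiexec.hydra -n 6 -ppn 1 -genv MKL_NUM_THREADS 1 ./hpcc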
John,
Yes, the intention is to run the HPCC benchmark on all 72 available cores.
P and Q have been set to 8 and 9 in the HPCC configuration file (hpccinf.txt), giving an 8 x 9 process grid for the 72 ranks, as shown below.
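The relevant lines of hpccinf.txt look like this (the grid-count line is the standard single-grid setting; only Ps and Qs are from the run described above):

    1            # of process grids (P x Q)
    8            Ps
    9            Qs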
Do you have any recommendations for reducing this overhead?
Wilson
You did not tell us what sort of system you are running, so it is not possible to tell if the performance you reported is near expected values.
The best performance is likely to be obtained using one MPI task per node, or one MPI task per socket if these are multi-socket nodes. Using more MPI tasks gives each task less data to work on, so the compute-to-communication ratio is reduced. For a fixed problem size, a lower compute-to-communication ratio corresponds to an increase in communication, which will always reduce performance; the degree of reduction depends on the cluster interconnect and the problem size.
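With Intel MPI, a one-task-per-node hybrid run along those lines would look roughly like this (a sketch; the hpcc binary name is assumed, and I_MPI_PIN_DOMAIN=node confines each rank's threads to its own node):

    # one rank per node, threaded MKL using all 12 cores of each node
    mpiexec.hydra -n 6 -ppn 1 -genv I_MPI_PIN_DOMAIN node -genv MKL_NUM_THREADS 12 ./hpcc

The P x Q grid in hpccinf.txt would then need to be sized for 6 ranks (e.g., 2 x 3).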
Thanks for the replies, John.
The cluster consists of 6 nodes, each with an Intel(R) Xeon(R) D-1557 CPU @ 1.50 GHz (12 cores) and 16 GB of RAM.
Wilson
The Xeon D-1557 uses the Broadwell core, so its peak throughput is 2 256-bit FMA instructions per cycle, or 16 double-precision FLOPS per cycle (2 FMAs x 4 doubles x 2 operations each). As far as I can tell, Intel has not published the minimum guaranteed frequency for power-limited 256-bit AVX operation for any of the Broadwell servers, but we can start with the assumption that it will be close to the nominal frequency of 1.50 GHz. 12 cores * 16 FLOPS/cycle * 1.5 GHz = 288 GFLOPS peak per node, or 1728 GFLOPS for the 6 nodes at 1.5 GHz. These values should be adjusted by the actual average frequency sustained during the run, which is reported by the Intel xhpl and mp_linpack binaries and can also be obtained using the "perf stat" command on Linux systems.
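For example, sampling whole-system counters while the benchmark is running is enough to see the realized frequency (standard perf usage; the 30-second window is arbitrary):

    # run while xhpl is executing; the GHz value derived from the 'cycles'
    # counter is the average core frequency over the sample window
    perf stat -a -- sleep 30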
Your best result using 6 nodes is 1199 GFLOPS, or 69.4% of this peak estimate. That is not a bad number. Whether it is possible to do better depends on the interconnect you are using. These nodes don't have much memory, so you are limited to a maximum problem size of about N=100,000. The problem size determines the compute-to-communication ratio, so systems with more memory can run bigger problems to reduce the relative impact of a slow interconnect between nodes. On machines with very large memory, the execution time for these big problems can become intolerable, but it looks like your memory is small enough that even the largest problem will take only about 10 minutes.
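As a quick check of those two figures (aggregate memory of 6 x 16 GB, 8-byte matrix elements, ~80% of memory for the N x N matrix, and HPL's 2/3*N^3 flop count at the measured 1199 GFLOPS):

    awk 'BEGIN { mem = 6 * 16e9; n = sqrt(0.8 * mem / 8);
                 printf "N ~ %d\n", n;
                 printf "t ~ %.0f s\n", (2/3) * n^3 / 1199e9 }'

which prints N ~ 97979 and t ~ 523 s, consistent with the ~100,000 and roughly-10-minute figures above.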
On systems with both large memory and a fast interconnect (e.g., InfiniBand or Omni-Path), it is typically possible to get between 80% and 90% of peak (based on the actual average frequency during the run) with Haswell/Broadwell processors when using Intel's optimized binaries.