HPLinpack power consumption profile on Sandy Bridge server

SamG · 03-24-2018

Hi,

I am currently using Sandy Bridge servers with E5-2670 chips in a dual-socket configuration.

I am trying to understand the correlation between:

1. Processor package power consumption (in watts, measured using Intel RAPL, package power domain),
2. CPU utilization percentage (from /proc/stat), and
3. DRAM access bandwidth (in MB/s, from the Intel pcm-memory utility)
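For readers who want to reproduce the first two quantities, they can be derived from counter deltas roughly as follows. This is a minimal sketch, not the measurement script used for the numbers in this post: the function names are hypothetical, and it assumes the RAPL package energy counter is read in microjoules (as the Linux powercap interface exposes it in `energy_uj`) rather than from raw MSRs.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into (busy, total) jiffies."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    # Use user, nice, system, idle, iowait, irq, softirq, steal.
    # The guest fields are skipped because they are already counted in user/nice.
    values = [int(v) for v in fields[1:9]]
    idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
    total = sum(values)
    return total - idle, total


def cpu_utilization_percent(before, after):
    """Utilization (all CPUs combined) between two /proc/stat snapshots."""
    busy0, total0 = parse_cpu_line(before)
    busy1, total1 = parse_cpu_line(after)
    elapsed = total1 - total0
    if elapsed <= 0:
        return 0.0
    return 100.0 * (busy1 - busy0) / elapsed


def package_power_watts(energy0_uj, energy1_uj, seconds, max_range_uj):
    """Average power from two energy readings in microjoules.

    Assumes the counter wrapped at most once between the two readings;
    max_range_uj is the wrap point (max_energy_range_uj in powercap).
    """
    delta_uj = energy1_uj - energy0_uj
    if delta_uj < 0:  # counter wrapped around
        delta_uj += max_range_uj
    return delta_uj / 1e6 / seconds
```

Sampling both counters at the same instants (for example once per second) keeps the resulting power and utilization series aligned in time, which matters for any later correlation.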

for a complete run of HPLinpack (compiled with Intel MKL). I was surprised to find that processor power consumption is higher during phases of intense memory (DRAM) access than during phases of intense CPU utilization. The processor power consumption profile while running Linpack was almost the opposite of the CPU utilization profile (all CPUs combined), and it closely resembled the DRAM access bandwidth profile.

In fact:
correlation coefficient of the power consumption profile (RAPL pkg domain) with CPU utilization percentage = about -0.9
correlation coefficient of the power consumption profile with DRAM access bandwidth = about 0.9
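For clarity about what these numbers mean, here is a minimal sketch of a Pearson correlation coefficient over time-aligned samples. It assumes the three series were sampled at the same instants; the function name and the sample values are made up purely for illustration and are not my measurements.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sample series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 samples")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Synthetic, perfectly linear series (illustration only).
    power_w = [100, 110, 120, 130]
    dram_mb_s = [10, 20, 30, 40]
    cpu_util_pct = [90, 80, 70, 60]
    print("power vs DRAM bandwidth:", pearson_r(power_w, dram_mb_s))
    print("power vs CPU utilization:", pearson_r(power_w, cpu_util_pct))
```

A value near +1 means the two profiles rise and fall together, and a value near -1 means one rises when the other falls, which is the pattern described above.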

I had expected the CPU to enter idle states (lowering power consumption) during periods of intense DRAM access.

I have always known HPLinpack as a highly compute-intensive workload, and I am finding the above observation very difficult to explain. I repeated the measurements, and I do not appear to be making a measurement error.

Any help is appreciated.

Thanks in advance.

McCalpinJohn · 03-26-2018

In the olden days, optimized versions of the LINPACK benchmark were associated with low memory bandwidth, but that is no longer the case. The short version of the story is:

Peak floating-point performance has increased more rapidly than memory bandwidth.
Cache sizes (per core) are smaller now than they were in the mid-1990s.

Looking at DGEMM as a proxy for LINPACK, it is relatively easy to show that the memory traffic reduction due to blocking is proportional to the square root of the cache size. Intel cache sizes have been fixed at 2.5 MiB/core for quite a while, even as peak FP performance continues to outstrip the available memory bandwidth. See, for example, slide 14 ("Memory Bandwidth is Falling Behind: (GFLOP/s)/(GWord/s)") and slide 20 ("Intel Processor GFLOPS/Package Contributions over time") at http://sites.utexas.edu/jdm4372/2016/11/22/sc16-invited-talk-memory-bandwidth-and-system-balance-in-hpc-systems/
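The square-root relationship can be illustrated with a simple single-level tiling model. This is only a sketch under stated assumptions, not how MKL actually blocks DGEMM: it assumes square b x b tiles, three tiles (one each of A, B, and C) resident in cache at once, and counts only the leading-order traffic for streaming A and B. The function names and the matrix size are hypothetical.

```python
import math

BYTES_PER_DOUBLE = 8


def block_size(cache_bytes):
    """Tile edge b such that three b x b double-precision tiles fill the cache."""
    return math.sqrt(cache_bytes / (3 * BYTES_PER_DOUBLE))


def dominant_traffic_bytes(n, cache_bytes):
    """Leading-order DRAM traffic for a tiled n x n DGEMM.

    Each of the (n/b)^2 tiles of C needs n/b tiles of A and n/b tiles of B,
    each of b^2 words, giving 2*n^3/b words in total.
    """
    b = block_size(cache_bytes)
    return 2.0 * n**3 / b * BYTES_PER_DOUBLE


if __name__ == "__main__":
    n = 20000  # hypothetical matrix dimension
    for mib in (2.5, 10.0, 40.0):
        traffic = dominant_traffic_bytes(n, mib * 2**20)
        print(f"{mib:5.1f} MiB cache -> {traffic / 1e9:8.1f} GB of A/B traffic")
```

In this model the traffic scales as 1/sqrt(cache size), so quadrupling the cache only halves the DRAM traffic. That is why a fixed per-core cache cannot keep up with rapidly growing peak FP rates.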

The attached chart was in the backup materials for the presentation above, but did not make it into the final presentation. It shows that by the time we get to Xeon E5 v3 (Haswell), DGEMM requires about 1/2 of the maximum sustainable bandwidth of the processor. In a 2-socket system, accidentally placing all the memory on one socket limits the bandwidth enough to prevent good scaling from one socket to two sockets. (In this case, simply interleaving the data between the two sockets is enough to provide good 1s to 2s scaling, but it is clear that the trend is ugly.) The unlabelled hashed data point on the right was my estimate (in November 2016) of where the Skylake Xeon processors would fall. The actual numbers for Skylake Xeon are not as bad -- I measure 100 GB/s (50% of max sustainable bandwidth) for a 2-socket Xeon Platinum 8160 running single-node HPL (Intel's version), and about 33 GB/s (33% of max sustainable bandwidth) for DGEMM running on a single socket of the same system. These are all reasonable numbers -- there are many different ways to combine cache blocking and parallelization that can provide a modest decrease in the bandwidth requirements -- but the important thing is that they are not small bandwidth values.