According to Figure 3 in the "Intel Xeon Processor Scalable Family: Specification Update" (document 336065-005, February 2018), the Xeon Gold 6148 processor has a "base" AVX-512 frequency of 1.6 GHz and a maximum all-core AVX-512 frequency of 2.2 GHz. This gives a "base" peak performance of 20 cores * 32 FLOPS/cycle * 1.4 GCycle/s = 0.896 TFLOPS per socket (3.584 TFLOPS/node), and a maximum theoretical peak for AVX-512 code of 20 cores * 32 FLOPS/cycle * 2.2 GCycle/s = 1.408 TFLOPS per socket (5.632 TFLOPS/node). The 6.144 TFLOPS "theoretical" peak is based on the 2.4 GHz "nominal" frequency of the processor. This level of performance is not achievable (even theoretically) because the processor cannot run that fast while executing AVX-512 instructions.
The 3.81255 TFLOPS reported at the web site above is 106.4% of the "base" peak performance. This means that the processor is able to run faster than the "base" AVX-512 frequency of 1.6 GHz. The reported 3.8 TF is 67.7% of the *maximum* AVX-512 peak performance, but large HPL runs are always power-limited when running in AVX-512 mode on SKX processors, so you should not expect to see the maximum 2.2 GHz.
I have not tested many Xeon Gold 6148 processors, but our Xeon Platinum 8160 processors have a "base" AVX-512 frequency of 1.4 GHz and a maximum all-core AVX-512 frequency of 2.0 GHz, and they run HPL at an average frequency ranging from a low of 1.55 GHz to a high of 1.75 GHz. For each chip, the average frequency is a function of the leakage current of the die (which varies from die to die), the effectiveness of the cooling system at the location of the die, and a few smaller factors including DRAM frequency and uncore frequency. This 12.9% range was observed on our full set of 1736 2-socket nodes -- smaller subsets typically show less variation.
My own (extensive) measurements on Xeon Platinum 8160 processors show that single-socket HPL tends to run at about 92% of the "peak" based on the actual operating frequency of the processor during the test. Assuming a similar efficiency on the Xeon Gold 6148 points to an average frequency of slightly over 1.6 GHz for the 3.8 TF result and about 1.7 GHz for your "around 4 TFLOPs" result.
A further complication is that the version of HPL in the HPCC benchmark suite does static partitioning of the workload across the available processors. So a better way to describe the result is that the *slowest* socket has an average frequency of slightly over 1.6 GHz. The other sockets might be faster, but if they finish early they will just have to wait for the slowest socket to finish. Intel has a version of the HPL benchmark that allows non-uniform static partitioning of the data. After testing the HPL performance of each socket separately, the workload can be statically distributed to provide close to uniform execution time.
Finally, in my experience the version of the HPL benchmark included in the HPCC suite is not optimal, even when linked with the Intel MKL libraries. Your results with parallel_studio/mkl/benchmarks/mp_linpack/xhpl_intel64_static are about 5% higher than the results at the HPCC web site. This might be due to efficiency differences in the benchmark, or it might be due to different thermal characteristics of the chips or different cooling capability on each system.
Some of these issues are discussed in my SC18 paper: http://sites.utexas.edu/jdm4372/2019/01/07/sc18-paper-hpl-and-dgemm-performance-variability-on-intel-xeon-platinum-8160-processors/
Where can I find the "Intel Xeon Processor Scalable Family: Specification Update" for the Platinum 8360Y? I need to know the frequency table. Thanks.
Thanks for the quick and detailed reply.
In my understanding, the major factor is the AVX-512 frequency. How can I sample or trace the frequency while Linpack is running?
Some versions of the Intel HPL benchmark provide intermediate results including core frequency, uncore frequency, power consumption, and temperature. I think that the core frequency is only reported for the "master" thread, and the uncore frequency, power, and temperature are only reported for the socket (or die) that the master thread is running on.... An example of what the output looks like is:
```
Frac   N      PFact Bcast Swap  Update PPerf BcstBW SwapBW CPU   Kernel  Total   Powr  Dpwr Tmp CFreq Ufreq WhoamI[rank,row,col]
0.00   0      0.240 0.061 0.459 0.302  49.11 0      0      0.00  39.05   49.11   92.0  12.0 65  3.241 2.361 c506-134[0,0,0]
0.32   384    0.438 0.099 0.507 4.628  26.83 0      760    72.47 2381.86 2042.42 151.6 16.5 65  1.616 1.502 c506-134[0,0,0]
0.64   768    0.464 0.119 0.518 4.843  25.28 0      687    69.95 2261.83 2128.28 149.6 16.6 66  1.577 1.479 c506-134[0,0,0]
0.96   1152   0.468 0.118 0.518 4.931  24.94 0      669    69.90 2207.01 2151.68 149.6 16.6 67  1.574 1.477 c506-134[0,0,0]
1.28   1536   0.473 0.118 0.527 4.819  24.61 0      668    69.88 2243.91 2179.04 149.6 16.6 68  1.573 1.477 c506-134[0,0,0]
1.60   1920   0.477 0.118 0.534 4.813  24.34 0      654    69.90 2231.92 2192.10 149.6 16.6 68  1.573 1.476 c506-134[0,0,0]
1.92   2304   0.469 0.117 0.529 4.819  24.67 0      644    69.95 2214.72 2186.32 149.6 16.6 69  1.574 1.478 c506-134[0,0,0]
2.24   2688   0.482 0.119 0.529 4.724  23.91 0      647    69.92 2244.55 2191.42 149.6 16.7 70  1.570 1.475 c506-134[0,0,0]
2.56   3072   0.472 0.118 0.519 4.741  24.34 0      645    69.93 2221.82 2196.06 149.6 16.7 70  1.571 1.476 c506-134[0,0,0]
2.88   3456   0.467 0.118 0.507 4.630  24.54 0      656    70.07 2260.31 2210.51 149.6 16.6 71  1.575 1.479 c506-134[0,0,0]
3.20   3840   0.453 0.115 0.489 4.702  25.22 0      668    70.30 2211.34 2207.30 149.6 16.4 71  1.582 1.483 c506-134[0,0,0]
3.52   4224   0.440 0.112 0.486 4.494  25.87 0      692    70.59 2298.07 2214.38 149.6 16.2 72  1.589 1.487 c506-134[0,0,0]
3.84   4608   0.438 0.112 0.494 4.578  25.87 0      693    70.33 2241.39 2219.04 149.6 16.3 73  1.588 1.486 c506-134[0,0,0]
4.16   4992   0.444 0.113 0.488 4.589  25.46 0      680    70.37 2221.13 2221.49 149.6 16.4 73  1.584 1.484 c506-134[0,0,0]
[...]
98.24  117888 0.007 0.001 0.002 0.017  28.08 0      1923   59.16 242.59  2237.54 87.6  10.2 74  2.946 2.394 c506-134[0,0,0]
98.56  118272 0.007 0.001 0.001 0.014  25.65 0      2403   62.18 194.28  2237.48 84.6  10.9 74  2.970 2.395 c506-134[0,0,0]
98.88  118656 0.006 0.001 0.001 0.012  22.25 0      2416   60.46 144.53  2237.44 76.5  9.8  73  2.909 2.394 c506-134[0,0,0]
99.20  119040 0.006 0.001 0.000 0.010  16.94 0      2886   59.65 96.04   2237.40 65.4  8.4  73  2.933 2.394 c506-134[0,0,0]
99.52  119424 0.005 0.000 0.000 0.008  12.27 0      1562   57.12 53.34   2237.37 78.4  9.9  72  2.843 2.395 c506-134[0,0,0]
99.84  119808 0.002 0.000 0.000 0.003  2.61  0      14     49.58 20.65   2237.37 79.2  9.7  72  2.713 2.394 c506-134[0,0,0]

Peak Performance = 2242.89 GFlops / 2242.89 GFlops per node
================================================================================
T/V                N    NB     P     Q         Time          Gflops
--------------------------------------------------------------------------------
WC00L2L4      120000   384     1     1       515.44     2.23502e+03
HPL_pdgesv() start time Wed Feb 14 15:33:29 2018
HPL_pdgesv() end time   Wed Feb 14 15:42:05 2018
HPL Efficiency by CPU Cycle 90.893%
HPL Efficiency by BUS Cycle 69.695%
```
When a processor is operating in power-limited mode and all the cores are operating in the same SIMD mode (e.g., AVX-512), all the cores will typically run at the same frequency -- so getting only the frequency of core 0 is not usually a problem. Of course the uncore frequency, package power consumption, and (maximum) package temperature are the same for all cores on the same die, so the output above is most useful if you are running the benchmark on a single chip at a time.
When running HPL on multiple dies, differences in average frequency lead to load imbalances, and the frequency/power/temperature information from the "master" thread is less helpful. I developed https://github.com/jdmccalpin/periodic-performance-counters to monitor *all* cores and *all* sockets to help understand the HPL performance variability that I reported in my SC18 paper.
It looks like the version of the benchmark provided with MKL only provides performance and not frequency values. You can monitor frequency and energy use with "perf stat" if your kernel is recent enough. I was able to get 1-second sampling of reference cycles, cpu cycles, and package energy by modifying the "mpirun" command in the "runme_intel64_dynamic" script to:
```
perf stat -a -A -e power/energy-pkg/ -e ref-cycles -e cpu-cycles -I 1000 -o xhpl_perf_stat_1_second.txt mpirun -perhost ${MPI_PER_NODE} -np ${MPI_PROC_NUM} ./runme_intel64_prv "$@" | tee -a $OUT
```
For a small test case (N=50400) on a 2s Xeon Platinum 8160 system, the timed part of the execution took 43.88 seconds (1945 GFLOPS), while the "perf stat" output included 54 records -- including about 10 seconds for problem setup and result validation. Since the sample interval is very close to 1 second, the 1-second delta energy values in Joules are also very close to power consumption in Watts. This started low, jumped to 171W in the 7th second, then stayed very close to 150W (processor TDP) through the 50th second. Looking at just CPU0, the "ref-cycles" event incremented by just under 2.1 billion per second -- very close to the expected 2.1 GHz -- while the "cpu-cycles" dropped to a minimum of 1.585 billion in the 9th second, before increasing to 1.8-1.9 billion for the remainder of the compute-intensive section. In this two-socket system, core 24 is in the other socket, and it is clearly the slow one, dropping to 1.45 GHz during seconds 9-16, then slowly increasing to just over 1.5 GHz by second 42. It is likely that the elevated frequency in socket 0 is due to it finishing faster and then spin-waiting in non-AVX-512 code while waiting for socket 1 to finish its half of the work.
Adding the event "-e core_power.lvl2_turbo_license" to the "perf stat" command provides information on how many cycles were spent running AVX-512 code. In the middle of the run, these numbers were almost the same on the two sockets (which they should be, since the same amount of work is assigned to each socket), while the cpu-cycles were higher on socket 0 -- a pretty clear sign that cores in socket 0 were finishing first and spin-waiting. If the spin-waiting lasts longer than about 1 millisecond, the core will turn off the AVX-512 units and ramp up to a higher frequency (typically the "non-AVX" frequency, corresponding to core_power.lvl0_turbo_license cycles).
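For convenience, here is the same "perf stat" wrapper with the two license-level events added. This is only a sketch based on the command above; event names and availability depend on your kernel and on the processor's PMU, so check "perf list" first.

```bash
# Sketch: the perf stat line from above, extended with the AVX-512 license-level events.
# Check `perf list | grep core_power` to confirm these events exist on your system.
perf stat -a -A \
    -e power/energy-pkg/ -e ref-cycles -e cpu-cycles \
    -e core_power.lvl0_turbo_license -e core_power.lvl2_turbo_license \
    -I 1000 -o xhpl_perf_stat_1_second.txt \
    mpirun -perhost ${MPI_PER_NODE} -np ${MPI_PROC_NUM} ./runme_intel64_prv "$@" | tee -a $OUT
```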
I can't figure out how to force the Intel binary that you ran to run on a single socket, but if you can figure that out, the perf stat command will show you the frequency for each socket. Then you can use the instructions at https://software.intel.com/en-us/mkl-linux-developer-guide-heterogeneous-support-in-the-intel-distribution-for-linpack-benchmark to allocate the work in proportion to the expected performance of each socket.
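As a rough illustration (not from the original post), the per-core cycle counts in the interval output can be converted into approximate frequencies with a one-liner like the one below. It assumes the usual "time CPU count event" layout of "perf stat -I -A -a -o"; adjust the field numbers if your perf version prints a unit column or uses a different thousands separator.

```bash
# Approximate per-core frequency in GHz from the 1-second cpu-cycles samples.
# Assumed fields per line: <time> <CPU> <count> <event>, counts using "," separators.
awk '$4 == "cpu-cycles" { gsub(",", "", $3); printf "t=%ss %s %.3f GHz\n", $1, $2, $3/1e9 }' \
    xhpl_perf_stat_1_second.txt
```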
Thanks for your nice reply. I will use perf events to check the AVX-512 core frequency.
John, please answer: why do you use 16 DP FLOPS/cycle for AVX-512?
Here I see that AVX & FMA for Skylake is 32 DP FLOPS/cycle for one AVX-512 module.
For AVX2, the SIMD register width is up to 256 bits. For 64-bit doubles, this is 4 elements. The FMA instruction performs one add and one multiply (2 FLOPS) on each element, for a total of 8 FLOPS per instruction. Processors that support AVX2/FMA have two functional units (ports 0 and 1) per physical core, so they have a peak FP operation rate of 16 per cycle.
For AVX512, the numbers are doubled, so 16 FLOPS per cycle per AVX512 FMA unit. Processor cores with a single AVX512 FMA unit have a peak FP operation rate of 16 per cycle, while those with two AVX512 FMA units have a peak FP operation rate of 32 per cycle.
For the Xeon Scalable Processors with one AVX512 FMA unit, the peak FP operation rate is the same for AVX2 and AVX512, except that the maximum Turbo frequency is higher when using AVX2.
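The counts above, written out explicitly (just a restatement of the arithmetic in the preceding paragraphs):

```bash
# Peak DP FLOPS/cycle/core = (doubles per SIMD register) x 2 (FMA = one multiply + one add) x (FMA units)
echo "AVX2,    2 FMA ports: $((4 * 2 * 2)) FLOPS/cycle"   # 256-bit registers hold 4 doubles
echo "AVX-512, 1 FMA unit:  $((8 * 2 * 1)) FLOPS/cycle"   # 512-bit registers hold 8 doubles
echo "AVX-512, 2 FMA units: $((8 * 2 * 2)) FLOPS/cycle"
```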
Thx a lot!
Hello John, thank you for such a wonderful explanation. I had been observing everything you mention here before reading this thread. The only way I have found to get better HPL efficiency is to do socket-wise runs and use the heterogeneous mode. I have a few queries:
1. What are the BIOS settings on your machine that gave the best performance? For example, "Uncore Freq Scaling (UFS)", "Stale AtoS", "Workload Configuration", and a few other C-state and P-state parameters.
2. I observed that when running on multiple nodes, the "Freq" given in your example output is from core 0 of the socket on which the particular rank is running. But I have no clue about "UFreq". How is this uncore frequency controlled, and what is its impact on HPL and other workloads?
3. What OS tuning needs to be done to avoid OS jitter? When I use "htop", I see the master threads alternating between red and green, which I infer as context switching between HPL and OS scheduling processes.
4. In the header line "Frac N PFact Bcast Swap Update PPerf BcstBW SwapBW CPU Kernel Total Powr Dpwr Tmp CFreq Ufreq WhoamI [rank,row,col]", what do PFact, Bcast, Swap, PPerf, BcstBW, and SwapBW stand for? Can they be used as KPIs to tune the HPL.dat file or other settings?
If I recall correctly, we did not do much BIOS-option tuning for our HPL runs -- we used Dell's "Maximum Performance Mode".
The Power Control Unit in the processor automagically balances the Uncore Frequency ("UFreq") against the core frequency when the chip is running at the maximum (TDP) power level. For HPL, you can typically get a slight (1%-3%) improvement in performance by manually forcing a lower frequency for the Uncore using MSR_UNCORE_RATIO_LIMIT (0x620, described in Volume 4 of the Intel Architectures SWDM). HPL does not require a lot of uncore traffic, so reducing the uncore frequency allows a bit more of the power budget to go to the cores without reducing performance.
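As a hedged illustration (not a recipe from this thread), the uncore limits can be inspected and capped with msr-tools. The bit layout shown in the comments is what Volume 4 of the SDM documents for MSR_UNCORE_RATIO_LIMIT, but verify it for your processor model before writing anything.

```bash
# Sketch using msr-tools (requires root and the 'msr' kernel module).
# MSR 0x620 (MSR_UNCORE_RATIO_LIMIT): bits 6:0 = max ratio, bits 14:8 = min ratio,
# both in multiples of 100 MHz.
modprobe msr
rdmsr -p 0 0x620          # read the current limits from CPU 0
wrmsr -a 0x620 0x1414     # example: force min = max = 20 -> 2.0 GHz uncore
```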
I think Intel might have given us a recipe to follow for disabling timer interrupts on most of the cores, but I think that makes more difference on KNL than on SKX. My analysis shows that single-socket DGEMM performance reaches 91%-92% of peak when "peak" is computed using the actual frequency sustained, so there is not a lot of room for improvement. HPL is a little harder to characterize, since the power-limited frequency varies during the run, but if I recall correctly the numbers are similar. It takes a larger problem size to reach asymptotic performance with HPL (vs DGEMM), but the run-time required is still only a few minutes on a single node. The single-node (2-socket) performance varies by almost 13% across the nodes, so we grouped the nodes into four performance bins and used Intel's heterogeneous HPL implementation to divide the work proportionately.
The "best" options for block size and P,Q probably depend on the processor model and the interconnect performance. The 1MiB SKX L2 cache corresponds to a block size of NB=362. If I recall correctly, I sometimes got slightly better performance with larger block size (368 or 384), but performance was also more variable in those cases. In the end, I think NB=336 gave the best balance between performance and variability. This may depend on the number cores and L3 slices of your processors, since that will effect the effectiveness of the L3 victim cache for block sizes that result in slightly elevated L2 miss rates....
Thank you, John.
I observe the inverse with the uncore frequency: if I increase the uncore frequency, the memory bandwidth is higher and hence single-node performance is better. I see an almost 40% drop in memory bandwidth between setting the MSR to its maximum and to its minimum.
Also, I would like to understand what frequency should be observable during HPL runs (https://en.wikichip.org/wiki/intel/xeon_platinum/8268). According to WikiChip, when all cores are loaded and Turbo is on, AVX2 executables should run at 3 GHz and AVX-512 executables at 2.6 GHz [1]. Why do I observe only 2.1-2.2 GHz during the peak of the HPL run, with the frequency only reaching the values in [1] near the end, when the processor goes into a spin loop?
Also, you mentioned "so we grouped the nodes into four performance bins". How was this done? Can you share more details?
Any explanation for this?
"Frac N PFact Bcast Swap Update PPerf BcstBW SwapBW CPU Kernel Total Powr Dpwr Tmp CFreq Ufreq WhoamI [rank,row,col]" - In this column, what does PFact, BCast, Swap, PPerf, BcstBW, SwapBW stand for? Can they be used as any KPIs to tune the HPL.dat file or some other settings?
Bandwidth depends on uncore frequency, but highly optimized HPL benchmarks don't need very much bandwidth. The increase in core frequency made available by reducing the uncore frequency almost always improved HPL performance on my Xeon Platinum 8160 and 8280 nodes. The behavior might be different in processors with significantly different core counts....
As I discuss at http://sites.utexas.edu/jdm4372/2019/01/07/sc18-paper-hpl-and-dgemm-performance-variability-on-intel-xeon-platinum-8160-processors/, I found that I needed to use 1 GiB pages to get tightly repeatable HPL performance on the Xeon Platinum 8160. Once the compatible version of HPL was available, the HPL performance was almost perfectly repeatable -- across a set of 31 nodes and 247 single-node HPL runs on each, the *largest* run-to-run variation in HPL performance on any node was only 0.7% (slide 21).
The Intel HPL benchmark code includes support for heterogeneous operation using a static distribution of different numbers of columns to each node. This can be used to provide static load-balancing for systems that are heterogeneous in processor type, but it can also be used to provide static load-balancing for systems that are heterogeneous in performance: https://software.intel.com/en-us/mkl-linux-developer-guide-heterogeneous-support-in-the-intel-distribution-for-linpack-benchmark
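For reference only (not taken from the paper), 1 GiB pages are typically made available on Linux through kernel boot parameters; the page count below is just a placeholder that you would size to your memory.

```bash
# Sketch: reserve 1 GiB huge pages at boot, then verify after a reboot.
# Add to the kernel command line (e.g., GRUB_CMDLINE_LINUX on many distributions):
#   default_hugepagesz=1G hugepagesz=1G hugepages=160
grep -i -e Hugepagesize -e HugePages_Total /proc/meminfo
```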
I don't know if the interim runtime updates from the Intel HPL distribution provide any directly actionable information for tuning the input parameters. The heterogeneous version of the code requires column-major mapping, which is not always the fastest for homogeneous runs. The best results from my analytical model and from actual runs typically have a P,Q decomposition with Q being 2-4 times larger than P.
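To make the knobs concrete, these are the relevant lines of a standard HPL.dat input file. The values are placeholders in the spirit of the numbers discussed in this thread (N and NB from the examples above, P=2/Q=4 as one Q ≈ 2P decomposition), not the author's tuned settings.

```
1            # of problems sizes (N)
120000       Ns
1            # of NBs
336          NBs
1            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
4            Qs
```

PMAP=1 (column-major) is shown because, as noted above, the heterogeneous mode requires column-major mapping.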