Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Unexpected power vs. cores profile for MKL kernels on modern Xeon Gold

antón_r_

Hi

I measured the power consumed by four kernels (dgemm, dtrsm, dsyrk and dpotrf, MKL 2018) using RAPL on a 20-core Xeon Gold 6138 for different core occupancies (1, 2, 4, 5, 10 and 20 cores), and got the attached power and efficiency profiles.
The unexpected thing is that on the Xeon Gold the power profile saturates at the TDP (125 W) at half-processor occupancy, even though Turbo Boost is disabled and all the core frequencies are fixed.

This behavior is different from the linear profile seen in older architectures (see the attached power and efficiency profiles for a Haswell processor). Many profile-based power and energy models in task-parallel runtime systems rely on that linear behavior for their accuracy.

In terms of efficiency (GFLOP/joule), on the Xeon Gold the half-occupancy efficiency is significantly worse (by around 30%) than in the full-occupancy case, while on the older Xeon E5 it stays roughly constant. In terms of performance vs. cores (FLOP/s), the profiles are reasonably linear for both Gold and E5.

I would like to know why this nonlinear power profile occurs, basically where the energy is going between one-core and half-processor occupancy. It has been reported ( https://repositories.lib.utexas.edu/handle/2152/61472 ) that in newer architectures the Turbo frequency depends on the number of active cores (raised for low core occupancy and held at a base frequency for maximum core occupancy). That could explain the nonlinear profile if Turbo were enabled, since a few cores running at a 3.7 GHz Turbo frequency could easily reach the TDP, but in this case (Turbo disabled) I cannot find any reason.


Other details:

  • Frequency of all cores fixed to the max base frequency (2 GHz) with cpufreq-set.
  • Turbo Boost disabled in the BIOS.
  • Running on a 2-socket Xeon 6138 platform, restricting the execution to cores of first socket only.
  • For each core occupation, fixed with taskset the thread affinity to [0], [0,1], [0,1,2,3], ... etc.
  • Reading "package" counter in RAPL.
  • The kernels operate on 24576x24576 matrices, but this behavior is consistently seen for other sizes.
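
For readers who want to reproduce the "package" reading, here is a minimal sketch of one way to turn RAPL energy readings into average power. It assumes the Linux powercap interface is available and that package 0 is exposed at /sys/class/powercap/intel-rapl:0; that path, and the helper names, are assumptions for illustration and not necessarily how the measurements above were taken.

```python
import time
from pathlib import Path

# Assumed powercap directory for package 0; the index can differ between systems.
RAPL_DIR = Path("/sys/class/powercap/intel-rapl:0")


def read_uj(name: str) -> int:
    """Read one integer value (microjoules) from the powercap directory."""
    return int((RAPL_DIR / name).read_text())


def average_power_w(e_start_uj: int, e_end_uj: int,
                    max_range_uj: int, seconds: float) -> float:
    """Average power in watts between two energy readings.

    The energy counter wraps around near max_range_uj, so an end value
    smaller than the start value is treated as exactly one wraparound.
    """
    delta = e_end_uj - e_start_uj
    if delta < 0:
        delta += max_range_uj
    return delta / 1e6 / seconds


def measure(interval_s: float = 1.0) -> float:
    """Sample package energy over interval_s seconds and return average watts."""
    max_range = read_uj("max_energy_range_uj")
    t0, e0 = time.monotonic(), read_uj("energy_uj")
    time.sleep(interval_s)
    t1, e1 = time.monotonic(), read_uj("energy_uj")
    return average_power_w(e0, e1, max_range, t1 - t0)
```

Calling measure() while a kernel runs pinned to a given core set gives one point of the power-vs-cores profile. Reading energy_uj usually requires root on recent kernels.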
McCalpinJohn

(1) Did you use performance counters to verify that the frequency was actually what you asked for?

(2) Did you disable uncore frequency scaling in the BIOS?   (It can also be controlled at the OS level, using MSR 0x620 to set the minimum and maximum uncore frequencies, and monitored at the OS level by setting bit 22 of MSR 0x703 to enable the UBox uncore cycle counter and reading the UBox cycle counts from MSR 0x704.)
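
As a rough illustration of point (2), the sketch below decodes the uncore ratio limits and converts two UBox cycle-counter readings into an average uncore frequency. It assumes the Linux msr driver (/dev/cpu/N/msr, root access), and it assumes the commonly documented layout of MSR 0x620: maximum ratio in bits 6:0 and minimum ratio in bits 14:8, in units of 100 MHz. The function names are hypothetical, and the bit layout should be checked against the documentation for your processor.

```python
import os
import struct

MSR_UNCORE_RATIO_LIMIT = 0x620      # min/max uncore ratio
U_MSR_PMON_UCLK_FIXED_CTR = 0x704   # UBox uncore cycle counter


def read_msr(cpu: int, address: int) -> int:
    """Read a 64-bit MSR via /dev/cpu/<cpu>/msr (needs root and the msr module)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, address))[0]
    finally:
        os.close(fd)


def decode_uncore_ratio_limit(value: int, bclk_mhz: int = 100) -> tuple[int, int]:
    """Return (min_mhz, max_mhz).

    Assumed layout: max ratio in bits 6:0, min ratio in bits 14:8.
    """
    max_ratio = value & 0x7F
    min_ratio = (value >> 8) & 0x7F
    return min_ratio * bclk_mhz, max_ratio * bclk_mhz


def uncore_mhz(count_start: int, count_end: int, seconds: float) -> float:
    """Average uncore frequency in MHz from two UBox cycle-counter readings.

    Counter wraparound is ignored here, which is fine for short intervals.
    """
    return (count_end - count_start) / seconds / 1e6
```

For example, decode_uncore_ratio_limit(read_msr(0, MSR_UNCORE_RATIO_LIMIT)) would show whether the uncore is free to move between two frequencies or pinned to one (min equal to max).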
 

antón_r_

Thank you John.

As you suggested, a closer look at the individual core frequencies confirmed that in the full 20-core occupancy case the frequency was being lowered to 1.7 GHz, while for the 1-to-10-core occupancies the frequency was maintained at 2 GHz.
Interestingly, this frequency-lowering policy is triggered when the default instruction set (AVX512) is enabled. In contrast, when MKL_ENABLE_INSTRUCTIONS=AVX2 is set, all 20 cores are able to run at 2 GHz as expected.
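
One common way to do this kind of per-core frequency check is with the IA32_APERF and IA32_MPERF counters; the sketch below shows the arithmetic. This is an illustration under stated assumptions, not necessarily the method used for the numbers above: it assumes the Linux msr driver, and it assumes MPERF ticks at the nominal base frequency (2.0 GHz on this part). The helper names are hypothetical.

```python
import os
import struct

IA32_MPERF = 0xE7  # counts at a fixed reference rate while the core is in C0
IA32_APERF = 0xE8  # counts at the actual core clock while the core is in C0


def read_msr(cpu: int, address: int) -> int:
    """Read a 64-bit MSR via /dev/cpu/<cpu>/msr (needs root and the msr module)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, address))[0]
    finally:
        os.close(fd)


def effective_ghz(aperf_delta: int, mperf_delta: int, base_ghz: float) -> float:
    """Average frequency while the core was active over the sampled interval."""
    if mperf_delta <= 0:
        raise ValueError("MPERF did not advance; the core was idle or the samples are swapped")
    return base_ghz * aperf_delta / mperf_delta
```

Sampling both MSRs on each pinned core before and after a kernel run, once with the default instruction set and once with MKL_ENABLE_INSTRUCTIONS=AVX2, should reproduce the 1.7 GHz vs. 2 GHz difference described above.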

Do you know where I can find the details of these frequency policies? Specifically, I'd like to know whether they are statically fixed, modifiable, and/or dependent on some runtime thermal monitoring.

McCalpinJohn

The full tables of maximum Turbo frequencies as a function of core count for Xeon Scalable processors are in the "Intel Xeon Processor Scalable Family Specification Update" (document 336065-005, February 2018).   Today this is available at https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf, but these things move around a lot.

The data for all SKX processors is spread across Figures 1-15.    For the Xeon Gold 6138, the numbers are in

  • Figure 4, for "non-AVX" Turbo frequencies
    • This really means that you are not using 256-bit or 512-bit SIMD instructions.
    • These frequencies apply to code that uses no SIMD instructions.
    • These frequencies also apply to AVX or AVX2 code that only uses "scalar" SIMD instructions or 128-bit SIMD instructions.
    • I have not tested whether these frequencies apply to AVX-512 code that uses only "scalar" SIMD instructions or 128-bit SIMD instructions.
  • Figure 5, for "AVX 2.0" Turbo frequencies -- this really means that you are using 256-bit SIMD instructions.
    • I have not tested whether these frequencies apply to AVX-512 code that uses only 256-bit SIMD instructions.
  • Figure 6, for "AVX-512" Turbo frequencies -- this really means that you are using 512-bit SIMD instructions
  • There may be more surprises in store -- Table 19-3 of Volume 3 of the Intel Architectures SW Developer's Manual (document 325384-067) describes a new performance counter event "CORE_POWER" (Event 0x28) that includes a slightly different split of power levels:
    • Level 0 includes "non-AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes"
    • Level 1 includes "high current AVX 256-bit instructions as well as low current AVX 512-bit instructions"
    • Level 2 includes "high current AVX 512-bit instructions"
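
If cycle counts for the three CORE_POWER levels are available, a quick way to see which level a workload actually spends its time in is to turn them into shares. The sketch below is only that arithmetic. How the counts are collected is an assumption on my part: on Skylake-SP, Linux perf lists sub-events such as core_power.lvl0_turbo_license, core_power.lvl1_turbo_license and core_power.lvl2_turbo_license, but check what your tool exposes.

```python
def license_shares(lvl0: int, lvl1: int, lvl2: int) -> dict[str, float]:
    """Fraction of counted core cycles spent at each CORE_POWER level."""
    total = lvl0 + lvl1 + lvl2
    if total == 0:
        raise ValueError("no cycles counted")
    return {
        "level0": lvl0 / total,
        "level1": lvl1 / total,
        "level2": lvl2 / total,
    }
```

A 512-bit dgemm run would be expected to show most cycles in level 2, while the same run restricted to AVX2 should shift them toward level 1.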

In each of the tables, there is a "Base frequency", which is essentially the frequency that Intel guarantees for the most power-hungry code using any number of cores for that SIMD type.   For the Xeon Gold 6138, these tables show:

  • "non-AVX": Base Frequency = 2.0 GHz, Max All-Core Turbo Frequency = 2.7 GHz
  • "AVX 2.0": Base Frequency = 1.6 GHz, Max All-Core Turbo Frequency = 2.3 GHz
  • "AVX-512":  Base Frequency = 1.3 GHz, Max All-Core Turbo Frequency = 1.9 GHz

The power control unit in the processor will attempt to run each core at the fastest frequency that it can, given the number of "active" processors (including C0 and C1 states), and the SIMD width in use by each core.   This is subject to three additional limitations:

  • Frequencies will be reduced if the processor reaches its maximum temperature.
    • This should be extremely rare.
    • If it happens frequently, you probably have a bad heat-sink installation, a bad fan, or blocked air-flow, and you should fix it.
  • Frequencies will be reduced to prevent the processor from exceeding its running average power limit.
    • This is extremely common, and the cumulative duration of frequency reduction due to this mechanism can be tracked using the RAPL MSR_PKG_PERF_STATUS (MSR 0x613).
    • Because of the non-linear relationship between power consumption and frequency, it is actually relatively easy to hit this limit even if you are not using all cores.
  • Frequencies will be reduced to prevent the processor from exceeding its maximum current limit.
    • This is not well documented, but there are hints in the description of the various bits of the MSR_CORE_PERF_LIMIT_REASONS register (MSR 0x64f).
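
To put a number on the power-limit case, here is a minimal sketch that converts two readings of MSR_PKG_PERF_STATUS into seconds of throttling. It assumes the layout described in the SDM: the accumulated throttled time is in bits 31:0, and its unit is 1/2^TU seconds, where TU is bits 19:16 of MSR_RAPL_POWER_UNIT (MSR 0x606). The function names are hypothetical, and the raw MSR values are expected to be read separately (for example through /dev/cpu/N/msr).

```python
MSR_RAPL_POWER_UNIT = 0x606
MSR_PKG_PERF_STATUS = 0x613


def time_unit_seconds(power_unit_msr: int) -> float:
    """RAPL time unit in seconds: 1 / 2**TU, with TU in bits 19:16."""
    tu = (power_unit_msr >> 16) & 0xF
    return 1.0 / (1 << tu)


def throttled_seconds(status_start: int, status_end: int, power_unit_msr: int) -> float:
    """Throttled time accumulated between two MSR_PKG_PERF_STATUS readings.

    Only bits 31:0 are used; the subtraction is done modulo 2**32 so a
    single counter wraparound is handled.
    """
    ticks = ((status_end & 0xFFFFFFFF) - (status_start & 0xFFFFFFFF)) & 0xFFFFFFFF
    return ticks * time_unit_seconds(power_unit_msr)
```

Note that, per the SDM, this value is cumulative across the cores in the package, so it can exceed the wall-clock duration of the run.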

So your results look quite reasonable:

  • Your 20-core AVX512 test running at 1.7 GHz is well above the minimum guaranteed frequency of 1.3 GHz, and somewhat below the maximum all-core AVX512 Turbo frequency of 1.9 GHz. 
    • This seems typical.  We have about 3500 Xeon Platinum 8160 processors, and none of them drop down all the way to the minimum guaranteed AVX-512 frequency (1.4 GHz) when running compute-intensive workloads.  We see a continuous distribution of average frequencies from ~1.55 GHz to almost 1.8 GHz (vs a max all-core AVX512 Turbo frequency of 2.0 GHz).
  • Your 20-core AVX2 test requested 2.0 GHz and got it.  This is well above the minimum guaranteed frequency of 1.6 GHz, and well below the maximum all-core AVX2 Turbo frequency of 2.3 GHz.  

 

antón_r_

Great. This is exactly the information we need to conduct reliable power and performance modeling.

Thank you very much for your quick and detailed reply.
