cpufreq-driver cannot adjust per-core frequency and how to adjust it with MSR?

hiratz
Novice
7,343 Views

Hi,

I have recently been trying to adjust the frequency of each logical CPU on an Intel Xeon Broadwell-EP 8-core/16-thread processor (Ubuntu 16.04, HP ProLiant Gen9, 2.1 GHz base frequency, 1.2 GHz minimum frequency), but I have observed some behavior I cannot explain:

(Note: Turbo Boost is disabled for both 1) and 2) below.)

1) If I disable the intel_pstate driver through the GRUB kernel command line, Linux uses pcc-cpufreq (with BIOS power management set to "Balanced Power and Performance" or "Maximum Performance") or acpi-cpufreq (with BIOS power management set to "OS Control Mode"). With either driver, the frequency can be "adjusted" (not really, as explained below) with the following two methods (taking cpu1 as an example; its cpufreq directory is /sys/devices/system/cpu/cpu1/cpufreq/):

   a)  set scaling_governor to "userspace" and write a legal frequency value (say 1800000) to the file "scaling_setspeed" (the unit is kHz, so 1800000 means 1.8 GHz)

  or b) set scaling_governor to "performance" and write a legal frequency value to the file "scaling_max_freq"

  If we run "watch grep MHz /proc/cpuinfo", both a) and b) show that cpu1 has the new frequency of 1800 MHz (the other CPUs keep the original 2100 MHz). However, the run time of my test program stays the same no matter which frequency value is written to the above files. In other words, /proc/cpuinfo shows that cpu1's frequency has changed, but the test program's execution time does not reflect any change!

2) If I enable intel_pstate in GRUB and use method b) above (method a) cannot be used now, because there is no "userspace" governor when "scaling_driver" is "intel_pstate") to change cpu1's maximum frequency, the test program's run time changes significantly across different frequency values.

Turbo Boost is disabled in both 1) and 2) above. If I enable Turbo Boost in the BIOS, the cpufreq driver no longer seems to control cpu1's frequency very accurately.
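Methods a) and b) above amount to a pair of sysfs writes each. The following is only a minimal Python sketch of them, not part of the original test setup: the helper names and the `root` parameter (added so the code can be pointed at a scratch directory) are made up for illustration, while the file names and the kHz unit are the ones described above. Writing to the real sysfs files needs root privileges.

```python
import os

SYSFS_CPU_ROOT = "/sys/devices/system/cpu"


def cpufreq_file(cpu, name, root=SYSFS_CPU_ROOT):
    """Return the path of one cpufreq attribute of a logical CPU."""
    return os.path.join(root, f"cpu{cpu}", "cpufreq", name)


def write_attr(cpu, name, value, root=SYSFS_CPU_ROOT):
    """Write a single value to a cpufreq attribute file."""
    with open(cpufreq_file(cpu, name, root), "w") as f:
        f.write(f"{value}\n")


def set_fixed_speed(cpu, khz, root=SYSFS_CPU_ROOT):
    """Method a): "userspace" governor, then scaling_setspeed (value in kHz)."""
    write_attr(cpu, "scaling_governor", "userspace", root)
    write_attr(cpu, "scaling_setspeed", khz, root)


def cap_max_speed(cpu, khz, root=SYSFS_CPU_ROOT):
    """Method b): "performance" governor, then scaling_max_freq (value in kHz)."""
    write_attr(cpu, "scaling_governor", "performance", root)
    write_attr(cpu, "scaling_max_freq", khz, root)
```

For example, `set_fixed_speed(1, 1_800_000)` corresponds to method a) for cpu1 at 1.8 GHz. The sketch only reproduces the writes; it does not verify that the core actually runs at the requested frequency.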

Then I looked into the code in intel_pstate.c and found that it uses a feature called "HWP". I now have several questions that confuse me a lot, and I hope someone can help explain them:

1  Given the above observations, why can intel_pstate really change a CPU's frequency while the other cpufreq drivers cannot? (By "really", I mean that the effect of the change can be observed by running a test program.)

2  With Turbo Boost enabled, is each core's frequency controlled entirely by hardware, so that the OS can do nothing? (Or, as some discussions suggest, are only the frequencies above the base frequency controlled by hardware, while the lower frequencies are still controlled by the OS's cpufreq driver?)

The Intel manual (SDM Volume 3B, 2016) describes several different techniques for adjusting frequency and/or voltage, such as:

   (1) Chapter 14.3.2 "System Software Interfaces for Opportunistic Processor Performance Operation"

        It uses MSRs called "IA32_PERF_CTL" and "IA32_PERF_STATUS".

   (2) Chapter 14.3.3 "Intel® Turbo Boost Technology"

   (3) Chapter 14.4 "HARDWARE-CONTROLLED PERFORMANCE STATES (HWP)"

        HWP seems to be a very complex control mechanism that uses many MSRs.

   So what is the difference among them (especially between using "IA32_PERF_CTL" and using an HWP MSR such as "IA32_HWP_REQUEST")?

   Has "IA32_PERF_CTL" been deprecated on recent processors (replaced by HWP)?

In the following post, "https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/746548?language=en-us&https=1", John mentioned that he used the IA32_HWP_REQUEST register to control the frequency directly, but Chapter 14.4 shows that this register does not provide a field that can be filled with a frequency value. So how could that be done?

I need to dynamically lower each logical CPU's frequency (with Turbo Boost disabled) from my kernel module by writing some MSR, and then observe system performance under multiprogramming. But I am confused by the many methods listed above and do not know which one I should use. BTW, I am not going to use the intel_pstate mechanism because it adjusts both frequency and voltage; I only need to adjust the frequency.

Any help is really appreciated!

1 Solution
McCalpinJohn
Honored Contributor III
7,343 Views

I have only used HWP on SKX processors, not on Broadwell, but the principles should be the same.

With HWP disabled, you don't have to have the userspace governor to set a specific frequency -- you can use the "cpupower frequency-set" command and simply set the minimum and maximum frequencies to the same value, e.g.,

cpupower frequency-set --min 1800MHz --max 1800MHz

(I can't remember if this works correctly -- I don't have any systems with HWP disabled and OS frequency control enabled to test at the moment.)

---- IMPORTANT INFORMATION !!!! -----

  • No matter what frequencies you request, processor cores typically drop to the "maximum efficiency" frequency when idle. 
    • For the Sandy Bridge through Haswell Xeon processors that I have looked at, this is 1.2 GHz, while for Skylake Xeon processors it is 1.0 GHz.  
  • When you start running on a core, it starts at the "maximum efficiency" frequency, then ramps to a frequency in the range of the specified minimum to maximum (provided that all the other constraints have been met).  
    • The time required to ramp to full speed depends on other settings, including the energy-performance bias (which is another twisted story that I don't have time to go into today). 
  • Lots of utilities -- including "cpupower frequency-info" -- do not keep the processor active long enough to get a reliable frequency measurement. 
  • To make it worse, "cpupower frequency-info" measures the frequency on core 0, even if the "cpupower" application is running on a different core.  If core 0 is not busy running some other process, the command will probably return a frequency very close to the "maximum efficiency" frequency. 

To get reasonably accurate frequency measurements, I recommend running something that has an execution time of at least several seconds and using "perf stat" to get the average frequency.

----------------------------------------------------------

Now back to HWP....

If HWP has been enabled (IA32_PM_ENABLE, MSR 0x770 set to 1), then it overrides all of the legacy interfaces (most importantly, IA32_PERF_CTL).  Once HWP has been enabled, it cannot be disabled without a processor RESET (reboot).   Unfortunately, the interaction between HWP and the legacy HW and SW interfaces is a mess.  I think that IA32_PERF_STATUS still gives the current (instantaneous) frequency, but the act of crossing into the kernel to read the MSR is enough to cause the frequency to change, so this is not a very useful feature. (Intel recommends computing the average frequency over intervals, rather than looking at the instantaneous frequency.)

Because of this "one-time" enable feature, a typical configuration is for the BIOS to recognize that the processor supports HWP, but to refrain from enabling it.  As you have seen in the intel_pstate.c driver, the default behavior of intel_pstate is to enable HWP if it is supported (as reported by CPUID -- section 14.4.1 of Volume 3 of the SDM).
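Whether HWP has already been switched on can be checked by reading IA32_PM_ENABLE (MSR 0x770) and testing bit 0. The following is a minimal Python sketch, assuming the Linux msr driver is loaded ("modprobe msr") and the script runs as root; the function names and the `dev_template` parameter are made up for illustration. On a processor without HWP the read is expected to fail with an OSError rather than return a value.

```python
import os
import struct

IA32_PM_ENABLE = 0x770  # MSR address quoted in the post above


def read_msr(cpu, address, dev_template="/dev/cpu/{cpu}/msr"):
    """Read one 64-bit MSR of a logical CPU through the Linux msr driver.

    The msr device uses the file offset as the MSR address.
    """
    fd = os.open(dev_template.format(cpu=cpu), os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, address)
    finally:
        os.close(fd)
    return struct.unpack("<Q", raw)[0]


def hwp_enabled(cpu, dev_template="/dev/cpu/{cpu}/msr"):
    """Return True if bit 0 of IA32_PM_ENABLE is set on this logical CPU."""
    return bool(read_msr(cpu, IA32_PM_ENABLE, dev_template) & 1)
```

Calling `hwp_enabled(0)` before and after loading intel_pstate would show whether the driver, rather than the BIOS, was the one to enable HWP.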

Once enabled, the "cpupower frequency-set" command will set the HWP registers.  With HWP enabled, there are no longer software "governors", but using "cpupower frequency-set --governor=[performance,powersave]" will result in different settings in the IA32_HWP_REQUEST register that roughly correspond to what you would expect from the names.  There are a lot of features in HWP that can't be selected using the "cpupower" utilities, which is one reason why I built my own driver.  

The other reason I built my own driver is that every version of Linux that I have looked at is BROKEN and sets the wrong values in the IA32_HWP_REQUEST registers.   For example, on a system running CentOS 7.4 (kernel 3.10.0-693.17.1), the cpupower command above sets the IA32_HWP_REQUEST minimum and maximum frequencies to 2100, rather than the 1800 I explicitly requested.  Requesting 1000 MHz for minimum and maximum results in setting the frequency to 1500 MHz.   I don't expect the Linux kernel guys to be gurus in higher math, but this is embarrassing.   (Similar issues apply to the /sys/devices/system/cpu/intel_pstate/* interfaces, which also can't do linear transformations correctly.)

The IA32_HWP_REQUEST MSR definitely has fields for frequency requests.  Just like IA32_PERF_CTL, the IA32_HWP_REQUEST MSR uses 8-bit fields to hold core frequency multipliers relative to the 100 MHz reference clock.   The difference is that IA32_PERF_CTL can only request one value, while IA32_HWP_REQUEST allows you to specify separate values for minimum, maximum, and "desired" ratios.   To request a single frequency, simply set the minimum and maximum values to be the same.   For example, to set the frequency to 1800 MHz, the ratio should be 0x12, and the IA32_HWP_REQUEST MSR on each core should be set to 0x00001212.   The bit fields set are:

  • Use the default values for all of the high-order fields (bits 63:32)
  • Energy-Performance Preference (bits 31:24) set to zero (maximum performance)
  • Desired Performance (bits 23:16) set to zero (let the hardware decide)
  • Maximum Performance (bits 15:8) set to 0x12 (decimal 18) for 1.8 GHz max frequency
  • Minimum Performance (bits 7:0) set to 0x12 (decimal 18) for 1.8 GHz min frequency
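The field layout listed above can be packed and unpacked with a few shifts. The following is a minimal Python sketch with made-up function names. It follows the post in treating each performance field as a multiplier of the 100 MHz reference clock, and it leaves bits 63:32 at zero, which is taken here to be the "default values" mentioned above.

```python
def hwp_request(min_mhz, max_mhz, desired_mhz=0, epp=0, ref_mhz=100):
    """Build the low 32 bits of IA32_HWP_REQUEST from frequencies in MHz."""

    def ratio(mhz):
        # Each field is 8 bits wide and holds a multiple of the reference clock.
        value, remainder = divmod(mhz, ref_mhz)
        if remainder or not 0 <= value <= 0xFF:
            raise ValueError(f"{mhz} MHz is not an 8-bit multiple of {ref_mhz} MHz")
        return value

    if not 0 <= epp <= 0xFF:
        raise ValueError("energy-performance preference must fit in 8 bits")
    return (
        (epp << 24)                    # bits 31:24, 0 = maximum performance
        | (ratio(desired_mhz) << 16)   # bits 23:16, 0 = let the hardware decide
        | (ratio(max_mhz) << 8)        # bits 15:8
        | ratio(min_mhz)               # bits 7:0
    )


def decode_hwp_request(value, ref_mhz=100):
    """Split the low 32 bits of IA32_HWP_REQUEST back into its fields."""
    return {
        "min_mhz": (value & 0xFF) * ref_mhz,
        "max_mhz": ((value >> 8) & 0xFF) * ref_mhz,
        "desired_mhz": ((value >> 16) & 0xFF) * ref_mhz,
        "epp": (value >> 24) & 0xFF,
    }
```

With these helpers, `hex(hwp_request(1800, 1800))` gives `0x1212`, matching the 0x00001212 value quoted above.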

After writing this value on all cores, I ran a short single-threaded STREAM test with "perf stat", which reported an average frequency of exactly 1.8 GHz, as desired:

# perf stat taskset -c 1 ./stream.exe.uni
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------

[...]
Array size = 80000000 (elements), Offset = 0 (elements)
Memory per array = 610.4 MiB (= 0.6 GiB).
Total memory required = 1831.1 MiB (= 1.8 GiB).
Each kernel will be executed 10 times.
[...]
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:        9219.4536       0.1429       0.1388       0.1728
Scale:       9063.4223       0.1489       0.1412       0.2058
Add:        11376.9066       0.1736       0.1688       0.2028
Triad:      11391.3420       0.1693       0.1685       0.1703
-------------------------------------------------------------
[...]

 Performance counter stats for 'taskset -c 1 ./stream.exe.uni':

       7827.626624      task-clock (msec)         #    0.997 CPUs utilized          
                73      context-switches          #    0.009 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
             6,499      page-faults               #    0.830 K/sec                  
    14,089,539,729      cycles                    #    1.800 GHz                    
     9,465,165,132      instructions              #    0.67  insn per cycle         
     1,359,730,774      branches                  #  173.709 M/sec                  
           169,589      branch-misses             #    0.01% of all branches        

       7.849085849 seconds time elapsed
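The "1.800 GHz" figure that perf prints next to the cycle count is the cycle count divided by the task's CPU time. The following is a small Python sketch of that arithmetic (the function name is made up for illustration), which can be applied to any "perf stat" output that reports both "cycles" and "task-clock":

```python
def average_ghz(cycles, task_clock_msec):
    """Average core frequency in GHz while the task was running.

    cycles / seconds gives Hz; dividing by 1e9 gives GHz, so the
    combined divisor for a task-clock in milliseconds is 1e6.
    """
    return cycles / (task_clock_msec * 1e6)
```

With the counts from the run above, `average_ghz(14_089_539_729, 7827.626624)` comes out at about 1.79998, which rounds to the 1.800 GHz shown by perf.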

 

hiratz
Novice
7,343 Views

Excellent explanations!  Thank you so much, John!

Building our own driver is definitely a good approach because it is not affected by changes in the Linux kernel. (I also use my own driver, which is why I asked here how to adjust the frequency through MSRs. Most of the time I use my driver to test, observe, verify, or evaluate processor features instead of the commonly used tools/utilities, except when I just want to quickly observe something simple.)

Unfortunately, the "cpupower" utility does not work on my current system: it reports that it does not match my current kernel version ... so I cannot tell what effect it has.

From your description, it seems that only cpupower and intel_pstate can use HWP, while drivers such as "acpi-cpufreq" or "pcc-cpufreq" cannot. Is this the case?

If I understand correctly, when HWP is enabled, cpupower and intel_pstate perform similar underlying actions: both use HWP instead of the legacy IA32_PERF_CTL, although intel_pstate may do more than cpupower. But when HWP is disabled, all of these drivers, including others such as "acpi-cpufreq" or "pcc-cpufreq" (https://wiki.archlinux.org/index.php/CPU_frequency_scaling), use the legacy IA32_PERF_CTL, right? (However, I looked into source files such as cpufreq.c, cpufreq_governor.c, and cpufreq_performance.c, but found no operations related to IA32_PERF_CTL or the HWP MSRs.)

In theory, I think any user could build a driver that implements all the functions of both intel_pstate and cpupower, right?

The last example shows exactly how to achieve what I want. It's really good! I'll try it on my computer soon and see how it works.

 

Best

akostadinov
Beginner
826 Views

I know this is an old thread, but for anybody who came here via search, there is a kernel tool to control these states.

 

https://www.kernel.org/doc/html/latest/admin-guide/pm/intel-speed-select.html

hiratz
Novice
7,343 Views

Update: Just found Broadwell(-EP) does not support HWP (Intel Speed Shift) ( CPUID.06H:EAX[bit 7] is not set). So I have to consider the legacy IA32_PERF_CTL.
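On Linux, the same check can be done without executing CPUID by hand. The following is a minimal Python sketch with made-up function names, assuming the kernel reports CPUID.06H:EAX[bit 7] as the "hwp" entry on the "flags" line of /proc/cpuinfo:

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags of the first CPU listed in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def supports_hwp(cpuinfo_text):
    """True if the "hwp" flag (assumed to mirror CPUID.06H:EAX[7]) is present."""
    return "hwp" in cpu_flags(cpuinfo_text)


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo text; kept separate so the parser can be fed any string."""
    with open(path) as f:
        return f.read()
```

On a machine like the Broadwell-EP system described here, `supports_hwp(read_cpuinfo())` would be expected to return False.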

McCalpinJohn
Honored Contributor III
7,343 Views

If I remember correctly, if you disable the intel-pstate driver (with a kernel boot option), the system will default to acpi-cpufreq.   The acpi-cpufreq driver does not enable HWP, and so the legacy interfaces remain in operation.

"cpupower" is a command-line utility that will attempt to interface with whatever kernel frequency control mechanism is active, but I have never tried to use it with acpi-cpufreq -- I always used the /sys/devices/system/cpu/cpu*/cpufreq/* interfaces instead.   Now that I have gotten used to the new HWP interface, I prefer it to the old approach.  As I noted before, the user-level configuration tools appear to be horribly hacked to work with HWP, so I avoid them entirely and use my own command-line tool to set the HWP registers.

For my kernel (3.10.0-693), the file (kernel_source)/drivers/cpufreq/acpi-cpufreq.c uses MSR_IA32_PERF_CTL for frequency control.
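For the legacy path, a request to IA32_PERF_CTL (MSR 0x199) can be sketched in the same way. This is only a minimal Python sketch under stated assumptions: the function names and `dev_template` are made up, and placing the 100 MHz multiplier in bits 15:8 is an assumption based on how Linux drivers encode the value on recent Intel cores; the SDM leaves the exact encoding model-specific. It needs root and the msr driver, and it only applies while HWP is not enabled.

```python
import os
import struct

IA32_PERF_CTL = 0x199


def perf_ctl_value(target_mhz, ref_mhz=100):
    """Encode a target frequency as a ratio in bits 15:8 (assumed layout)."""
    ratio, remainder = divmod(target_mhz, ref_mhz)
    if remainder or not 0 <= ratio <= 0xFF:
        raise ValueError(f"{target_mhz} MHz is not an 8-bit multiple of {ref_mhz} MHz")
    return ratio << 8


def write_msr(cpu, address, value, dev_template="/dev/cpu/{cpu}/msr"):
    """Write one 64-bit MSR of a logical CPU through the Linux msr driver."""
    fd = os.open(dev_template.format(cpu=cpu), os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), address)
    finally:
        os.close(fd)
```

A request for 1.8 GHz on cpu1 would then be `write_msr(1, IA32_PERF_CTL, perf_ctl_value(1800))`. As with the other interfaces, the result should be confirmed with a measurement that runs for several seconds rather than assumed from the written value.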

hiratz
Novice
7,343 Views

Yes, you are right. I missed the file acpi-cpufreq.c, which does use "MSR_IA32_PERF_CTL".

Basically, I have no more questions. Thank you again!

Best
