rdpmc usage from inside Cloud Native Container

jagdish1
Beginner

I want to check whether an application running inside a cloud-native container can execute the rdpmc instruction. The container is running on an x86 platform. Any input is much appreciated.

If it is not possible, is there an alternative to rdpmc?

McCalpinJohn
Honored Contributor III

Permission for user code to execute the RDPMC instruction is controlled by a hardware configuration bit (CR4.PCE, "Performance Counter Enable").  Recent versions of Linux (kernel >= 4) disable user-mode RDPMC by default and only enable it for a process that has opened and mapped an event through the "perf events" interface.  On the 4.18 kernels that we are running on our newest production systems, the older behavior can be restored (until the next boot) by executing "echo 2 > /sys/devices/cpu/rdpmc" (as root).  In that file, 0 disables user-mode RDPMC, 1 (the default) allows it only for processes with an active mapped perf event, and 2 allows it for all processes.

The developers of the Linux kernel want all performance counter access to go through the "perf events" subsystem, and are slowly but systematically making it harder to access any other way.
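For the original question about an alternative to rdpmc, here is a minimal sketch of counting retired instructions through the "perf events" system call instead. The helper names (`instructions_attr`, `open_counter`, `count_instructions`) are made up for illustration, and the sketch assumes the container is allowed to call perf_event_open at all, which a seccomp profile or the kernel.perf_event_paranoid setting may prevent.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Describe a user-mode-only "instructions retired" event (hypothetical helper). */
struct perf_event_attr instructions_attr(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;        /* start stopped; enabled explicitly below */
    attr.exclude_kernel = 1;  /* count user-mode instructions only */
    attr.exclude_hv = 1;
    return attr;
}

/* glibc provides no wrapper for perf_event_open, so go through syscall(2).
 * pid = 0 and cpu = -1 mean "the calling thread, on whichever CPU it runs". */
int open_counter(struct perf_event_attr *attr)
{
    return (int)syscall(SYS_perf_event_open, attr, 0, -1, -1, 0UL);
}

/* Count instructions retired while fn() runs; returns -1 on any failure. */
int64_t count_instructions(void (*fn)(void))
{
    struct perf_event_attr attr = instructions_attr();
    int fd = open_counter(&attr);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    ssize_t n = read(fd, &count, sizeof(count));
    close(fd);
    return n == (ssize_t)sizeof(count) ? (int64_t)count : -1;
}
```

Calling `count_instructions(my_function)` from a test program and checking for -1 is a quick way to find out whether the container permits this path before investing in RDPMC.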

Performance counter libraries such as likwid (https://github.com/RRZE-HPC/likwid) or PAPI (https://github.com/icl-utk-edu/papi) support access by the "perf events" interface.  Because RDPMC reads logical-processor-specific register values that are not part of the standard process context, it is only safe to use if the process is bound to a single logical processor.  My understanding is that the underlying "perf events" interface may or may not use the user-level RDPMC instruction to get the counter values (the alternative is entering the kernel and reading the counters with the RDMSR instruction), but I don't understand the implementation(s).
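As a rough sketch of the binding requirement, the code below pins the calling thread to one logical processor before reading a fixed-function counter with RDPMC. The function names are hypothetical, the selector encoding (bit 30 set, counter number in the low bits) is the one used elsewhere in this thread, and the read only works if CR4.PCE is set and the counter has been enabled.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdint.h>

/* ECX value for RDPMC: bit 30 selects the fixed-function counters,
 * the low bits select which one (0 = instructions retired). */
uint32_t rdpmc_fixed_selector(unsigned n)
{
    assert(n < 4);
    return (1u << 30) | n;
}

/* RDPMC returns the low 32 bits in EAX and the high bits in EDX. */
uint64_t combine_halves(uint32_t low, uint32_t high)
{
    return ((uint64_t)high << 32) | low;
}

/* Bind the calling thread to one logical processor; returns 0 on success. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

#if defined(__x86_64__) || defined(__i386__)
/* Read fixed-function counter n on the current logical processor.
 * If CR4.PCE is clear the instruction faults and the process gets SIGSEGV. */
uint64_t read_fixed_counter(unsigned n)
{
    uint32_t low, high;
    __asm__ __volatile__("rdpmc"
                         : "=a"(low), "=d"(high)
                         : "c"(rdpmc_fixed_selector(n)));
    return combine_halves(low, high);
}
#endif
```

A typical use would be `pin_to_cpu(0)` once at startup, then two calls to `read_fixed_counter(0)` around the code of interest and a subtraction of the two values.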

Deepak123
Beginner

Hello John,

Thank you for your response and valuable suggestion. Jagdish and I work on the same team.

Following your advice, I executed "echo 2 > /sys/devices/cpu/rdpmc" as root on the cloud container, which enabled me to execute the `rdpmc` instruction using `__asm__ __volatile__("rdpmc" : "=a"(low), "=d"(high) : "c"(counter))`.

From other Intel community threads on RDPMC, I understand that passing `1<<30` in `ecx` selects fixed-function counter 0, so the `eax` and `edx` results together hold the count of retired instructions.

However, I encountered an issue. When executing `__asm__ __volatile__("rdpmc" : "=a"(low), "=d"(high) : "c"(1<<30))` in a loop, I obtained the same values for `low` and `high` on every read: `low: 3465705966`, `high: 42833`. The counter does not appear to advance after the initial read.

Moreover, `(1<<30) + 1` and `(1<<30) + 2` both returned zero for `low` and `high`.

I am running this on an Intel Xeon Gold 6338N.

I would greatly appreciate any further insights you may have on this matter.

Looking forward to your response.

Thank you and best regards,
Deepak

jagdish1
Beginner

Hello John,

Thanks very much for your input.

After executing "echo 2 > /sys/devices/cpu/rdpmc", I am now able to call rdpmc in the program.

Unfortunately, "__asm__ __volatile__("rdpmc" : "=a"(low), "=d"(high) : "c"(1<<30))" returns 0 for both low and high.

Could this be a problem with the hardware we use (Intel(R) Xeon(R) Gold 6338N CPU), or is it something else? Are there registers that must be set to enable the counter?

Any suggestions on what to check next? Are there documents I could refer to, or should I contact the hardware team?

Thanks,
Jagdish

McCalpinJohn
Honored Contributor III

The fixed-function performance counters depend on two MSRs to be enabled and configured.

  • MSR 0x38d IA32_FIXED_CTR_CTRL has a 4-bit field for each of the 3 or 4 fixed-function counters.
    • bit 0: Enable fixed-function counter 0 (Instructions Retired) to count in kernel mode
    • bit 1: Enable fixed-function counter 0 to count in user mode
    • bit 2: "Anythread" -- if 1, increment counter for activity in any logical processor running on this core, otherwise only increment for the specific logical processor that programmed the MSR
    • bit 3: Enable fixed-function counter 0 to generate an interrupt on overflow
    • bits 4-7: same controls for fixed-function counter 1 (Core Clocks Unhalted)
    • bits 8-11: same controls for fixed-function counter 2 (Reference Clocks Unhalted)
    • bits 12-15: same controls for fixed-function counter 3 (Topdown slots) (only supported on very recent processors)
  • MSR 0x38f IA32_PERF_GLOBAL_CTRL
    • bits 7:0 are set to enable programmable counters 7:0
      • Only bits 3:0 can be set on configurations that only have 4 PMCs per logical processor -- e.g., processors before ICX when running with HyperThreading enabled.
    • bit 32: set to 1 to enable fixed counter 0
    • bit 33: set to 1 to enable fixed counter 1
    • bit 34: set to 1 to enable fixed counter 2
    • bit 35: set to 1 to enable fixed counter 3 (if supported)
    • For SKX processors, you want either 0x7000000ff or 0x70000000f in this MSR to enable all three fixed-function counters and all the programmable counters (8 without HyperThreading, 4 with HyperThreading).

Operating systems (and hypervisors) set their own policies for how these MSRs are configured.  Most commonly I see all of the counters enabled in IA32_PERF_GLOBAL_CTRL (MSR 0x38f), but the fixed-function counters are frequently disabled in IA32_FIXED_CTR_CTRL (MSR 0x38d).  In particular, the "NMI Watchdog" often uses one of the fixed-function counters with interrupt-on-overflow enabled to help identify processors that are "hung" and non-responsive.  The NMI watchdog can be disabled with no ill effects ("echo 0 > /proc/sys/kernel/nmi_watchdog"), after which IA32_FIXED_CTR_CTRL can be set to the desired value (for example, 0x333 enables fixed counters 0, 1, and 2 to count in both user and kernel mode).
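To make the bit layout concrete, here is a small sketch that builds values for the two MSRs from named flags. The enum and function names are made up for illustration; the bit positions are the ones listed above (4 bits per fixed counter in MSR 0x38d, programmable-counter enables in the low bits of MSR 0x38f, fixed-counter enables starting at bit 32).

```c
#include <stdint.h>

/* Per-counter control bits inside IA32_FIXED_CTR_CTRL (hypothetical names). */
enum {
    FIXED_EN_KERNEL = 1 << 0, /* count in kernel mode */
    FIXED_EN_USER   = 1 << 1, /* count in user mode */
    FIXED_ANYTHREAD = 1 << 2, /* count for any logical processor on the core */
    FIXED_PMI       = 1 << 3  /* interrupt on overflow */
};

/* Place a 4-bit control field at the position for fixed counter `counter`. */
uint64_t fixed_ctr_ctrl_field(unsigned counter, unsigned flags)
{
    return (uint64_t)(flags & 0xFu) << (4 * counter);
}

/* IA32_PERF_GLOBAL_CTRL value enabling the first num_pmcs programmable
 * counters (at most 8 here) and the first num_fixed fixed counters. */
uint64_t perf_global_ctrl(unsigned num_pmcs, unsigned num_fixed)
{
    uint64_t pmc_bits = (1ULL << num_pmcs) - 1;
    uint64_t fixed_bits = ((1ULL << num_fixed) - 1) << 32;
    return pmc_bits | fixed_bits;
}
```

With these helpers, OR-ing `fixed_ctr_ctrl_field(n, FIXED_EN_KERNEL | FIXED_EN_USER)` for n = 0, 1, 2 gives 0x333, and `perf_global_ctrl(8, 3)` gives the value with bits 7:0 and 34:32 set. Going the other way, a value such as 0x707 decodes to kernel, user, and AnyThread counting on fixed counters 0 and 2, with fixed counter 1 left disabled.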

jagdish1
Beginner

Hello John,

As suggested, I used msrtools (now ver 1.3) and ran "wrmsr -a 0x38d 0x707". The rdpmc call now returns values for Instructions Retired, which is what we were trying to achieve. Thank you very much for your help on this.

On a related note, while exploring msrtools, I was advised the following:

The tool itself looks old and is no longer maintained. The repository https://github.com/intel/msr-tools is set to read-only, and Intel is not accepting new pull requests. Fedora still provides a package, but it appears to have been failing to build since f38, so Fedora will likely drop the package at some point.

The recommendation is to start looking for an alternative tool to replace msr-tools in your testing.

So my questions are:

1) Is there any way for a process to write the value 0x707 to register 0x38d, or is msrtools the only option available?

2) Is the above alert accurate, and is there an alternative tool to replace msr-tools that I can explore?

Thanks,
Jagdish

McCalpinJohn
Honored Contributor III

Reading and writing MSRs requires executing the RDMSR and WRMSR instructions, and these can only be executed in kernel mode.  "msr-tools" uses the existing /dev/cpu/*/msr device-file interface to request that the kernel execute the desired RDMSR/WRMSR instructions.  There is no need to use msr-tools -- most of my performance monitoring programs open one or more of the /dev/cpu/*/msr device files and use "pread()" and "pwrite()" calls to make the requests.   There are many examples of this access mode in https://github.com/jdmccalpin/periodic-performance-counters/blob/master/perf_counters.c.

Executables accessing the /dev/cpu/*/msr device files need to be run by root or have the setuid root attribute.  In the latter case, some versions of Linux also require that the executable be tagged with a "capability".  Unfortunately it is extremely difficult to find documentation on capabilities.  By trial and error I figured out that I needed something like the following to set the capabilities for reading/writing MSRs through the /dev/cpu/*/msr interface and for reading /proc/self/pagemap (to perform virtual to physical address translations):
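A minimal sketch of that pread()/pwrite() access pattern follows. The helper names are made up for illustration and are not taken from the linked repository. It assumes the msr kernel module is loaded and that the process has the required privileges; the MSR number is passed as the file offset and each transfer is exactly 8 bytes.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Format the per-CPU MSR device path, e.g. "/dev/cpu/3/msr". */
int msr_path(int cpu, char *buf, size_t len)
{
    return snprintf(buf, len, "/dev/cpu/%d/msr", cpu);
}

/* Read one 64-bit MSR on the given logical processor; 0 on success, -1 on failure. */
int read_msr(int cpu, uint32_t msr, uint64_t *value)
{
    char path[64];
    msr_path(cpu, path, sizeof(path));
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    /* The file offset selects the MSR number. */
    ssize_t n = pread(fd, value, sizeof(*value), (off_t)msr);
    close(fd);
    return n == (ssize_t)sizeof(*value) ? 0 : -1;
}

/* Write one 64-bit MSR on the given logical processor; 0 on success, -1 on failure. */
int write_msr(int cpu, uint32_t msr, uint64_t value)
{
    char path[64];
    msr_path(cpu, path, sizeof(path));
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = pwrite(fd, &value, sizeof(value), (off_t)msr);
    close(fd);
    return n == (ssize_t)sizeof(value) ? 0 : -1;
}
```

For example, `write_msr(cpu, 0x38d, 0x333)` called once per logical processor would do roughly what "wrmsr -a 0x38d 0x333" does. Opening the file on every call keeps the sketch short; a real tool would more likely keep one descriptor open per processor.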

setcap cap_sys_admin,cap_sys_rawio+ep <executable>

I would not be surprised if the Linux kernel folks deprecated the /dev/cpu/*/msr interface at some point -- they seem dedicated to making it harder for me to access the hardware.  In that case you would need to create your own kernel module to execute the RDMSR & WRMSR instructions.  
