Solved: Configuring the PMU to read Performance Counters

Singh__Nikhilesh · 03-28-2019

I am using Linux 4.19.2 (Ubuntu 16.04) on an Intel i7-3770 CPU. I have used tools like perf for performance monitoring. I want to write a basic piece of code to read performance counters without using any such tools.

I came across the rdpmc instruction. Before using it, how can I configure the registers to count a particular event? I want to read from kernel code itself, so there's no issue of privileges. What is the easiest way to do it?

Also, is rdpmc the lowest-overhead way to do this?

McCalpinJohn · 03-29-2019

(1) From user-space, RDPMC has *much* lower latency than any other way of getting the performance counters. RDPMC and RDMSR probably have similar raw latencies from kernel code, but you have to do some additional work to avoid much larger overheads in the required code.

The larger overheads come from the fact that the performance counters are logical-processor-level, non-virtualized hardware resources. Differences in counts will only make sense if the "before" and "after" counts were obtained by executing RDPMC on the same logical processor.

The kernel has an interface "rdmsr_on_cpu()" (defined in arch/x86/lib/msr-smp.c in my kernel) that handles all the ugly details of setting up an inter-processor interrupt to get the target processor to execute the RDMSR instruction. The overhead of executing the interprocessor interrupt is very high compared to the overhead of the execution of the RDMSR or RDPMC instruction.

I have not worked with process binding within kernel code, but if you can bind the kernel thread/process that is reading the counters, then you can avoid the inter-processor interrupt and get the lowest latency by simply executing the RDPMC instruction (or the corresponding RDMSR instruction). The kernel provides a C function "native_read_pmc()" (in arch/x86/include/asm/msr.h) that you can use if you know that the kernel thread is bound to the correct target processor, and a macro "rdpmcl()" that executes "native_read_pmc()" and combines the two 32-bit output fields of the RDPMC instruction into a single 64-bit value. There are several other useful macros and functions in msr.h that you might want to review....
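
To make the "two 32-bit output fields" concrete, here is a minimal stand-alone sketch of what such a read boils down to. It is not the kernel's code: combine_halves() and read_pmc_raw() are hypothetical names used only for illustration, and the sketch assumes a GCC/Clang-style compiler on x86.

```c
#include <assert.h>
#include <stdint.h>

/* RDPMC returns the counter in EDX:EAX (EDX = upper 32 bits, EAX = lower
 * 32 bits). This helper (hypothetical name) joins the halves. */
static inline uint64_t combine_halves(uint32_t low, uint32_t high)
{
    return ((uint64_t)high << 32) | low;
}

#if defined(__x86_64__) || defined(__i386__)
/* Execute RDPMC for the counter selected by `index` (passed in ECX) on
 * whichever logical processor the caller is currently running on.
 * Hypothetical helper: it faults in user space unless the kernel has
 * enabled user-mode RDPMC, and the value is only meaningful if the
 * caller stays on the same logical processor between reads. */
static inline uint64_t read_pmc_raw(uint32_t index)
{
    uint32_t low, high;
    __asm__ volatile("rdpmc" : "=a"(low), "=d"(high) : "c"(index));
    return combine_halves(low, high);
}
#endif
```

Inside the kernel you would normally call the existing helpers in msr.h rather than write your own inline assembly; the sketch only shows what they have to do.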

Aside: Also, the counters are 48 bits wide and you need to read the counters frequently enough that you can guarantee unambiguous correction for wrap-around. The maximum interval will depend on the event you are measuring, since different performance counter events have different maximum rates at which they can be incremented.
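
The wrap-around correction can be sketched as follows, assuming the 48-bit width mentioned above and at most one wrap between the two reads. The names pmc_delta() and seconds_until_wrap() are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define PMC_WIDTH_BITS 48  /* counter width assumed from the discussion above */
#define PMC_MASK ((UINT64_C(1) << PMC_WIDTH_BITS) - 1)

/* Difference between two raw counter reads taken on the same logical
 * processor. Subtracting modulo 2^48 gives the right answer as long as
 * the counter wrapped at most once between the reads. */
static inline uint64_t pmc_delta(uint64_t before, uint64_t after)
{
    return (after - before) & PMC_MASK;
}

/* Longest safe interval between reads, in seconds, for an event that
 * increments at most `events_per_second` times per second. */
static inline double seconds_until_wrap(double events_per_second)
{
    assert(events_per_second > 0.0);
    return (double)(UINT64_C(1) << PMC_WIDTH_BITS) / events_per_second;
}
```

For an event that can increment once per cycle at 4 GHz, seconds_until_wrap(4e9) comes out at roughly 70,000 seconds (about 19.5 hours), so wrap-around is easy to handle for most events as long as reads are not spaced days apart.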

(2) Performance counter programming is described in Chapter 18 of Volume 3 of the Intel Architectures Software Developers Manual. The performance counters are controlled by Model-Specific Registers (MSRs) that are written and read using the WRMSR and RDMSR instructions. These two instructions can only be executed in kernel mode. The kernel provides a function "wrmsr_on_cpu()" to set up the interprocessor interrupts to execute the WRMSR instruction on the target processor.
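
As a rough illustration of what gets written to those MSRs, the sketch below builds a value for an IA32_PERFEVTSELx register using the architectural layout (event select in bits 7:0, unit mask in bits 15:8, USR in bit 16, OS in bit 17, EN in bit 22). The macro and function names are invented for this example and are not the kernel's own definitions. Check the event codes against the manual for your processor before relying on them.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural MSR addresses for the first programmable counter
 * (macro names are made up for this sketch). */
#define EX_IA32_PERFEVTSEL0 0x186u  /* event select / control register */
#define EX_IA32_PMC0        0x0C1u  /* the counter itself */

/* Control bits in IA32_PERFEVTSELx. */
#define EX_EVTSEL_USR (UINT64_C(1) << 16)  /* count in user mode (CPL > 0) */
#define EX_EVTSEL_OS  (UINT64_C(1) << 17)  /* count in kernel mode (CPL 0) */
#define EX_EVTSEL_EN  (UINT64_C(1) << 22)  /* enable this counter */

/* Assemble the 64-bit value for an event select register. */
static inline uint64_t evtsel_encode(uint8_t event, uint8_t umask, uint64_t flags)
{
    return (uint64_t)event | ((uint64_t)umask << 8) | flags;
}
```

For example, the architectural "instructions retired" event (event 0xC0, umask 0x00) counted in both user and kernel mode encodes to 0x4300C0, and the architectural "LLC misses" event (event 0x2E, umask 0x41) encodes to 0x43412E. On processors with architectural performance monitoring version 2 or later, the matching enable bit in IA32_PERF_GLOBAL_CTRL (MSR 0x38F) also has to be set before the counter runs.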

From user space, Linux provides a device driver (/dev/cpu/*/msr) that can be used to request that the kernel execute these instructions on specific cores. The command-line executables provided by the msr-tools package (https://github.com/intel/msr-tools) provide easy shell access to the MSR interface and easy-to-understand examples of how to access the MSR device drivers in C code.
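
A minimal sketch of the user-space route, in the same spirit as the msr-tools examples but not copied from them: the MSR address is used as the file offset and each access transfers 8 bytes. The helper names are hypothetical, and the code assumes the msr driver is loaded and the caller has permission to open the device.

```c
#define _POSIX_C_SOURCE 200809L  /* for pread() under strict -std= modes */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Build "/dev/cpu/<cpu>/msr" into buf.
 * Returns 0 on success, -1 if buf is too small. */
static inline int msr_device_path(unsigned cpu, char *buf, size_t len)
{
    int n = snprintf(buf, len, "/dev/cpu/%u/msr", cpu);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Read one 64-bit MSR through an msr device node. The kernel executes
 * RDMSR on the logical processor that the device node belongs to.
 * Returns 0 on success, -1 on any failure (no device, no permission,
 * or an MSR that does not exist on this processor). */
static inline int msr_read(const char *path, uint32_t reg, uint64_t *value)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t got = pread(fd, value, sizeof(*value), (off_t)reg);
    close(fd);
    return got == (ssize_t)sizeof(*value) ? 0 : -1;
}
```

Reading counter 0 on logical processor 3 would then be msr_device_path(3, path, sizeof(path)) followed by msr_read(path, 0xC1, &value). Every such read goes through a system call, so this route is convenient for setup but far slower than RDPMC for the measurements themselves.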

Singh__Nikhilesh · 05-01-2019

Thanks a lot, Sir, for the answer. It helped a lot.

I have a slightly different concern now. As per my understanding, the performance registers are with each of the cores. So for my system with 8 logical cores, I can count events related to a core on its performance registers.

My doubt is: what happens when I try to count an event related to a resource that is shared among multiple cores, e.g., the last-level cache (LLC)? Suppose I read an LLC event counter before and after a piece of code. Isn't the result corrupted by the other cores using the LLC?

My second question is: which events should I select to count L1, L2 and L3 data-cache and instruction-cache misses? I checked the manual for Ivy Bridge and couldn't figure it out exactly. I am attaching the relevant part of the manual here for reference.

McCalpinJohn · 05-02-2019

This gets complicated fast....

Most of the counters are *intended* to count only events due to instructions issued by the same logical processor. The cache hit rate seen by a logical processor will depend on what its "cache sibling" is doing, but the counts of hits and misses will only pertain to loads and stores (and prefetches) issued by the same logical processor.
Some counter events have "core scope" and will return the same values if read by either logical processor on that core.
Some events related to shared resources are difficult (or impossible) to assign to a specific logical processor. Cache writebacks and L2 hardware prefetches are examples. I typically only use one logical processor per core, so I have not researched how these events are assigned when both logical processors are actively using the shared resource.
Some counter events have bugs. Some of the bugs are documented in the processor "specification update" document, some are documented in Chapter 19 of Volume 3 of the software developers manual, and some are not documented anywhere.

Counting cache hits and misses is much harder than one might initially suspect.

There are many different kinds of memory access: loads, stores, software prefetches, 2 L1 HW prefetchers, the core "next-page-prefetcher", 2 L2 HW prefetchers, instruction fetches, instruction cache prefetches, and many other transactions originating from outside the cores.
In the core, transactions are usually counted by instruction (or uop), independent of the size of the transaction. For the Ivy Bridge core, instructions can access bytes, words (2B), double-words (4B), quad-words (8B), 80-bit FP (10B), 128-bit SIMD (16B), or 256-bit SIMD (32B). Events like MEM_LOAD_UOPS_RETIRED are in this category.
Outside of the L1 cache, transactions are almost always counted by cache line. If you don't know the size of the operands of your instructions, then you don't know how many loads (for example) to expect for each cache line when doing contiguous accesses, so "hit rates" can be difficult to interpret.
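
A small worked example of why operand size matters, as an idealized sketch only: it assumes 64-byte cache lines, aligned contiguous loads whose size divides the line evenly, and no help from the hardware prefetchers. The function names are made up for this example.

```c
#include <assert.h>

#define CACHE_LINE_BYTES 64  /* cache line size on Ivy Bridge */

/* Number of load instructions needed to touch every byte of one cache
 * line when reading contiguous, aligned operands of the given size. */
static inline unsigned loads_per_line(unsigned operand_bytes)
{
    assert(operand_bytes > 0 && operand_bytes <= CACHE_LINE_BYTES);
    return CACHE_LINE_BYTES / operand_bytes;
}

/* Idealized L1 "hit rate" counted per load instruction when every cache
 * line misses exactly once and the remaining loads to that line hit. */
static inline double streaming_l1_hit_fraction(unsigned operand_bytes)
{
    unsigned n = loads_per_line(operand_bytes);
    return (double)(n - 1) / (double)n;
}
```

With 8-byte loads there are 8 loads per line, so the per-instruction hit fraction is 0.875 even though every single line had to be fetched from further out. With 32-byte loads the same memory traffic shows up as a hit fraction of 0.5. The line-level traffic is identical in both cases, which is why a "hit rate" means little without knowing the operand sizes.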

Appendix B.5 of the Intel Optimization Reference Manual (document 248966-040, April 2018) discusses the use of performance counters on Sandy Bridge. Almost all of that should be applicable to Ivy Bridge processors.