Software Tuning, Performance Optimization & Platform Monitoring
Discussion around monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform monitoring

Measure LLC occupancy and misses



I want to measure the cache occupancy and misses of a user-level program inside that program. The program is a loop of something, and I want to frequently read the counters so that I know the cache info for each iteration of the loop. I also want the measurement to be low overhead.

I searched around and found that rdmsr and rdpmc look relevant. It seems I have to use rdpmc because this is a user-level program. However, when I tried the test program from that post (see below), it gives me a segmentation fault. It seems I need to enable rdpmc, but I don't know how to do that.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <signal.h>
#include <sched.h>
#include <string.h>
#include <errno.h>

#define FATAL(fmt,args...) do {                \
    ERROR(fmt, ##args);                        \
    exit(1);                                   \
  } while (0)

#define ERROR(fmt,args...) \
    fprintf(stderr, fmt, ##args)

#define rdpmc(counter,low,high) \
     __asm__ __volatile__("rdpmc" \
        : "=a" (low), "=d" (high) \
        : "c" (counter))

int cpu, nr_cpus;

void handle ( int sig )
{
  FATAL("cpu %d: caught %d\n", cpu, sig);
}

int main ( int argc, char *argv[] )
{
  nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
  for (cpu = 0; cpu < nr_cpus; cpu++) {

    pid_t pid = fork();
    if (pid == 0) {
      cpu_set_t cpu_set;
      CPU_ZERO(&cpu_set);          /* clear the set before adding a cpu */
      CPU_SET(cpu, &cpu_set);
      if (sched_setaffinity(0, sizeof(cpu_set), &cpu_set) < 0)
        FATAL("cannot set cpu affinity: %m\n");

      signal(SIGSEGV, &handle);

      unsigned int low, high;
      rdpmc(0, low, high);

      ERROR("cpu %d: low %u, high %u\n", cpu, low, high);
      exit(0);                     /* child must not continue the fork loop */
    }
  }
  return 0;
}

I am new to performance monitoring, so I hope someone could give me some detailed steps for doing it. I.e., how to enable rdpmc, and how to use it in my code (it seems I need to program some programmable counters, but I don't know how to do it). Some example code would be very helpful.

My CPU is Intel(R) Xeon(R) CPU E5-2650 v4.

Black Belt
If you get a fault when executing RDPMC (on Linux this is typically reported as a segmentation fault), it is because the OS has chosen not to allow user-mode execution of this instruction.  This is controlled by bit 8 of the CR4 processor control register, typically referred to as "CR4.PCE".   This setting is discussed in the Intel Architectures Software Developer's Manual, Volume 3, Sections 2.5, 2.8, 5.9, etc.
Only the OS can change the setting of this bit, and I don't know of any operating systems that have an interface that allows the user to request that the bit be changed.   In the Linux world, the default used to be to prohibit user-mode RDPMC, but that changed several years ago, and all of the systems I use currently allow it. 
User-mode RDPMC access is often disabled in virtual machines and in high-security environments where the system administrators don't want to give users access to high-resolution timers.  I have heard that some of these may also disable RDTSC (using the CR4.TSD bit), but I have not run across any such configuration in my work.

After searching around, I found an easy workaround: echo 2 > /sys/devices/cpu/rdpmc

Now I am wondering how to program the counters so that I can use rdpmc to read cache occupancy and misses.

Black Belt

Thanks for pointing out this interface!  I had not seen it before (and it is typically fairly painful finding documentation on these sysfs files....).

To program the counters manually, I use msrtools-1.3.   It provides command line tools to read and write MSRs using the /dev/cpu/*/msr device drivers.  The rdmsr.c and wrmsr.c codes provide good examples if you want to include the accesses to the /dev/cpu/*/msr devices inline in your own codes.

For the core performance counters, the procedure for manual programming is:

  1. Make sure the counters are globally enabled on each logical processor that you plan to use (IA32_PERF_GLOBAL_CTRL 0x38F)
  2. (Optional) make sure that the fixed-function counters are enabled on each logical processor (IA32_FIXED_CTR_CTRL 0x38D)
  3. Program the Performance Counter Event Select registers on each logical processor that you plan to use, using the event codes and umasks described in Sections 19.1 and 19.5 of Volume 3 of the Intel Architectures SW Developer's Manual.
    1. IA32_PERFEVTSEL0 0x186
    2. IA32_PERFEVTSEL1 0x187
    3. IA32_PERFEVTSEL2 0x188
    4. IA32_PERFEVTSEL3 0x189
  4. (Optional) When using one of the OFFCORE_RESPONSE events, an extra MSR needs to be programmed with the detailed filter information discussed in Volume 3 of the Intel Architectures SW Developer's Manual.
    1. For performance counter event 0xB7, the auxiliary MSR is 0x1a6
    2. For performance counter event 0xBB, the auxiliary MSR is 0x1a7
    3. These events can be very confusing -- I recommend starting from existing, known-good examples rather than composing the filter values from scratch.

I often program the counters (using the procedure above) while running as root, so the program that I am testing does not need root privileges -- it just executes RDPMC instructions at the desired locations.

The specific events you will need are a larger topic that we can discuss once you get the infrastructure working.....


Thanks John!

I am trying the steps. 

./rdmsr 0x38f gives me 70000000f, which seems to enable all counters.

I then set 0x38D to 0x1ff (to enable all 3 fixed counters) and tried them out. I wrote a program called test, which uses rdpmc with ECX = (1<<30) to read fixed counter 0. But the counters only work on core 0 (i.e., taskset -c 0 ./test gives new counter values, while taskset -c x ./test always gives the same value). The manual suggests that a counter only counts events on the core whose IA32_FIXED_CTR_CTRL was programmed. But I also tried taskset -c 1 ./wrmsr 0x38d 0x1ff and then ran test on core 1, and it still gives me the same value. How can I get the counters working on the other cores?

I also tried setting 0x186 to 0x13412e (counting LLC misses: evt 0x2e, umask 0x41, USR=OS=EN=1). But when I rdpmc counter 0, the value does not change, no matter which core I run the test on.



I am working on a task to measure LLC misses. I have read the whole post, but I still can't get the right results. Could you please share the test program with me?

Thank you very much!

Black Belt

Do you want to measure LLC misses using the core performance counters or using the uncore performance counters?  The configuration and access is quite different between these two sets.

You can measure whole-program counts using the "perf stat" infrastructure if your operating system is new enough (relative to how new the processor is).

If you need inline counts you can use PAPI or LIKWID.

There is also an infrastructure that programs and reads many different sets of counters at programmable intervals -- it is currently limited to Intel Skylake Xeon and Cascade Lake Xeon processors.

Black Belt

The "rdmsr" and "wrmsr" programs support command-line options to access an MSR on a specific core (e.g., "rdmsr -p 0 0x186") or on every core (e.g., "rdmsr -a 0x186").  The same options work for "wrmsr", so it is easy to configure the counters on all cores at once.

The "NMI watchdog" often uses one of the fixed-function counters (typically counter 1).  When it does this, it will set the "interrupt on overflow" bit in MSR 0x38d, so "rdmsr -p 0 -c 0x38d" may return something like 0x0b0 or 0x3b3.   Disabling the NMI watchdog will clear the interrupt on overflow function, but also typically disables the counter.  Using any of the Linux "perf" functionality (e.g., "perf stat a.out") will also often disable the fixed-function counters on exit, so my scripts typically re-enable everything just in case I accidentally used a "perf" command before I start my job with inline counters.

It looks like you misplaced some bits in MSR 0x186.  The event you want should be programmed as:

wrmsr -a 0x186 0x0043412e

To help with individual bit fields, the "rdmsr" command supports an option to report any contiguous range of bits, e.g.:

rdmsr -p 0 -f 22:22 -c 0x186

should return "0x1" if bit 22 (ENable) is set.  Your example set bit 20 instead of bit 22 -- this enables interrupt on overflow, which requires that an interrupt handler be installed in the kernel.  Fortunately with bit 22 cleared, the counter did not increment, so it would never generate a performance monitoring interrupt.



I can successfully read cache misses!

However, I am a bit confused by the L3 cache occupancy described in the manual. It seems I need to configure it in a different way from configuring the cache miss counter. Is it possible to read L3 cache occupancy inline in my code? If so, how to set it up?

Black Belt

The "occupancy" described in Section 17.18 of Volume 3 of the Intel Architectures SW Developer's Manual is part of the Cache Monitoring Technology (CMT) infrastructure.   I have not used this infrastructure, but it looks like it is controlled and accessed via MSRs, which can only be accessed in kernel mode.    There is a lot of other information about L3 accesses that can be obtained from the core performance counters (particularly with the OFFCORE_RESPONSE events), but if you need to access the CMT infrastructure, then accesses will be more expensive.

With suitable permissions, a user-mode job can use the /dev/cpu/*/msr device drivers to request that the kernel perform these reads and writes, but they will not be low-overhead operations. 

Inside the kernel, the execution of an RDMSR instruction may only take 100 cycles (it varies by MSR number and by processor generation), but from user space the overhead will typically be in the 5000-20000 cycle range.   Why so expensive?   The device-driver access traps into the kernel, which then sets up an interprocessor interrupt to run the RDMSR/WRMSR code on the target logical processor, and then returns the data value (or return code) to the user process.  

I have not studied this systematically, but my impression is that the interprocessor interrupts are extra slow if you are interrupting a logical processor in another socket, or if you are interrupting a logical processor that is "busy" with another process.