Solved:

Sergey_S_Intel2 · ‎10-07-2015

Modern CPUs have quite a lot of performance counters. How can I read them? I know of many performance monitoring and profiling programs and libraries (PAPI, VTune, profiler, linux_perf, etc.), but all of these methods require additional computation time (intrusiveness).

Also, modern Intel CPUs support the rdpmc instruction, and Linux (currently) allows this instruction at user level.

I would like to understand how to use a GCC intrinsic to get CPU cycles and instructions executed, in order to profile some function in my C code.

I understand I have to pin program execution to a particular CPU core. Let’s assume the CPU is Haswell.

I would appreciate a small example of rdpmc usage.

For example, the code might look like this:

#include <stdio.h>

void foo(void); // function being profiled, defined elsewhere

long long get_cycles(void){
    unsigned int a=0, d=0;
    int ecx=(1<<30)+1; //What counter it selects?
    __asm __volatile("rdpmc" : "=a"(a), "=d"(d) : "c"(ecx));
    return ((long long)a) | (((long long)d) << 32);
}

int main (int argc, char* argv[])
{
    long long start, finish;
    int i;

    start = get_cycles();
    for (i = 0; i < 1000; i++)  {
        foo();
    }
    finish = get_cycles();
    // The difference is divided by the iteration count, so this is an average
    printf("Average cycles per call : %f\n",((double)(finish-start))/1000.0);
    return 0;
}

What must the ecx variable in get_cycles() contain to read CPU cycles and instructions executed?

Thank you

McCalpinJohn · ‎10-21-2015

The performance counters are complicated largely because the hardware is complicated, and secondarily because Intel does not want to expose microarchitectural implementation details without good reason. (Patent trolls can be quite creative at re-interpreting the patents that they own to claim that a big company is violating the patents -- but they need to have some idea of how the processor is implemented to make these claims.)

Some aspects of hardware performance counters probably need to be restricted to elevated privilege levels. For example, configuring the hardware performance counters to generate interrupts has the potential to severely impact system performance and usability. On the other hand, most of what the performance counters do is perfectly safe -- the vendors do a very good job of ensuring that programming random bits into the performance counter control registers is "safe" -- you may not be able to interpret the results, but the processor runs just fine.

I prefer that the hardware performance counters remain as low-level features, and not as registers that get saved and restored on context switches. (It would be hard to use the counters to measure things like context switch overhead if they were swapped in and out.) But leaving the counters as "raw" low-level features means that they cannot easily be shared, and it means that they provide a potentially high-bandwidth covert channel between processes.

In the high performance computing world where I work, systems are seldom time-shared, so we don't really need to worry about either sharing the counters or about covert channels. In the production environment a job is assigned a set of nodes and no other user is allowed access to those nodes for the duration of the job. The nodes are still shared between the OS (and all its subsidiary processes) and the user (and all the auxiliary processes that the user might cause to be started), but since this is the standard mode of operation, dealing with this sharing is part of the performance puzzle that we are trying to understand.

To use the hardware performance counters manually, a variety of tools are needed:

For the hardware performance counters in the processor cores, I build the "rdmsr" and "wrmsr" command-line tools from "msrtools-1.2".
1. I use a script to configure the global configuration registers and the PERFEVTSEL* MSRs for the programmable core counters.
2. For whole-program measurements, I read the counters using the "rdmsr" program before and after the execution (taking care that the run is short enough that the counters can't be incremented more than 2^48 times during the run). You can also use "perf stat" for these sorts of measurements.
3. For interval measurements inside the code, I program the counters using the script, then use the RDPMC instruction to read them at the desired locations in the code.
For the "uncore" counters there are three different interfaces used, depending on the processor model:
1. Some "uncore" counters use MSRs and can be configured using "wrmsr" as above. Unfortunately these can only be read from inside the kernel (since the RDMSR instruction can only be executed at ring 0). If the program is being run by root (or is owned by root and has the setuid bit set), then the program can open the /dev/cpu/*/msr device files and read or write the counters using pread() or pwrite() calls. These are kernel calls so they cost a few thousand cycles each, but there is nothing that can be done about this. (One thing that could help is to build a kernel module that could return multiple MSR values with a single call.)
2. Some "uncore" counters are in "PCI configuration space". The root user can read/write these counters using the "setpci" command-line program. As with the MSR-based counters, a root user can open the device driver files (in /proc/bus/pci, I think) and read/write the counters using pread() and pwrite() calls (limited to 32-bit transactions).
3. Some processors include "uncore" counters in a different range of memory-mapped IO space. Working with these is an advanced topic....
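The pread() access to /dev/cpu/*/msr mentioned above can be sketched roughly as follows. This is a minimal illustration, not the author's code: the helper name read_msr_on_cpu is hypothetical, and it assumes the Linux msr driver is loaded and the caller has permission to open the device file. The file offset passed to pread() is the MSR number, and each read returns 8 bytes.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: read one 64-bit MSR on a given logical CPU through
 * the Linux msr driver. Returns 0 on success, -1 on any failure (device
 * file missing, no permission, or the MSR is not readable). */
static int read_msr_on_cpu(int cpu, uint32_t msr, uint64_t *value)
{
    char path[64];
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* The MSR number is used as the file offset; the driver returns 8 bytes. */
    ssize_t n = pread(fd, value, sizeof(*value), (off_t)msr);
    close(fd);

    return n == (ssize_t)sizeof(*value) ? 0 : -1;
}
```

A caller would pass, for example, cpu 0 and MSR 0x38f (IA32_PERF_GLOBAL_CTRL) and check the return value before using the result. Each call is still a system call, so the cost of a few thousand cycles noted above applies.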

Here is a fairly typical script that I use to set up the counters (edited for clarity):

#!/bin/bash

export NCORES=`cat /proc/cpuinfo | grep -c processor`
echo "Number of cores is $NCORES"
export MAXCORE=`expr $NCORES - 1`

# Enable all counters in IA32_PERF_GLOBAL_CTRL
#   bits 34:32 enable the three fixed-function counters
#   bits 7:0 enable the eight programmable counters
echo "Checking IA32_PERF_GLOBAL_CTRL on all cores"
echo "  (should be 00000007000000ff)"
for core in `seq 0 $MAXCORE`
do
	echo -n "$core "
	~/bin/rdmsr -p $core -x -0 0x38f
	~/bin/wrmsr -p $core 0x38f 0x00000007000000ff
done

# Core Performance Counter Event Select MSRs
#   Counter	 MSR
#	   0    0x186
#	   1    0x187
#	   2    0x188
#	   3    0x189
#	   4    0x18a
#	   5    0x18b
#	   6    0x18c
#	   7    0x18d

# Dump all performance counter event select registers on all cores
if [ 0 == 1 ]
then
	echo "Printing out all performance counter event select registers"
	echo "MSR    CORE    CurrentValue"
	for PMC_MSR in 186 187 188 189 18a 18b 18c 18d
	do
		for CORE in `seq 0 $MAXCORE`
		do
			echo -n "$PMC_MSR $CORE "
			~/bin/rdmsr -p $CORE -0 -x 0x${PMC_MSR}
		 done
	done
fi

# Counter 0 Uops Dispatched on Port 0		0x004301a1
# Counter 1 Uops Dispatched on Port 1		0x004302a1
# Counter 2 Uops Dispatched on Port 2		0x004304a1
# Counter 3 Uops Dispatched on Port 3		0x004308a1
# Counter 4 actual core cycles unhalted		0x0043003c
# Counter 5 Uops Dispatched on Port 5		0x004320a1
# Counter 6 cycles with no uops delivered from front end to
#   back end & there is no back end stall	0x0143019c
# Counter 7 Uops issued from RAT to RS		0x0043010e

echo "Programming counters 0-7"
for core in `seq 0 $MAXCORE`
do
	~/bin/wrmsr -p $core 0x186 0x004301a1
	~/bin/wrmsr -p $core 0x187 0x004302a1
	~/bin/wrmsr -p $core 0x188 0x004304a1
	~/bin/wrmsr -p $core 0x189 0x004308a1
	~/bin/wrmsr -p $core 0x18a 0x0043003c
	~/bin/wrmsr -p $core 0x18b 0x004320a1
	~/bin/wrmsr -p $core 0x18c 0x0143019c
	~/bin/wrmsr -p $core 0x18d 0x0043010e
done
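The hexadecimal values written to the PERFEVTSEL MSRs in the script follow the architectural IA32_PERFEVTSELx layout: event select in bits 7:0, unit mask in bits 15:8, the USR, OS and EN flags in bits 16, 17 and 22, and the counter mask in bits 31:24. The sketch below rebuilds the script's values from those fields. The helper names are hypothetical, and the USR+OS+EN combination is hard-coded only because every value in the script uses it (the 0x43 byte).

```c
#include <stdint.h>

/* Flag bits of IA32_PERFEVTSELx used by the script above. */
#define EVTSEL_USR (UINT64_C(1) << 16)  /* count in user mode   */
#define EVTSEL_OS  (UINT64_C(1) << 17)  /* count in kernel mode */
#define EVTSEL_EN  (UINT64_C(1) << 22)  /* enable the counter   */

/* Hypothetical helper: build an event-select value that counts in both
 * user and kernel mode, as all the values in the script do. */
static uint64_t perfevtsel_encode(uint8_t event, uint8_t umask, uint8_t cmask)
{
    return (uint64_t)event
         | ((uint64_t)umask << 8)
         | EVTSEL_USR | EVTSEL_OS | EVTSEL_EN
         | ((uint64_t)cmask << 24);
}

/* Field extractors, useful when inspecting values returned by rdmsr. */
static uint8_t perfevtsel_event(uint64_t v) { return (uint8_t)(v & 0xff); }
static uint8_t perfevtsel_umask(uint64_t v) { return (uint8_t)((v >> 8) & 0xff); }
static uint8_t perfevtsel_cmask(uint64_t v) { return (uint8_t)((v >> 24) & 0xff); }
```

For instance, perfevtsel_encode(0xa1, 0x01, 0) gives 0x004301a1 (counter 0 in the script), and perfevtsel_encode(0x9c, 0x01, 1) gives 0x0143019c (counter 6, the only entry with a non-zero counter mask).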

McCalpinJohn · ‎10-08-2015

In recent Intel processors there are two ways to use the input argument for the RDPMC instruction.

Values of 0 to 3 (or 0 to 7) select one of the programmable performance counters.

Values of 2^30, 2^30+1, and 2^30+2 select one of the "fixed-function" performance counters. Documentation of this use is not very clear, and not particularly easy to find, so I usually just go back to my own code rather than trying to find it in the Intel documents.

The routines below provide access to each of the "fixed function" performance counter events with names that are easier to remember than the corresponding performance counter number.

Note that on some/many systems these fixed-function counters are either not enabled by default or they are enabled and in use by another process (sometimes the BIOS and sometimes the "NMI watchdog" process). If they are in use by another process they are probably configured to generate an interrupt on overflow, and the interrupt handler will reset the counter value every time. For example, the NMI watchdog on Linux systems often uses the "actual cycles" counter set up to overflow every 2 billion cycles (i.e., the counter is reset to (2^48-1 - 2^31) by the interrupt handler). In this case it is still perfectly safe to read the counter and it is still quite useful for measuring over short intervals (i.e., much less than 2 billion cycles) as long as you can do "sanity-checking" on the results and are able to discard the occasional results that are corrupted by the reset of the counter.
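One possible form of the "sanity-checking" mentioned above is sketched here. It is an assumption about how such a check could be written, not the author's code: the helper names are hypothetical, the counters are taken to be 48 bits wide, and max_expected is an upper bound the caller must choose for the interval being measured. A reset by an overflow handler between the two reads shows up as an enormous masked difference, which the bound rejects.

```c
#include <stdbool.h>
#include <stdint.h>

#define PMC_WIDTH_BITS 48
#define PMC_MASK ((UINT64_C(1) << PMC_WIDTH_BITS) - 1)

/* Difference between two 48-bit counter readings. The mask keeps the
 * result correct when the counter wraps naturally between the reads. */
static uint64_t pmc_delta48(uint64_t start, uint64_t end)
{
    return (end - start) & PMC_MASK;
}

/* Hypothetical sanity check: accept an interval only if the delta is no
 * larger than a caller-supplied bound. If an overflow handler reloaded
 * the counter in between, the delta comes out close to 2^48 and fails. */
static bool pmc_interval_plausible(uint64_t start, uint64_t end,
                                   uint64_t max_expected)
{
    return pmc_delta48(start, end) <= max_expected;
}
```

Samples that fail the check would simply be discarded and the measurement repeated.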

// rdpmc_instructions uses a "fixed-function" performance counter to return the count of retired instructions on
//       the current core in the low-order 48 bits of an unsigned 64-bit integer.
unsigned long rdpmc_instructions()
{
   unsigned a, d, c;

   c = (1<<30);
   __asm__ volatile("rdpmc" : "=a" (a), "=d" (d) : "c" (c));

   return ((unsigned long)a) | (((unsigned long)d) << 32);
}

// rdpmc_actual_cycles uses a "fixed-function" performance counter to return the count of actual CPU core cycles
//       executed by the current core.  Core cycles are not accumulated while the processor is in the "HALT" state,
//       which is used when the operating system has no task(s) to run on a processor core.
unsigned long rdpmc_actual_cycles()
{
   unsigned a, d, c;

   c = (1<<30)+1;
   __asm__ volatile("rdpmc" : "=a" (a), "=d" (d) : "c" (c));

   return ((unsigned long)a) | (((unsigned long)d) << 32);
}

// rdpmc_reference_cycles uses a "fixed-function" performance counter to return the count of "reference" (or "nominal")
//       CPU core cycles executed by the current core.  This counts at the same rate as the TSC, but does not count
//       when the core is in the "HALT" state.  If a timed section of code shows a larger change in TSC than in
//       rdpmc_reference_cycles, the processor probably spent some time in a HALT state.
unsigned long rdpmc_reference_cycles()
{
   unsigned a, d, c;

   c = (1<<30)+2;
   __asm__ volatile("rdpmc" : "=a" (a), "=d" (d) : "c" (c));

   return ((unsigned long)a) | (((unsigned long)d) << 32);
}

Sergey_S_Intel2 · ‎10-08-2015

John,

Thank you very much. I completely agree with you – Intel documentation has no clear explanation of the rdpmc instruction usage.

As I understand it, these counter numbers depend on the CPU family, which can be detected with the cpuid instruction. As described in the "linux_perf" interface, there are some "common" counters that are supported on many CPUs (not only Intel).

Could you please share these counters if you have such information?

To summarize previous info:

enum CPUCounters {

    cpuCOUNT_HW_INSTRUCTIONS = 1<<30, //count of retired instructions on the current core in the low-order 48 bits of an unsigned 64-bit integer

    cpuCOUNT_HW_CPU_CYCLES = (1<<30)+1,// count of actual CPU core cycles executed by the current core.  Core cycles are not accumulated while the processor is in the "HALT" state, which is used when the operating system has no task(s) to run on a processor core.

    cpuCOUNT_HW_REF_CPU_CYCLES = (1<<30)+2, //count of "reference" (or "nominal") CPU core cycles executed by the current core.  This counts at the same rate as the TSC, but does not count when the core is in the "HALT" state.  If a timed section of code shows a larger change in TSC than in rdpmc_reference_cycles, the processor probably spent some time in a HALT state.

    cpuSIZE

};

Also, it would be interesting to know how to detect whether a counter is used by another program. In the case of the watchdog, I can detect it by reading the /proc/sys/kernel/nmi_watchdog file on Linux. Is there any general way to tell whether a particular counter is used by some other process?

How can these counters be cleared (set to zero)?

Modern Linux kernels allow rdpmc at user level. If this instruction is run on relatively old kernels, the program crashes.

How can I detect at runtime whether the rdpmc instruction can be run? As I understand it, I have to read some special bit of a special register, but I have no example of how to do it.

Thank you very much

Sergey

Sergey_S_Intel2 · ‎10-09-2015

Thomas,

As far as I can see, I can’t read the /sys/devices/cpu/rdpmc file at user level. I think I can use ioctl calls to configure counters for rdpmc. Could you please provide an example of how to manage the simple counters discussed above? I can call ioctl() in the global application constructor and simply use rdpmc from user level later.

Does the counter cpuCOUNT_HW_INSTRUCTIONS (mentioned above) always correspond to the instruction counter, or must it be configured? In other words, is it guaranteed that rdpmc(1<<30) always returns the HW instruction counter value?

Sergey

McCalpinJohn · ‎10-09-2015

(1) To see if the RDPMC instruction is allowed at runtime, just try to use it and build an exception handler to catch the signal if one is thrown. The test code I use is:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <signal.h>
#include <sched.h>
#include <string.h>
#include <errno.h>

#define FATAL(fmt,args...) do {                \
    ERROR(fmt, ##args);                        \
    exit(1);                                   \
  } while (0)

#define ERROR(fmt,args...) \
    fprintf(stderr, fmt, ##args)

#define rdpmc(counter,low,high) \
     __asm__ __volatile__("rdpmc" \
        : "=a" (low), "=d" (high) \
        : "c" (counter))

int cpu, nr_cpus;

void handle ( int sig )
{
  FATAL("cpu %d: caught %d\n", cpu, sig);
}

int main ( int argc, char *argv[] )
{
  nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
  for (cpu = 0; cpu < nr_cpus; cpu++) {

    pid_t pid = fork();
    if (pid == 0) {
      cpu_set_t cpu_set;
      CPU_ZERO(&cpu_set);
      CPU_SET(cpu, &cpu_set);
      if (sched_setaffinity(pid, sizeof(cpu_set), &cpu_set) < 0)
        FATAL("cannot set cpu affinity: %m\n");

      signal(SIGSEGV, &handle);

      unsigned int low, high;
      rdpmc(0, low, high);

      ERROR("cpu %d: low %u, high %u\n", cpu, low, high);
      break;
    }
  }

  return 0;
}

(2) The fixed-function performance counters are the same on all recent Intel processors. As part of the "Architectural Performance Monitoring" facility, they should not change (at least not very often!).

The fixed-function counters are described in Section 18.2 of Volume 3 of the Intel Architectures Software Developer's Manual (document 325384, revision 055). Recent processors typically support Architectural Performance Monitoring Version 3, which is described in Section 18.2.3, but looking through Chapter 19 of Volume 3 of the Intel Architectures Software Developer's Manual, it looks like these fixed events are the same all the way back to the Core processor and are also supported on the Atom processors (as well as all of the more recent processors).

The specific events counted by the three fixed-function architectural performance counters are described in Table 19-2 of Section 19.1 "Architectural Performance Monitoring Events". The assignment of function to the MSR address of the fixed-function counters is definitely fixed and the fixed-function counters are referred to as FIXED_CTR0, FIXED_CTR1, and FIXED_CTR2. It seems extremely unlikely that Intel would change the mapping of the RDPMC counter numbers to access these using anything other than the obvious approach of (1<<30), (1<<30)+1, and (1<<30)+2.
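The two ways of naming a fixed-function counter (its RDPMC index and its MSR address) can be put side by side in a small sketch. The helper names are hypothetical. The MSR base 0x309 for IA32_FIXED_CTR0 is the architectural address, and the RDPMC index follows the numbering given above. The parentheses matter in C: `1 << 30 + 1` parses as `1 << 31`, because `+` binds tighter than `<<`.

```c
#include <stdint.h>

#define RDPMC_FIXED_FLAG    (UINT32_C(1) << 30)  /* selects the fixed-function bank */
#define MSR_IA32_FIXED_CTR0 UINT32_C(0x309)      /* CTR1 and CTR2 follow at +1, +2  */

/* Value to load into ECX before RDPMC for fixed-function counter n (0..2). */
static uint32_t fixed_ctr_rdpmc_index(unsigned n)
{
    return RDPMC_FIXED_FLAG + n;
}

/* MSR address of the same counter, for use with rdmsr/wrmsr. */
static uint32_t fixed_ctr_msr(unsigned n)
{
    return MSR_IA32_FIXED_CTR0 + n;
}
```

So fixed counter 1 (actual core cycles) is read with ECX = 0x40000001 through RDPMC, or at MSR 0x30a through the msr device.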

(3) Control of the counters is through MSRs. The MSRs relating to performance counters are described in Chapters 18 and 35 of Volume 3 of the Intel Software Developer's Manual.

Linux exposes a device driver for the MSRs via the /dev/cpu/*/msr interfaces.
The command-line tools "rdmsr" and "wrmsr" from "msrtools-1.2" provide an easy to use interface to read and write MSRs.
By default, root access is required to read or write the /dev/cpu/*/msr device drivers.
- You can run "rdmsr" and "wrmsr" from a root account, or
- You can chgrp the /dev/cpu/*/msr files to a group that your user account belongs to and then chmod the /dev/cpu/*/msr files to give group read/write permissions, or
- You can change the ownership of the rdmsr and wrmsr binaries to root and mark them as "setuid", or
- You can write your own loadable kernel module to do exactly what you need.
In the most recent versions of Linux you may also need to fiddle with "capabilities" to enable access -- I don't know how this works.

(4) There is no general "reservation" mechanism for the counters, but it is pretty easy to tell if the fixed-function counters are in use.

First you need to look at the IA32_PERF_GLOBAL_CTRL MSR (0x38F) to see if the counters are globally enabled. This is described in each of the subsections of Section 18.2 (Architectural Performance Monitoring) of Volume 3 of the Software Developer's Guide, as well as in Chapter 35. There is one bit to enable each of the fixed-function counters and one bit to enable each of the programmable counters.
Next you need to check the IA32_FIXED_CTR_CTRL MSR (0x38D) (described in the same places). This MSR determines whether the event counts in user mode or kernel mode or both, whether the event counts for only the logical processor that programmed it or for both logical processors that share a physical core (when HyperThreading is enabled), and whether the counter generates a Performance Monitor Interrupt (PMI) when it overflows its 48-bit range.
A fixed-function counter is almost certainly in use if its PMI bit in the IA32_FIXED_CTR_CTRL MSR is set.
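The two checks above can be expressed as a small decoder. This is a sketch with hypothetical names. It assumes the architectural layout of IA32_FIXED_CTR_CTRL, where each fixed counter owns a 4-bit field (bit 0 = count in kernel mode, bit 1 = count in user mode, bit 2 = AnyThread, bit 3 = PMI on overflow), and of IA32_PERF_GLOBAL_CTRL, where bit 32+n enables fixed counter n. The two MSR values are assumed to have been read already, for example with rdmsr.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded state of one fixed-function counter (hypothetical struct). */
struct fixed_ctr_state {
    bool globally_enabled;  /* bit 32+n of IA32_PERF_GLOBAL_CTRL (0x38F) */
    bool counts_os;         /* field bit 0: count at ring 0              */
    bool counts_usr;        /* field bit 1: count at ring > 0            */
    bool any_thread;        /* field bit 2: count for both HT siblings   */
    bool pmi_on_overflow;   /* field bit 3: raise a PMI on overflow      */
};

/* Decode the control bits for fixed counter n (0..2) from the two MSRs. */
static struct fixed_ctr_state decode_fixed_ctr(unsigned n,
                                               uint64_t global_ctrl,
                                               uint64_t fixed_ctr_ctrl)
{
    uint64_t field = (fixed_ctr_ctrl >> (4 * n)) & 0xf;
    struct fixed_ctr_state s;

    s.globally_enabled = ((global_ctrl >> (32 + n)) & 1) != 0;
    s.counts_os        = (field & 0x1) != 0;
    s.counts_usr       = (field & 0x2) != 0;
    s.any_thread       = (field & 0x4) != 0;
    s.pmi_on_overflow  = (field & 0x8) != 0;
    return s;
}

/* The counter only increments if it is enabled both globally and locally. */
static bool fixed_ctr_is_counting(struct fixed_ctr_state s)
{
    return s.globally_enabled && (s.counts_os || s.counts_usr);
}
```

A counter whose pmi_on_overflow flag is set is, by the reasoning above, almost certainly owned by some other user such as the NMI watchdog.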

Sergey_S_Intel2 · ‎10-18-2015

Sorry for the delay; I was overloaded with other tasks.

John,

I found a machine with a HSW CPU (E5-2697 v3 @ 2.60GHz), Red Hat Enterprise Linux Server release 6.5 (Santiago), and kernel 2.6.32-358.6.2.el6.x86_64, with rdpmc enabled.

It doesn’t fail if I use rdpmc instruction. I ran your test code and found:

cpu 0: low 4085649949, high 65535
cpu 1: low 3651737778, high 65535
cpu 2: low 3404553720, high 65535
cpu 3: low 2885785273, high 65535
cpu 4: low 2163297754, high 65535
cpu 5: low 3387747633, high 65535
cpu 6: low 4036661582, high 65535
cpu 7: low 4254544390, high 65535
cpu 8: low 2344492980, high 65535
cpu 9: low 3150679521, high 65535
cpu 10: low 3459804814, high 65535
cpu 11: low 3361664909, high 65535
… etc

What do these numbers mean, given that the test uses rdpmc(0, low, high)? Why is “low” different on different CPUs?

Thomas,

The read() system call is highly intrusive. I integrated this perf_events code into my test to initialize the performance counter system.

I use the following main loop (some unimportant code, such as output, removed):

int perf_fds;

void init_instructions() {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(struct perf_event_attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(struct perf_event_attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.inherit = 1;
    attr.pinned = 1;
    attr.exclude_idle = 1;
    attr.exclude_kernel = 1;
    perf_fds = perf_event_open(&attr, 0, -1, -1, 0);
    ioctl(perf_fds, PERF_EVENT_IOC_RESET, 0); // Resetting counter to zero
    ioctl(perf_fds, PERF_EVENT_IOC_ENABLE, 0); // Start counters
}

int main () {
    init_instructions();
    for(int attempts = 0; attempts <= 20; ++attempts) {
        rdtscp(&chipOrig, &coreOrig);
        foo(loop, &tmp, &attStart, &attEnd, timerData.data(), &chip, &core, perf_fds);
    }
    close_instructions();
    return 0;
}

It calls foo() from another object file so that the compiler cannot optimize across the call.

void foo(int loop, long long *tmp, long long *attStart, long long *attEnd, long long *timerData, unsigned long *chip, unsigned long *core, int pId) {
    *attStart = rdtsc();
    for(int i = 0; i < loop; i++) {
        long long start = read_perf_instructions(pId);//__builtin_ia32_rdpmc(INSTR_COUNT);
        *tmp += rdtsc();
        timerData[i] = read_perf_instructions(pId) - start;
    }
    *attEnd = rdtscp(chip, core);
}

In this loop I measure the number of instructions between the read_perf_instructions(pId) calls.

If I use the read() system call to read the perf_event counter, I get the following output:

Loop iterations 65536, result vector 524288 bytes
      Iter   Average       Min       Max    Median  First 10 values
         0    1445.3        20        21        20  20 20 20 20 20 20 20 20 20 20
         1    1439.2        20        21        20  20 20 20 20 20 20 20 20 20 20
         2    1442.4        20        21        20  20 20 20 20 20 20 20 20 20 20
         3    1441.3        20        21        20  20 20 20 20 20 20 20 20 20 20
         4    1441.4        20        21        20  20 20 20 20 20 20 20 20 20 20
…
        16    1439.2        20        21        20  20 20 20 20 20 20 20 20 20 20
        17    1438.3        20        21        20  20 20 20 20 20 20 20 20 20 20
        18    1438.9        20        21        20  20 20 20 20 20 20 20 20 20 20
        19    1439.7        20        21        20  20 20 20 20 20 20 20 20 20 20
        20    1438.7        20        20        20  20 20 20 20 20 20 20 20 20 20

Average means “(attEnd-attStart) / loop” in the listing above.

Min, Max and Median are from the vector timerData[].

The loop in foo() looks like this in assembly:

  401876:       4c 8d 7c 24 28          lea    0x28(%rsp),%r15
  40187b:       4d 89 c5                mov    %r8,%r13
  40187e:       45 31 e4                xor    %r12d,%r12d
  401881:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  401888:       ba 08 00 00 00          mov    $0x8,%edx
  40188d:       4c 89 fe                mov    %r15,%rsi
  401890:       44 89 f7                mov    %r14d,%edi
  401893:       48 c7 44 24 28 00 00    movq   $0x0,0x28(%rsp)
  40189a:       00 00
  40189c:       e8 57 ef ff ff          callq  4007f8 <read@plt>
  4018a1:       48 8b 44 24 28          mov    0x28(%rsp),%rax
  4018a6:       48 89 44 24 08          mov    %rax,0x8(%rsp)
  4018ab:       0f 31                   rdtsc
  4018ad:       48 c1 e2 20             shl    $0x20,%rdx
  4018b1:       4c 89 fe                mov    %r15,%rsi
  4018b4:       44 89 f7                mov    %r14d,%edi
  4018b7:       48 8d 04 02             lea    (%rdx,%rax,1),%rax
  4018bb:       48 01 03                add    %rax,(%rbx)
  4018be:       ba 08 00 00 00          mov    $0x8,%edx
  4018c3:       48 c7 44 24 28 00 00    movq   $0x0,0x28(%rsp)
  4018ca:       00 00
  4018cc:       41 83 c4 01             add    $0x1,%r12d
  4018d0:       e8 23 ef ff ff          callq  4007f8 <read@plt>
  4018d5:       48 8b 44 24 28          mov    0x28(%rsp),%rax
  4018da:       48 2b 44 24 08          sub    0x8(%rsp),%rax
  4018df:       49 89 45 00             mov    %rax,0x0(%r13)
  4018e3:       49 83 c5 08             add    $0x8,%r13
  4018e7:       44 39 e5                cmp    %r12d,%ebp
  4018ea:       7f 9c                   jg     401888 <foo+0x48>
  4018ec:       0f 01 f9                rdtscp

As you can see, there are 11 instructions between the read() calls, but 20 are reported in the listing above.

As I understand it, I can initialize the counters through the perf_event system and use the rdpmc instruction later to read the counter. (BTW, did you use the __builtin_ia32_rdpmc gcc intrinsic? I can’t compile it with gcc.)

I simply replaced read() with rdpmc(1<<30), as John mentioned, and found:

Loop iterations 65536, result vector 524288 bytes
      Iter   Average       Min       Max    Median  First 10 values
         0      86.6         0         0         0  0 0 0 0 0 0 0 0 0 0
         1      86.3         0         0         0  0 0 0 0 0 0 0 0 0 0
         2      87.2         0         0         0  0 0 0 0 0 0 0 0 0 0
         3      86.3         0         0         0  0 0 0 0 0 0 0 0 0 0
…
        17      86.3         0         0         0  0 0 0 0 0 0 0 0 0 0
        18      86.3         0         0         0  0 0 0 0 0 0 0 0 0 0
        19      86.9         0         0         0  0 0 0 0 0 0 0 0 0 0
        20      86.2         0         0         0  0 0 0 0 0 0 0 0 0 0

In assembly it looks like:

  4017ec:       b9 00 00 00 40          mov    $0x40000000,%ecx
  4017f1:       48 8d 2c fd 08 00 00    lea    0x8(,%rdi,8),%rbp
  4017f8:       00
  4017f9:       31 ff                   xor    %edi,%edi
  4017fb:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  401800:       0f 33                   rdpmc
  401802:       48 89 d3                mov    %rdx,%rbx
  401805:       49 89 c3                mov    %rax,%r11
  401808:       0f 31                   rdtsc
  40180a:       48 c1 e2 20             shl    $0x20,%rdx
  40180e:       48 8d 04 02             lea    (%rdx,%rax,1),%rax
  401812:       48 01 06                add    %rax,(%rsi)
  401815:       0f 33                   rdpmc
  401817:       48 c1 e2 20             shl    $0x20,%rdx
  40181b:       48 c1 e3 20             shl    $0x20,%rbx
  40181f:       48 8d 04 02             lea    (%rdx,%rax,1),%rax
  401823:       4e 8d 1c 1b             lea    (%rbx,%r11,1),%r11
  401827:       4c 29 d8                sub    %r11,%rax
  40182a:       49 89 04 38             mov    %rax,(%r8,%rdi,1)
  40182e:       48 83 c7 08             add    $0x8,%rdi
  401832:       48 39 ef                cmp    %rbp,%rdi
  401835:       75 c9                   jne    401800 <foo+0x30>
  401837:       0f 01 f9                rdtscp

What did I do wrong? Why do I get zero instead of the pre-initialized HW_INSTRUCTION counter?

I need to measure quite short events in the program, and I need a low-intrusiveness method for, at least, reading the PMU counters.

Sergey

McCalpinJohn · ‎10-19-2015

The test_rdpmc program just reads the current values in PMC 0 and prints out the lower 32-bits and upper 32-bits instead of combining them into a 64-bit value. There is no deep meaning here -- as long as the program does not have an illegal instruction fault then the counters are readable. The results are different because each core has accumulated different counts (and the code does not attempt to read them at the same time anyway -- it uses sched_setaffinity() to bind to one core at a time and then reads counter 0 on that core using a simple inline assembly macro.)
In your results the high-order counts are all 65535, which means that this counter has been set to be very close to the overflow threshold. The smallest of the low-order counts is just over 2^31, which is consistent with the counters being set to the overflow value minus 2^31, so they will overflow and generate an interrupt every 2 billion events. This is a typical use model for sampling-based performance monitoring, but it does make it trickier to use the counters for interval analysis, since they are getting reset frequently (and since the CPU is receiving interrupts to process these overflows fairly frequently).
There is definitely timing overhead in reading the counters, though it varies slightly across processor models. The RDTSC, RDTSCP, and RDPMC calls all take cycles -- somewhere in the range of 25 cycles to 42 cycles.
When counting instructions, I have seen exactly the results I expected from inline RDPMC calls using the fixed-function counter 0. For an unrolled loop that shows 6 instructions for each RDPMC call, I see the counter increment by 6 each time until the end of the loop where the loop control instructions increase the number of instructions to 10 -- also correctly reported. More details below....
It is not clear that you verified that the fixed-function counters were enabled. MSR 0x38d IA32_FIXED_CTR_CTRL and MSR 0x38F IA32_PERF_GLOBAL_CTRL must both be set up correctly to enable the fixed-function counters to operate. This is described in Section 18.2 of Volume 3 of the Intel Architecture Software Developer's Manual.
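The arithmetic behind the reading of the low/high pairs can be checked with a short sketch. The helper names are hypothetical and a 48-bit counter width is assumed. The first function combines the two printed halves, and the second reports how many more events fit before the counter overflows.

```c
#include <stdint.h>

#define PMC_MASK48 ((UINT64_C(1) << 48) - 1)

/* Combine the "low" and "high" halves printed by the test program. */
static uint64_t pmc_value48(uint32_t low, uint32_t high)
{
    return (((uint64_t)high << 32) | low) & PMC_MASK48;
}

/* Number of further increments before a 48-bit counter wraps to zero. */
static uint64_t pmc_events_until_overflow(uint64_t value48)
{
    return (UINT64_C(1) << 48) - value48;
}
```

For the cpu 4 line (low 2163297754, high 65535) this gives 2131669542 events until overflow, slightly less than 2^31 = 2147483648, which matches a counter that was reloaded to the overflow value minus 2^31 and has counted about 15.8 million events since then.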

Example of testing overhead with the RDPMC Fixed Counter 0 (Retired Instructions) event:

The code simply reads the counter repeatedly and saves the values in an array. I keep the number of iterations short so that the array will stay in cache.

One example looks like:

#define rdpmc(counter,low,high) \
     __asm__ __volatile__("rdpmc" \
        : "=a" (low), "=d" (high) \
        : "c" (counter))

    for (j=0; j<NSAMPLES; j++) values64[j] = 0;  // make sure array is in cache

    for (j=0; j<NSAMPLES; j+=8) {
          rdpmc(fixed0, low, high);
          values64[j] = ((unsigned long) high << 32) + (unsigned long) low;
          rdpmc(fixed0, low, high);
          values64[j+1] = ((unsigned long) high << 32) + (unsigned long) low;
          rdpmc(fixed0, low, high);
          values64[j+2] = ((unsigned long) high << 32) + (unsigned long) low;
          rdpmc(fixed0, low, high);
          values64[j+3] = ((unsigned long) high << 32) + (unsigned long) low;
          rdpmc(fixed0, low, high);
          values64[j+4] = ((unsigned long) high << 32) + (unsigned long) low;
          rdpmc(fixed0, low, high);
          values64[j+5] = ((unsigned long) high << 32) + (unsigned long) low;
          rdpmc(fixed0, low, high);
          values64[j+6] = ((unsigned long) high << 32) + (unsigned long) low;
          rdpmc(fixed0, low, high);
          values64[j+7] = ((unsigned long) high << 32) + (unsigned long) low;
      }

With the Intel compiler the first rdpmc+store groups are compiled to:

        rdpmc                                                   #211.0
        movl      %edx, %edx                                    #212.38
        movl      %eax, %eax                                    #212.68
        shlq      $32, %rdx                                     #212.46
        addq      %rdx, %rax                                    #212.68
        movq      %rax, 8+values64(,%rbx,8)                     #212.5

with gcc I see slightly different code -- 5 instructions per invocation:

    rdpmc
    salq    $32, %rdx
    mov %eax, %eax
    leaq    (%rdx,%rax), %rax
    movq    %rax, values64(%rsi)

with a little bit of thought this can be reduced to 3 instructions for the repeated iterations (and 6 instructions for the final iteration in the unrolled loop that requires the extra increment/compare/branch). At one point I ran across a version of gcc that did this automagically, but now I find that I have to write the code with explicit 32-bit stores to get this result:

    rdpmc
    movl    %edx, (%rsi)
    movl    %eax, 4(%rsi)
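A C-level sketch of the "explicit 32-bit stores" idea is given below. It is an assumption about how such code could be written, not the author's source: the struct and function names are hypothetical, and whether a given compiler really emits only three instructions per sample has to be checked in the generated assembly. The layout mirrors the assembly above (EDX at offset 0, EAX at offset 4), so the halves must be recombined after the timed section.

```c
#include <stdint.h>

/* One raw sample: high half (EDX) at offset 0, low half (EAX) at offset 4. */
struct pmc_sample {
    uint32_t high;
    uint32_t low;
};

#if defined(__x86_64__) || defined(__i386__)
/* Read counter `index` and store the two halves without shifting or adding,
 * leaving the 64-bit combination for later. Requires user-level RDPMC. */
static inline void pmc_sample_store(uint32_t index, struct pmc_sample *out)
{
    uint32_t low, high;
    __asm__ __volatile__("rdpmc" : "=a"(low), "=d"(high) : "c"(index));
    out->high = high;
    out->low  = low;
}
#endif

/* Post-processing step, done outside the measured region. */
static uint64_t pmc_sample_value(const struct pmc_sample *s)
{
    return ((uint64_t)s->high << 32) | s->low;
}
```

The design choice is to move the shift and add out of the measured loop, so that the instruction count attributed to each sample is as small as possible.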

Sergey_S_Intel2 · ‎10-20-2015

John,

Thank you for your comments. In the case of #5, I used the init_instructions() procedure shown in the listing above. That example does not explicitly check that the PMU control registers are configured properly.

init_instructions() uses the Linux_perf interface to configure the PMU counter. The idea is to use Linux_perf to configure the counter and the rdpmc instruction to read it.
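On kernels that support user-space reads of perf events, the perf_event_open documentation describes an mmap'ed perf_event_mmap_page whose index, offset and pmc_width fields tell a program which counter to read and how to turn the raw RDPMC value into the event count. The sketch below covers only that arithmetic, with hypothetical helper names; mapping the page, checking the cap_user_rdpmc bit and retrying on the lock sequence counter are left out. It may not apply to older kernels such as the 2.6.32 one mentioned earlier.

```c
#include <stdint.h>

/* The page's `index` field is 0 when the event is not currently on a
 * hardware counter; otherwise index - 1 is the value to load into ECX.
 * Returns 0 and sets *ecx on success, -1 if RDPMC cannot be used now. */
static int perf_rdpmc_index(uint32_t page_index, uint32_t *ecx)
{
    if (page_index == 0)
        return -1;
    *ecx = page_index - 1;
    return 0;
}

/* Sign-extend a raw counter value that is pmc_width bits wide (1..64).
 * Relies on arithmetic right shift of signed values, as gcc provides. */
static int64_t perf_sign_extend_pmc(uint64_t raw, unsigned pmc_width)
{
    int64_t v = (int64_t)(raw << (64 - pmc_width));
    return v >> (64 - pmc_width);
}

/* Event count = page offset + sign-extended raw counter value. */
static int64_t perf_user_count(int64_t offset, uint64_t raw_pmc,
                               unsigned pmc_width)
{
    return offset + perf_sign_extend_pmc(raw_pmc, pmc_width);
}
```

Because the kernel decides which hardware counter backs the event, reading a hard-coded index such as 1<<30 is not guaranteed to return the counter that perf configured; the index reported in the page is the one to use.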

How did you configure “Fixed Counter 0” so that it could be read by rdpmc in the last example?

Sergey

Update.

I found the issue in the code I used to measure the instruction count. It is the __builtin_ia32_rdpmc GCC intrinsic. GCC 5.0 generates rather odd code that uses only one rdpmc instruction and, as a consequence, the difference between the calls becomes zero.

Also, I found out how to use rdpmc directly with linux_perf interface initialization. But this approach raises more questions.

The standard linux_perf way to measure instruction counts is to read the counter value with the read() system call.

The C source loop is

void foo(int loop, long long *tmp, long long *attStart, long long *attEnd, long long *timerData, unsigned long *chip, unsigned long *core, int pId) {
    int i = 0;
    *attStart = rdtsc();
    for(i = 0; i < loop; i++) {
        long long start = read_perf_instructions(pId);
        *tmp += rdtsc();
        long long stop = read_perf_instructions(pId);
        timerData[i] = stop - start;
    }
    *attEnd = rdtscp(chip, core);
}

  402140:       48 8d 74 24 18          lea    0x18(%rsp),%rsi
  402145:       ba 08 00 00 00          mov    $0x8,%edx
  40214a:       89 ef                   mov    %ebp,%edi
  40214c:       48 c7 44 24 18 00 00    movq   $0x0,0x18(%rsp)
  402153:       00 00
  402155:       e8 46 ec ff ff          callq  400da0 <read@plt>
  40215a:       4c 8b 64 24 18          mov    0x18(%rsp),%r12
  40215f:       0f 31                   rdtsc
  402161:       48 c1 e2 20             shl    $0x20,%rdx
  402165:       48 8d 74 24 18          lea    0x18(%rsp),%rsi
  40216a:       89 ef                   mov    %ebp,%edi
  40216c:       48 01 d0                add    %rdx,%rax
  40216f:       48 01 03                add    %rax,(%rbx)
  402172:       ba 08 00 00 00          mov    $0x8,%edx
  402177:       48 c7 44 24 18 00 00    movq   $0x0,0x18(%rsp)
  40217e:       00 00
  402180:       49 83 c6 08             add    $0x8,%r14
  402184:       e8 17 ec ff ff          callq  400da0 <read@plt>
  402189:       48 8b 44 24 18          mov    0x18(%rsp),%rax
  40218e:       4c 29 e0                sub    %r12,%rax
  402191:       49 89 46 f8             mov    %rax,-0x8(%r14)
  402195:       4d 39 ee                cmp    %r13,%r14
  402198:       75 a6                   jne    402140 <_Z3fooiPxS_S_S_PmS0_i+0x40>
  40219a:       0f 01 f9                rdtscp

There are 11 instructions between the read() calls, but the call itself executes some unknown number of instructions. The test shows the number of instructions and the time (in the Average field) spent in one loop iteration.

Loop iterations 65536, result vector 524288 bytes
      Iter   Average       Min       Max    Median  First 10 values
         0    1548.4       790      6282       790  790 790 790 790 790 790 790 790 790 790
         1    1547.2       790      3968       790  790 790 790 790 790 790 790 790 790 790
         2    1546.2       790      3973       790  790 790 790 790 790 790 790 790 790 790
         3    1546.6       790      3973       790  790 790 790 790 790 790 790 790 790 790
         4    1546.0       790      3973       790  790 790 790 790 790 790 790 790 790 790
         5    1544.9       790      3973       790  790 790 790 790 790 790 790 790 790 790
         6    1544.4       790      5040       790  790 790 790 790 790 790 790 790 790 790
         7    1544.4       790      3973       790  790 790 790 790 790 790 790 790 790 790
         8    1544.2       790      3973       790  790 790 790 790 790 790 790 790 790 790
         9    1544.6       790      3973       790  790 790 790 790 790 790 790 790 790 790
        10    1544.4       790      3973       790  790 790 790 790 790 790 790 790 790 790
        11    1540.1       790      6527       790  790 790 790 790 790 790 790 790 790 790
        12    1544.8       790      4053       790  790 790 790 790 790 790 790 790 790 790
        13    1555.3       790      3973       790  790 790 790 790 790 790 790 790 790 790
        14    1556.5       782      3973       790  816 790 790 790 790 790 790 790 790 790
        15    1555.6       790      3973       790  790 790 790 790 790 790 790 790 790 790
        16    1555.9       790      3973       790  790 790 790 790 790 790 790 790 790 790
        17    1553.7       790      3973       790  790 790 790 790 790 790 790 790 790 790
        18    1554.7       790      3973       790  790 790 790 790 790 790 790 790 790 790
        19    1554.8       790      3973       790  790 790 790 790 790 790 790 790 790 790
        20    1554.7       790      4668       790  790 790 790 790 790 790 790 790 790 790

If we replace the read() system call with the rdpmc instruction (inside the read_perf_instructions(pId) function), we see a different picture:

  4020e0:       0f 33                   rdpmc
  4020e2:       49 03 42 10             add    0x10(%r10),%rax
  4020e6:       48 c1 e2 20             shl    $0x20,%rdx
  4020ea:       48 8d 3c 10             lea    (%rax,%rdx,1),%rdi
  4020ee:       0f 31                   rdtsc
  4020f0:       48 c1 e2 20             shl    $0x20,%rdx
  4020f4:       48 01 d0                add    %rdx,%rax
  4020f7:       48 01 06                add    %rax,(%rsi)
  4020fa:       0f 33                   rdpmc
  4020fc:       49 03 42 10             add    0x10(%r10),%rax
  402100:       48 c1 e2 20             shl    $0x20,%rdx
  402104:       49 83 c0 08             add    $0x8,%r8
  402108:       48 01 c2                add    %rax,%rdx
  40210b:       48 29 fa                sub    %rdi,%rdx
  40210e:       49 89 50 f8             mov    %rdx,-0x8(%r8)
  402112:       49 39 d8                cmp    %rbx,%r8
  402115:       75 c9                   jne    4020e0 <_Z3fooiPxS_S_S_PmS0_i+0x30>
  402117:       0f 01 f9                rdtscp

Loop iterations 65536, result vector 524288 bytes
      Iter   Average       Min       Max    Median  First 10 values
         0      85.1         8         8         8  8 8 8 8 8 8 8 8 8 8
         1      86.8         8         9         8  8 8 8 8 8 8 8 8 8 8
         2      84.3         8         8         8  8 8 8 8 8 8 8 8 8 8
         3      86.6         8         8         8  8 8 8 8 8 8 8 8 8 8
         4      85.3         8         9         8  8 8 8 8 8 8 8 8 8 8
         5      85.8         8         8         8  8 8 8 8 8 8 8 8 8 8
         6      86.4         8         9         8  8 8 8 8 8 8 8 8 8 8
         7      84.8         8         8         8  8 8 8 8 8 8 8 8 8 8
         8      86.2         8         9         8  8 8 8 8 8 8 8 8 8 8
         9      85.3         8         9         8  8 8 8 8 8 8 8 8 8 8
        10      86.2         8         9         8  8 8 8 8 8 8 8 8 8 8
        11      85.3         8         9         8  8 8 8 8 8 8 8 8 8 8
        12      85.6         8         9         8  8 8 8 8 8 8 8 8 8 8
        13      85.3         8         8         8  8 8 8 8 8 8 8 8 8 8
        14      84.6         8         8         8  8 8 8 8 8 8 8 8 8 8
        15      86.2         8         9         8  8 8 8 8 8 8 8 8 8 8
        16      86.0         8         9         8  8 8 8 8 8 8 8 8 8 8
        17      86.2         8         9         8  8 8 8 8 8 8 8 8 8 8
        18      85.3         8         9         8  8 8 8 8 8 8 8 8 8 8
        19      85.9         8         9         8  8 8 8 8 8 8 8 8 8 8
        20      86.2         8         9         8  8 8 8 8 8 8 8 8 8 8

This is the expected behavior, but linux_perf has a lot of configuration knobs and I suspect it can behave in unexpected ways in some situations.

The documentation at http://web.eece.maine.edu/~vweaver/projects/perf_events/perf_event_open.html gives this usage example:

do {
    seq = pc->lock;
    barrier();
    enabled = pc->time_enabled;
    running = pc->time_running;

    if (pc->cap_usr_time && enabled != running) {
        cyc = rdtsc();
        time_offset = pc->time_offset;
        time_mult   = pc->time_mult;
        time_shift  = pc->time_shift;
    }

    idx = pc->index;
    count = pc->offset;

    if (pc->cap_usr_rdpmc && idx) {
        width = pc->pmc_width;
        count += rdpmc(idx - 1);
    }

    barrier();
} while (pc->lock != seq);

Why do we need this loop? I didn't use it in my test. Does that mean my test is wrong for an HSW architecture CPU?

Why do we have to add pc->offset? I need the difference between rdpmc measurements. May I just use the rdpmc return value as the counter value since the last counter read?

I don't understand why the time_offset, time_mult, etc. variables are here.

So, I am just looking for an easy way to count simple CPU events from user space.

Sergey

McCalpinJohn · ‎10-20-2015

I don't use the perf events subsystem to set up or manage the performance counters because it incurs a lot of overhead due to counter virtualization. The virtualization serves two purposes: (1) to count separately for each process, and (2) to expand the 48-bit hardware counters to 64 bits.

Virtualization by process can be overridden (e.g. "perf stat" has an option to count "globally"), but virtualization of the counters from 48-bit raw hardware registers to 64-bit virtualized registers cannot be overridden (as far as I know). To create a 64-bit virtual counter, the kernel reads the counters frequently (probably at the 1 millisecond Linux kernel scheduler interrupt, but it does not really matter), and adds the deltas to a 64-bit value that it keeps in memory. That is why you need a "read()" call -- this causes "perf" to read the counter again, compute the delta from the previous read, add the new delta to the 64-bit value in memory, and return the updated 64-bit count.
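
The delta-accumulation scheme described above can be sketched in a few lines of C. This is only an illustration of the idea, not the kernel's actual code: the names `virt_counter` and `virt_counter_update` are hypothetical, and a 48-bit raw counter width is assumed.

```c
#include <assert.h>
#include <stdint.h>

#define PMC_WIDTH 48                                /* assumed raw counter width */
#define PMC_MASK  ((UINT64_C(1) << PMC_WIDTH) - 1)

/* Hypothetical bookkeeping for one virtualized counter. */
struct virt_counter {
    uint64_t last_raw;   /* raw 48-bit value seen at the previous update */
    uint64_t total;      /* accumulated 64-bit count kept in memory */
};

/* Fold a new raw reading into the 64-bit total. The masked subtraction
 * yields the correct delta even if the raw counter wrapped once since
 * the previous update. */
static uint64_t virt_counter_update(struct virt_counter *vc, uint64_t raw_now)
{
    assert((raw_now & ~PMC_MASK) == 0);
    uint64_t delta = (raw_now - vc->last_raw) & PMC_MASK;
    vc->last_raw = raw_now;
    vc->total += delta;
    return vc->total;
}
```

Every update needs a fresh read of the hardware counter, which is why a call into the kernel is involved each time the 64-bit value is requested.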

There is no way that this process can be fast. It is typically at least 500 cycles to get in and out of the kernel (with a very simple driver), and can be a lot more expensive. PAPI overheads are typically in excess of 2000 cycles to read a single counter. Part of that is in PAPI, but a large chunk is in the kernel access required by the underlying "perf events" substrate.

So I use a completely different approach. I program the counters explicitly using the "wrmsr.c" program from the "msrtools-1.2" package (either compiled as a standalone executable or with the important bits imported into my C program). This uses the /dev/cpu/*/msr interfaces which can be read/written by root, avoiding the need for yet another kernel module. These are high-overhead accesses (especially when run as a shell command), but they are only needed for setup, and allow me to use inline RDPMC calls to get raw hardware event counts without any overhead or confusion from virtualization.

There are a modest number of MSRs that must be set up correctly to use this approach, and care must be taken not to break any other process that might be using the performance counters. The MSRs that must be set up include MSR 0x38F IA32_PERF_GLOBAL_CTRL (in all cases), MSR 0x38D IA32_FIXED_CTR_CTRL if you want to use the fixed-function counters, and of course the IA32_PERFEVTSEL* MSRs (0x186, 0x187, 0x188, 0x189 on all recent platforms, and 0x18A, 0x18B, 0x18C, and 0x18D on processors that support 8 counters, which typically requires that HyperThreading be disabled). Some people recommend zeroing the counters, but I don't see any reason for that -- I just take differences between counts and add 2^48 if the counter has wrapped (once), i.e., if the final value is smaller than the initial value. If the timing interval becomes "large", you need to be aware of how fast the counter might increment so that you can compute the longest interval between reads that guarantees no more than 2^48 increments (so you can unambiguously detect and correct wraparounds).
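
A minimal sketch of the wraparound correction and the "longest safe interval" calculation, assuming 48-bit counters. The function names `pmc_delta48` and `pmc_max_interval_seconds` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TWO_POW_48 (UINT64_C(1) << 48)

/* Difference between two raw 48-bit readings, adding 2^48 when the
 * counter wrapped (at most once) between them. */
static uint64_t pmc_delta48(uint64_t start, uint64_t end)
{
    assert(start < TWO_POW_48 && end < TWO_POW_48);
    if (end >= start)
        return end - start;
    return end + TWO_POW_48 - start;
}

/* Longest interval (in seconds) between reads that still guarantees
 * no more than 2^48 increments, given an upper bound on the increment
 * rate in events per second. */
static double pmc_max_interval_seconds(double max_increments_per_second)
{
    return (double)TWO_POW_48 / max_increments_per_second;
}
```

For a core-cycle counter on a 4 GHz core, `pmc_max_interval_seconds(4.0e9)` comes out at roughly 70,000 seconds (about 19.5 hours), so a single wrap is easy to detect for most timed sections.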

Sometimes you can get away with using RDPMC if "perf events" has programmed the counters, but only for very short intervals since you can't be completely sure when/if the counter has been reset by the kernel.

Sergey_S_Intel2 · ‎10-21-2015

John,

I agree with you. I fully support the idea that any virtualization method keeps us far from the real processes. In that case we can only talk about the probability of real CPU events instead of real knowledge of what happened.

Do you have an opinion on why explicit programming of the control registers is a) quite complicated and b) hidden from user level? We have at least two possibilities. First, all these problems come from the HW architecture, and other people (OS programmers) have no chance to make it clear and user friendly. Second, the CPU architects provide the registers and it is the OS developers' caprice that makes them useless (I heard something like this from HW people).

Do you think we need to enlarge the width of the event counter registers (48->64)?

Anyway, could you recommend some examples of how to configure the control registers (explicit counter usage model), assuming I have superuser access?

Sergey

McCalpinJohn · ‎10-21-2015

The performance counters are complicated largely because the hardware is complicated, and secondarily because Intel does not want to expose microarchitectural implementation details without good reason. (Patent trolls can be quite creative at re-interpreting the patents that they own to claim that a big company is violating the patents -- but they need to have some idea of how the processor is implemented to make these claims.)

Some aspects of hardware performance counters probably need to be restricted to elevated privilege levels. For example, configuring the hardware performance counters to generate interrupts has the potential to severely impact system performance and usability. On the other hand, most of what the performance counters do is perfectly safe -- the vendors do a very good job of ensuring that programming random bits into the performance counter control registers is "safe" -- you may not be able to interpret the results, but the processor runs just fine.

I prefer that the hardware performance counters remain as low-level features, and not as registers that get saved and restored on context switches. (It would be hard to use the counters to measure things like context switch overhead if they were swapped in and out.) But leaving the counters as "raw" low-level features means that they cannot easily be shared, and it means that they provide a potentially high-bandwidth covert channel between processes.

In the high performance computing world where I work, systems are seldom time-shared, so we don't really need to worry about either sharing the counters or about covert channels. In the production environment a job is assigned a set of nodes and no other user is allowed access to those nodes for the duration of the job. The nodes are still shared between the OS (and all its subsidiary processes) and the user (and all the auxiliary processes that the user might cause to be started), but since this is the standard mode of operation, dealing with this sharing is part of the performance puzzle that we are trying to understand.

To use the hardware performance counters manually, a variety of tools are needed:

For the hardware performance counters in the processor cores, I build the "rdmsr" and "wrmsr" command-line tools from "msrtools-1.2".
1. I use a script to configure the global configuration registers and the PERFEVTSEL* MSRs for the programmable core counters.
2. For whole-program measurements, I read the counters using the "rdmsr" program before and after the execution (taking care that the run is short enough that the counters can't be incremented more than 2^48 times during the run). You can also use "perf stat" for these sorts of measurements.
3. For interval measurements inside the code, I program the counters using the script, then use the RDPMC instruction to read them at the desired locations in the code.
For the "uncore" counters there are three different interfaces used, depending on the processor model:
1. Some "uncore" counters use MSRs and can be configured using "wrmsr" as above. Unfortunately these can only be read from inside the kernel (since the RDMSR instruction can only be executed at ring 0). If the program is being run by root (or is owned by root and has the setuid bit set), then the program can open the /dev/cpu/*/msr device files and read or write the counters using pread() or pwrite() calls. These are kernel calls so they cost a few thousand cycles each, but there is nothing that can be done about this. (One thing that could help is to build a kernel module that could return multiple MSR values with a single call.)
2. Some "uncore" counters are in "PCI configuration space". The root user can read/write these counters using the "setpci" command-line program. As with the MSR-based counters, a root user can open the device driver files (in /proc/bus/pci, I think) and read/write the counters using pread() and pwrite() calls (limited to 32-bit transactions).
3. Some processors include "uncore" counters in a different range of memory-mapped IO space. Working with these is an advanced topic....
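
As a sketch of the /dev/cpu/*/msr access mentioned in the list above: the Linux msr driver uses the MSR number as the file offset and transfers 8 bytes per access. The helper name `read_msr` is hypothetical, and the code assumes the msr driver is loaded and the caller has sufficient privilege; otherwise it simply reports failure.

```c
#define _GNU_SOURCE          /* for pread() under strict -std modes */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Read one 64-bit MSR on the given logical CPU through /dev/cpu/<cpu>/msr.
 * Returns 0 on success and -1 on failure (no msr driver, insufficient
 * privilege, invalid CPU or MSR number, ...). */
static int read_msr(int cpu, uint32_t msr, uint64_t *value)
{
    assert(value != NULL);

    char path[64];
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* The MSR number is the file offset; each access is 8 bytes. */
    ssize_t n = pread(fd, value, sizeof(*value), (off_t)msr);
    close(fd);
    return (n == (ssize_t)sizeof(*value)) ? 0 : -1;
}
```

A caller running as root could use `read_msr(0, 0x38f, &v)` to check IA32_PERF_GLOBAL_CTRL on core 0. Each call is a kernel entry, so this belongs in setup code and not in timed sections.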

Here is a fairly typical script that I use to set up the counters (edited for clarity):

#!/bin/bash

export NCORES=`cat /proc/cpuinfo | grep -c processor`
echo "Number of cores is $NCORES"
export MAXCORE=`expr $NCORES - 1`

# Enable all counters in IA32_PERF_GLOBAL_CTRL
#   bits 34:32 enable the three fixed function counters
#	bits 7:0 enable the eight programmable counters
echo "Checking IA32_PERF_GLOBAL_CTRL on all cores"
echo "  (should be 00000007000000ff)"
for core in `seq 0 $MAXCORE`
do
	echo -n "$core "
	~/bin/rdmsr -p $core -x -0 0x38f
	~/bin/wrmsr -p $core 0x38f 0x00000007000000ff
done

# Core Performance Counter Event Select MSRs
#   Counter	 MSR
#	   0    0x186
#	   1    0x187
#	   2    0x188
#	   3    0x189
#	   4    0x18a
#	   5    0x18b
#	   6    0x18c
#	   7    0x18d

# Dump all performance counter event select registers on all cores
if [ 0 == 1 ]
then
	echo "Printing out all performance counter event select registers"
	echo "MSR    CORE    CurrentValue"
	for PMC_MSR in 186 187 188 189 18a 18b 18c 18d
	do
		for CORE in `seq 0 $MAXCORE`
		do
			echo -n "$PMC_MSR $CORE "
			~/bin/rdmsr -p $CORE -0 -x 0x${PMC_MSR}
		 done
	done
fi

# Counter 0 Uops Dispatched on Port 0		0x004301a1
# Counter 1 Uops Dispatched on Port 1		0x004302a1
# Counter 2 Uops Dispatched on Port 2		0x004304a1
# Counter 3 Uops Dispatched on Port 3		0x004308a1
# Counter 4 actual core cycles unhalted		0x0043003c
# Counter 5 Uops Dispatched on Port 5		0x004320a1
# Counter 6 cycles with no uops delivered from front end to
#   back end & there is no back end stall	0x0143019c
# Counter 7 Uops issued from RAT to RS		0x0043010e

echo "Programming counters 0-7"
for core in `seq 0 $MAXCORE`
do
	~/bin/wrmsr -p $core 0x186 0x004301a1
	~/bin/wrmsr -p $core 0x187 0x004302a1
	~/bin/wrmsr -p $core 0x188 0x004304a1
	~/bin/wrmsr -p $core 0x189 0x004308a1
	~/bin/wrmsr -p $core 0x18a 0x0043003c
	~/bin/wrmsr -p $core 0x18b 0x004320a1
	~/bin/wrmsr -p $core 0x18c 0x0143019c
	~/bin/wrmsr -p $core 0x18d 0x0043010e
done
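
The hexadecimal values written by the script follow the architectural IA32_PERFEVTSELx layout: event select in bits 7:0, unit mask in bits 15:8, USR in bit 16, OS in bit 17, EN in bit 22, and CMASK in bits 31:24. A small sketch that rebuilds them, with `perfevtsel_encode` as a hypothetical helper name and the assumption that user-mode counting, kernel-mode counting, and the enable bit are always wanted:

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions in IA32_PERFEVTSELx. */
#define EVTSEL_USR (UINT64_C(1) << 16)   /* count in user mode */
#define EVTSEL_OS  (UINT64_C(1) << 17)   /* count in kernel mode */
#define EVTSEL_EN  (UINT64_C(1) << 22)   /* enable the counter */

/* Build an event-select value with USR, OS and EN set. */
static uint64_t perfevtsel_encode(unsigned event, unsigned umask, unsigned cmask)
{
    assert(event <= 0xff && umask <= 0xff && cmask <= 0xff);
    return (uint64_t)event
         | ((uint64_t)umask << 8)
         | EVTSEL_USR | EVTSEL_OS | EVTSEL_EN
         | ((uint64_t)cmask << 24);
}
```

For example, `perfevtsel_encode(0xa1, 0x01, 0)` reproduces the 0x004301a1 written to MSR 0x186, and a CMASK of 1 accounts for the leading 0x01 in 0x0143019c.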

Xiongchao_T_ · ‎07-07-2016

John, thank you for your answer. It is what I need. However, I ran into some trouble when using your code. rdpmc_actual_cycles() works, but rdpmc_reference_cycles() and rdpmc_instructions() always return zero. You mentioned that these counters may not be enabled by default. Is that the reason why I get zeros? How do I enable the counters?

John McCalpin wrote:

In recent Intel processors there are two ways to use the input argument for the RDPMC instruction.

Values of 0 to 3 (or 0 to 7) select one of the programmable performance counters.

Values of 2^30, 2^30+1, and 2^30+2 select one of the "fixed-function" performance counters. Documentation of this use is not very clear, and not particularly easy to find, so I usually just go back to my own code rather than trying to find it in the Intel documents.

The routines below provide access to each of the "fixed function" performance counter events with names that are easier to remember than the corresponding performance counter number.

Note that on some/many systems these fixed-function counters are either not enabled by default or they are enabled and in use by another process (sometimes the BIOS and sometimes the "NMI watchdog" process). If they are in use by another process they are probably configured to generate an interrupt on overflow, and the interrupt handler will reset the counter value every time. For example, the NMI watchdog on Linux systems often uses the "actual cycles" counter set up to overflow every 2 billion cycles (i.e., the counter is reset to (2^48-1 - 2^32) by the interrupt handler). In this case it is still perfectly safe to read the counter and it is still quite useful for measuring over short intervals (i.e., much less than 2 billion cycles) as long as you can do "sanity-checking" on the results and are able to discard the occasional results that are corrupted by the reset of the counter.
// rdpmc_instructions uses a "fixed-function" performance counter to return the count of retired instructions on
//       the current core in the low-order 48 bits of an unsigned 64-bit integer.
unsigned long rdpmc_instructions()
{
   unsigned a, d, c;

   c = (1<<30);
   __asm__ volatile("rdpmc" : "=a" (a), "=d" (d) : "c" (c));

   return ((unsigned long)a) | (((unsigned long)d) << 32);
}

// rdpmc_actual_cycles uses a "fixed-function" performance counter to return the count of actual CPU core cycles
//       executed by the current core.  Core cycles are not accumulated while the processor is in the "HALT" state,
//       which is used when the operating system has no task(s) to run on a processor core.
unsigned long rdpmc_actual_cycles()
{
   unsigned a, d, c;

   c = (1<<30)+1;
   __asm__ volatile("rdpmc" : "=a" (a), "=d" (d) : "c" (c));

   return ((unsigned long)a) | (((unsigned long)d) << 32);
}

// rdpmc_reference_cycles uses a "fixed-function" performance counter to return the count of "reference" (or "nominal")
//       CPU core cycles executed by the current core.  This counts at the same rate as the TSC, but does not count
//       when the core is in the "HALT" state.  If a timed section of code shows a larger change in TSC than in
//       rdpmc_reference_cycles, the processor probably spent some time in a HALT state.
unsigned long rdpmc_reference_cycles()
{
   unsigned a, d, c;

   c = (1<<30)+2;
   __asm__ volatile("rdpmc" : "=a" (a), "=d" (d) : "c" (c));

   return ((unsigned long)a) | (((unsigned long)d) << 32);
}

McCalpinJohn · ‎07-07-2016

The fixed function counters are controlled by two MSRs.

IA32_PERF_GLOBAL_CTRL (MSR 0x38F), bits 34:32 must be set to enable the three fixed-function counters.
- These are usually set by default, but it is a good idea to check anyway.
IA32_FIXED_CTR_CTRL (MSR 0x38D) has three 4-bit fields to control the fixed-function counters.
- For each counter, the bits are:
  - Bit 0 enables counting in kernel mode
  - Bit 1 enables counting in user mode
  - Bit 2 enables counting for any thread running on the core in a system supporting more than one logical processor per physical core
  - Bit 3 enables interrupts on overflow of this counter
- It is very common for the NMI watchdog to use one of these counters.
  - If this is the case then one of the counters will have the "interrupt on overflow" bit enabled.
  - A typical setting for this register is:
    - 0x0b0
    - Fixed Function Counter 2 is disabled (the high-order 4 bits are 0)
    - Fixed Function Counter 1 is enabled for user and kernel counts, and has the interrupt on overflow bit set
    - Fixed Function Counter 0 is disabled (the low-order 4 bits are 0)
- Disabling the NMI watchdog will typically clear the high-order bit of all three fields.
  - Then you can write 0x333 to enable user+kernel mode for all three counters.
- You can still read the counter if the NMI watchdog is using it, but you need to be aware that the counter value will be reset after it overflows. A typical configuration is to set it to overflow every 2 billion cycles, so if your measurements are short, then this won't happen very often.
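
The 4-bit fields described above can be pulled apart with a few shifts and masks. This is a sketch only; `fixed_ctr_cfg` and `fixed_ctr_decode` are hypothetical names, and the value would come from reading MSR 0x38D by whatever means is available.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of one 4-bit field of IA32_FIXED_CTR_CTRL. */
struct fixed_ctr_cfg {
    bool os;          /* bit 0: count in kernel mode */
    bool usr;         /* bit 1: count in user mode */
    bool any_thread;  /* bit 2: count for any thread on the core */
    bool pmi;         /* bit 3: interrupt on overflow */
};

/* Extract the field for fixed-function counter 0, 1 or 2. */
static struct fixed_ctr_cfg fixed_ctr_decode(uint64_t ctrl, unsigned counter)
{
    assert(counter < 3);
    unsigned field = (unsigned)(ctrl >> (4 * counter)) & 0xfu;
    struct fixed_ctr_cfg cfg = {
        .os         = (field & 1u) != 0,
        .usr        = (field & 2u) != 0,
        .any_thread = (field & 4u) != 0,
        .pmi        = (field & 8u) != 0,
    };
    return cfg;
}
```

With the typical value 0x0b0, counter 1 decodes as kernel + user counting with interrupt on overflow, and counters 0 and 2 decode as disabled. With 0x333, all three counters count in both modes and none raises an interrupt.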

A very irritating feature of the Linux kernel is that the "perf stat" command (or similar) will sometimes use these fixed function counters and will disable them on exit. A rational piece of software would check the initial state and restore that state on exit -- but the Linux "perf events" subsystem is nothing like rational....

Xiongchao_T_ · ‎07-07-2016

John, thank you again. BTW, I found that I can enable the counters with the perf_event interface, as shown by Thomas's code.

RDPMC works after this initialization code:

void pmc_enable()
{
  int i, err;
  struct perf_event_attr attr_inst, attr_rcyc;
  int perf_hw_inst, perf_hw_refcyc;
  long long result = 0;

  // Configure the events
  memset(&attr_inst, 0, sizeof(struct perf_event_attr));
  attr_inst.type = PERF_TYPE_HARDWARE;
  attr_inst.size = sizeof(struct perf_event_attr);
  attr_inst.config = PERF_COUNT_HW_INSTRUCTIONS;
  attr_inst.inherit = 1;
  memset(&attr_rcyc, 0, sizeof(struct perf_event_attr));
  attr_rcyc.type = PERF_TYPE_HARDWARE;
  attr_rcyc.size = sizeof(struct perf_event_attr);
  attr_rcyc.config = PERF_COUNT_HW_REF_CPU_CYCLES;
  attr_rcyc.inherit = 1;

  // Due to the setting of attr.inherit, child tasks are counted as well
  perf_hw_inst = perf_event_open(&attr_inst, 0, -1, -1, 0); 
  if (perf_hw_inst < 0) fprintf(stderr, "Failed to start HW_INSTRUCTIONS\n");
  perf_hw_refcyc = perf_event_open(&attr_rcyc, 0, -1, -1, 0); 
  if (perf_hw_refcyc < 0) fprintf(stderr, "Failed to start HW_REF_CPU_CYCLES\n");

  // Resetting counter to zero
  ioctl(perf_hw_inst, PERF_EVENT_IOC_RESET, 0);
  ioctl(perf_hw_refcyc, PERF_EVENT_IOC_RESET, 0);
  // Start counters
  ioctl(perf_hw_inst, PERF_EVENT_IOC_ENABLE, 0); 
  ioctl(perf_hw_refcyc, PERF_EVENT_IOC_ENABLE, 0); 
}

Lawrence_M_Intel · ‎07-11-2016

You need to be careful using RDPMC together with perf events, because the OS maintains its own idea of the correct count. To do this properly you need to mmap some space used by the kernel to store counter information and then follow a specific RDPMC code pattern. Andi Kleen's jevents code does this: https://github.com/andikleen/pmu-tools/tree/master/jevents
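
One arithmetic step of that pattern can be shown in isolation. As I understand the pattern documented with the perf mmap page, the raw RDPMC result is sign-extended from `pmc_width` bits and then added to the kernel's `offset` field, with both fields read inside the `lock` sequence loop so that they belong to the same snapshot. The sketch below covers only the arithmetic; `perf_count_from_raw` is a hypothetical helper and its inputs are passed in explicitly instead of being read from the mapped page.

```c
#include <assert.h>
#include <stdint.h>

/* Combine the kernel's saved offset with a raw RDPMC reading. The raw
 * value is sign-extended from `width` bits before it is added. */
static int64_t perf_count_from_raw(int64_t offset, uint64_t raw, unsigned width)
{
    assert(width >= 1 && width < 64);
    uint64_t mask = (UINT64_C(1) << width) - 1;
    uint64_t sign = UINT64_C(1) << (width - 1);
    /* XOR-and-subtract sign extension avoids shifting a negative value. */
    int64_t pmc = (int64_t)(((raw & mask) ^ sign) - sign);
    return offset + pmc;
}
```

The sign extension matters because the kernel may program the hardware counter with a value that is negative when viewed as a `width`-bit number, with `offset` chosen so that the sum is still the true event count.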

-- Larry

Deepak123 · ‎07-20-2023

I am trying to execute rdpmc in a cloud container. rdpmc needs extra privileges, so it crashes in the cloud because the container doesn't have them. Can jevents help in this scenario to execute rdpmc in the cloud?

Please share your thoughts. Thanks.

McCalpinJohn · ‎07-20-2023

If attempting to execute the RDPMC instruction is causing a crash, it probably means that the OS has disabled access by clearing the configuration bit CR4.PCE. This seems to be the new default for Linux kernels > 4. If this is the case, the old behavior can be restored by "echo 2 > /sys/devices/cpu/rdpmc" (as root).
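
Before executing RDPMC it can be worth checking the current setting from the program itself. The sketch below reads the sysfs file and returns its numeric value; `rdpmc_mode` is a hypothetical helper name, and the path is passed in by the caller. On Linux x86 kernels the file holds 0 (user-space RDPMC disabled), 1 (allowed only for tasks that have a perf event mmapped), or 2 (always allowed).

```c
#include <assert.h>
#include <stdio.h>

/* Return the integer stored in the given sysfs file (expected to be
 * 0, 1 or 2 for /sys/devices/cpu/rdpmc), or -1 if the file cannot be
 * opened or parsed. */
static int rdpmc_mode(const char *path)
{
    assert(path != NULL);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    int mode = -1;
    if (fscanf(f, "%d", &mode) != 1)
        mode = -1;
    fclose(f);
    return mode;
}
```

A program would call `rdpmc_mode("/sys/devices/cpu/rdpmc")` and fall back to read() on the perf file descriptor when the result is not 2 and no perf event is mmapped. In a container the file may be missing or read-only, in which case changing it is up to the host administrator.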

What the kernel developers want you to do is make all performance counter accesses through the perf events library (so you no longer have direct control over the access method). Other performance counter libraries such as Likwid and PAPI know how to call the perf events library.

Kumar_C_ · ‎11-11-2016

Hi All,

I am using Dr. John McCalpin's code, which uses RDPMC to read the instructions and cycles on an Intel Knights Landing core.

#include <stdio.h>

#define rdpmc(counter,low,high) \
     __asm__ __volatile__("rdpmc" \
        : "=a" (low), "=d" (high) \
        : "c" (counter))

#define NSAMPLES 1000

int main()
{
int j;
unsigned long values64[NSAMPLES];
unsigned int fixed0, low, high;
     fixed0= (1<<30)+2;
    for (j=0; j<NSAMPLES; j++) values64[j] = 0;  // make sure array is in cache

    for (j=0; j<NSAMPLES; j+=8) {
          rdpmc(fixed0, low, high);
          values64[j] = ((unsigned long) high << 32) + (unsigned long) low;
          rdpmc(fixed0, low, high);
          values64[j+1] = ((unsigned long) high << 32) + (unsigned long) low;
          rdpmc(fixed0, low, high);
          values64[j+2] = ((unsigned long) high << 32) + (unsigned long) low;
          rdpmc(fixed0, low, high);
          values64[j+3] = ((unsigned long) high << 32) + (unsigned long) low;
          rdpmc(fixed0, low, high);
          values64[j+4] = ((unsigned long) high << 32) + (unsigned long) low;
          rdpmc(fixed0, low, high);
          values64[j+5] = ((unsigned long) high << 32) + (unsigned long) low;
          rdpmc(fixed0, low, high);
          values64[j+6] = ((unsigned long) high << 32) + (unsigned long) low;
          rdpmc(fixed0, low, high);
          values64[j+7] = ((unsigned long) high << 32) + (unsigned long) low;
      }
    for (j=0; j<NSAMPLES; j++) printf(" %d %lu\n", j, values64[j]);  // print the collected samples

}

I have experimented with different fixed0 values in the above code: fixed0=(1<<30)+1 and fixed0=(1<<30). When fixed0=(1<<30), the output value (every element of values64[]) is always 985; when fixed0=(1<<30)+1, the output value is always 6041; and when fixed0=(1<<30)+2, the output value is always 0.

I have repeatedly run the above code, and the numbers quoted above are consistently the same.

I am running the code from user level (without sudo access), and I doubt that the above numbers are correct. Is the actual register being read or not?

I expect the pmc_enable() approach using ioctl to incur system call overhead. Given that I intend to count the instructions of simple code with, say, 100 instructions, this overhead is important to consider.

Please point out any issues with the above code. Thanks.

McCalpinJohn · ‎11-13-2016

You need to read Chapter 18 of Volume 3 of the Intel Architectures Software Developer's Manual to understand the rest of the infrastructure that is required to use the performance counters.

The fixed-function performance counters are the easiest, but even they have two different MSRs that need to be set properly before they will increment.

On most Linux systems the default setting of the IA32_PERF_GLOBAL_CTRL register (MSR 0x38f) enables the fixed-function counters (bits 32,33,34 are each set to 1), but this may be overridden, so it is important to check.
On most Linux systems the default setting of the IA32_FIXED_CTR_CTRL register (MSR 0x38d) does *not* enable the fixed-function counters, or enables only one of them for use by the NMI Watchdog function.
- The bits in this register are described in Sections 18.2.2 and 18.2.3 of Volume 3 of the Intel Architectures Software Developer's Manual.
- If none of the counters are enabled by default, then they can all be enabled to increment in both user and kernel space by writing 0x333 to MSR 0x38d.
  - Be aware that the "perf stat" (or "perf record") facilities assume that they can use the fixed-function counters (without checking to see if they are already in use), and the code stupidly disables the counters after it uses them.
- If bit 3, bit 7, or bit 11 of MSR 0x38d is set, then some process has set up the performance counters to generate an interrupt on overflow. This is usually the NMI Watchdog, but it could be used by other privileged processes (or by the BIOS). If the counter is enabled, but the "interrupt on overflow" bit is set, you can still use RDPMC to read the counters -- but you need to be aware that the counter will be reset every time it overflows. A commonly used approach is to re-set the counter to 2^48-2^32 so that it will overflow and generate an interrupt every 2^32 increments. If you are measuring over intervals that are "short" compared to 2^32 increments, then most of your differences will be OK, but if the counter is re-set during the interval you may get differences that look negative.
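
The reset value and the "sanity check" on short intervals can be written down as follows. This is a sketch under the assumptions stated in the text above (48-bit counters, reload chosen so that the overflow arrives after 2^32 increments); the names `pmc_reload_value` and `interval_is_sane` and the caller-supplied plausibility bound are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PMC_BITS 48   /* assumed counter width */

/* Value a handler would reload so that the counter overflows after
 * 2^period_log2 further increments (2^48 - 2^32 for period_log2 == 32). */
static uint64_t pmc_reload_value(unsigned period_log2)
{
    assert(period_log2 < PMC_BITS);
    return (UINT64_C(1) << PMC_BITS) - (UINT64_C(1) << period_log2);
}

/* Accept a measured interval only if the counter did not move backwards
 * (which indicates a reset in between) and the delta does not exceed a
 * caller-supplied upper bound on what the timed code could produce. */
static bool interval_is_sane(uint64_t start, uint64_t end, uint64_t max_plausible)
{
    return end >= start && (end - start) <= max_plausible;
}
```

Samples for which `interval_is_sane` returns false are the occasional measurements spoiled by a counter reset and can simply be discarded.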

Kumar_C_ · ‎11-17-2016

Thanks Dr. McCalpin.

The following is the code that seems to be working:

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>
#include <asm/unistd.h>

static long
perf_event_open (struct perf_event_attr *hw_event, pid_t pid,
		 int cpu, int group_fd, unsigned long flags)
{
  int ret;

  ret = syscall (__NR_perf_event_open, hw_event, pid, cpu, group_fd, flags);
  return ret;
}

#define rdpmc(counter,low,high) \
     __asm__ __volatile__("rdpmc" \
        : "=a" (low), "=d" (high) \
        : "c" (counter))



int
main ()
{
  unsigned long values1, values2;
  unsigned int fixed0, low, high;
  struct perf_event_attr pe;
  int fd, i;

  fixed0 = (1 << 30);

  memset (&pe, 0, sizeof (struct perf_event_attr));
  pe.type = PERF_TYPE_HARDWARE;
  pe.size = sizeof (struct perf_event_attr);
  pe.config = PERF_COUNT_HW_INSTRUCTIONS;
  pe.disabled = 1;
  pe.exclude_kernel = 0;
  pe.exclude_hv = 0;
  pe.exclude_idle = 0;

  fd = perf_event_open (&pe, 0, -1, -1, 0);
  if (fd == -1)
    {
      fprintf (stderr, "Error opening leader %llx\n", pe.config);
      exit (EXIT_FAILURE);
    }
  for (i = 1; i <= 50; i++)
    {
      ioctl (fd, PERF_EVENT_IOC_RESET, 0);
      ioctl (fd, PERF_EVENT_IOC_ENABLE, 0);

      rdpmc (fixed0, low, high);
      values1 = ((unsigned long) high << 32) + (unsigned long) low;

      //test ()

      rdpmc (fixed0, low, high);
      values2 = ((unsigned long) high << 32) + (unsigned long) low;

      ioctl (fd, PERF_EVENT_IOC_DISABLE, 0);
      printf (" %lu\n", values2);
    }
  close (fd);
  return 0;
}

i. How can I measure two events (INSTRUCTIONS and REFERENCE_CYCLES) at the same time using rdpmc?

ii. If pe.exclude_kernel = 0, pe.exclude_hv = 0, and pe.exclude_idle = 0 are used for the measurement, does it account for all instructions and cycles, including OS daemons, interrupts, etc.?

iii. I have used the above code inside an MPI program, and the measurements within the MPI processes appear reasonable. Do I need to pass any special input parameters to perf_event_open()?

McCalpinJohn · ‎11-18-2016

I have never used the perf events interface, so I don't know any of the details of how it is configuring the HW or SW.... (That is the primary reason why I don't use it -- it takes longer to figure out what it is measuring than it takes for me to set up exactly what I want manually.)

A couple of issues that may be related to what you are working on:

- "perf events" has hooks in the OS scheduler code so that it can save and restore the counter programming and the counter counts at context switches.
- "perf events" code also executes periodically (even without context switches) to read the counters and accumulate the deltas into a 64-bit "virtual counter" that won't overflow. This is probably integrated into the scheduler interrupt handler code, but it is possible to implement with an independent timer-based interrupt.
- "perf events" does not appear to have a 1:1 mapping between event names (e.g., PERF_COUNT_HW_INSTRUCTIONS) and hardware configuration. My understanding is that when an event can be counted by either a fixed-function counter or by a programmable counter, perf_events will use the fixed-function counter if it is not currently in use (e.g., by the NMI Watchdog), and will use a programmable counter if the corresponding fixed-function counter is busy. The events should give the same results using either counter interface, but this behavior makes it harder to understand which control registers are being modified.
- Both the fixed-function and programmable performance counters have a bit to enable counting in user mode and a bit to enable counting in kernel mode. These clearly work as intended at the large scale, but interrupt & exception handlers will have some instructions in user mode and some in kernel mode, so detailed counts may differ from expectations.
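For the fixed-function counters, the user-mode and kernel-mode enable bits mentioned in the last item live in the fixed-counter control register (MSR 0x38d), which has one 4-bit field per counter. The sketch below decodes one field from a value that has already been read by some privileged means; `struct fixed_ctr_mode` and `decode_fixed_ctr` are hypothetical names used only for illustration, and the AnyThread bit (bit 2 of each field) is ignored.

```c
#include <stdint.h>

/* Decoded view of one 4-bit field of the fixed-counter control MSR (0x38d). */
struct fixed_ctr_mode {
    int counts_kernel;    /* field bit 0: count while in ring 0 */
    int counts_user;      /* field bit 1: count while in rings 1-3 */
    int pmi_on_overflow;  /* field bit 3: interrupt on overflow */
};

/* Hypothetical helper: extract the settings for fixed counter n (0, 1 or 2). */
struct fixed_ctr_mode decode_fixed_ctr(uint64_t ctrl, unsigned n)
{
    unsigned field = (unsigned)(ctrl >> (4 * n)) & 0xF;
    struct fixed_ctr_mode m;

    m.counts_kernel = (field >> 0) & 1;
    m.counts_user = (field >> 1) & 1;
    m.pmi_on_overflow = (field >> 3) & 1;
    return m;
}
```

For example, a control value of 0x0b3 would mean that fixed counter 0 counts in both user and kernel mode without interrupts, fixed counter 1 counts in both modes with interrupt on overflow (bit 7 set), and fixed counter 2 is disabled.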

The fixed-function counters are independent of each other, so as long as they are enabled you can read any or all of them. Again, I don't know how to do this with the perf_events interface.
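As a rough sketch of reading two fixed-function counters back to back with RDPMC (not through the perf_events interface): bit 30 of ECX selects the fixed-function set and the low bits select the counter number. The helper names below are hypothetical. The code assumes the counters are already enabled and that the kernel allows user-mode RDPMC; otherwise the instruction faults.

```c
#include <stdint.h>

/* RDPMC reads a fixed-function counter when bit 30 of ECX is set;
 * the low bits give the counter number. */
#define RDPMC_FIXED_FLAG (1u << 30)

enum fixed_ctr {
    FIXED_INST_RETIRED = 0,  /* instructions retired */
    FIXED_CORE_CYCLES  = 1,  /* unhalted core cycles */
    FIXED_REF_CYCLES   = 2   /* unhalted reference cycles */
};

/* Hypothetical helper: build the ECX value for fixed counter n. */
static inline uint32_t fixed_ctr_index(enum fixed_ctr n)
{
    return RDPMC_FIXED_FLAG | (uint32_t)n;
}

#if defined(__x86_64__) || defined(__i386__)
/* Faults unless user-mode RDPMC has been enabled by the kernel. */
static inline uint64_t read_pmc(uint32_t index)
{
    uint32_t lo, hi;

    __asm__ __volatile__("rdpmc" : "=a"(lo), "=d"(hi) : "c"(index));
    return ((uint64_t)hi << 32) | lo;
}

/* Read instructions retired and reference cycles one after the other. */
static inline void read_inst_and_ref(uint64_t *inst, uint64_t *ref)
{
    *inst = read_pmc(fixed_ctr_index(FIXED_INST_RETIRED));
    *ref  = read_pmc(fixed_ctr_index(FIXED_REF_CYCLES));
}
#endif
```

Calling `read_inst_and_ref` before and after the region of interest and subtracting gives both deltas for the same interval. The two reads are not simultaneous, so the second counter includes the few cycles taken by the first read.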

For MPI code you should not need to do anything special. MPI functions will have a combination of user-space and kernel-space activity, just like any other IO. Pinning the thread under observation to a single logical processor is almost always a good idea when using performance counters.
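One way to do the pinning from inside the program on Linux is `sched_setaffinity`. This is a minimal sketch, assuming glibc; `pin_to_cpu` is a hypothetical helper name, and the choice of which logical processor to use for each MPI rank is left to the caller.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* glibc needs this for CPU_SET and sched_setaffinity */
#endif
#include <sched.h>

/*
 * Hypothetical helper: restrict the calling thread to one logical CPU.
 * Returns 0 on success, -1 on failure (errno is set by sched_setaffinity).
 */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);  /* pid 0 = calling thread */
}
```

The call should be made before the first counter read, so that both reads of an interval come from the same core's counters. Many MPI launchers can also bind ranks to cores, which may make an explicit call unnecessary.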

How to read performance counters by rdpmc instruction?