Modern CPUs have quite a lot of performance counters. How can I read them? I know of many performance monitoring and profiling programs and libraries (PAPI, VTune, profiler, linux_perf, etc.), but all of these methods require additional computation time (intrusiveness).
Also, modern Intel CPUs support the rdpmc instruction, and Linux (currently) supports this instruction at user level.
I would like to understand how to use a GCC intrinsic to get CPU cycles and instructions executed in order to profile some function in my C code.
I understand that I have to pin program execution to a particular CPU core. Let's assume the CPU is Haswell.
I would appreciate a small example of rdpmc usage.
For example, the code might look like this:
#include <stdio.h>

void foo(void);  /* function under test, defined elsewhere */

long long get_cycles()
{
    unsigned int a = 0, d = 0;
    int ecx = (1<<30)+1;  // Which counter does this select?
    __asm __volatile("rdpmc" : "=a"(a), "=d"(d) : "c"(ecx));
    return ((long long)a) | (((long long)d) << 32);
}

int main (int argc, char* argv[])
{
    long long start, finish;
    int i;
    start = get_cycles();
    for (i = 0; i < 1000; i++) {
        foo();
    }
    finish = get_cycles();
    /* the value printed is a double, so use %f rather than %ld */
    printf("Cycles per iteration : %f\n", ((double)(finish-start))/1000.0);
    return 0;
}
What must the ecx variable in get_cycles() contain to provide CPU cycles and instructions executed?
Thank you
The performance counters are complicated largely because the hardware is complicated, and secondarily because Intel does not want to expose microarchitectural implementation details without good reason. (Patent trolls can be quite creative at re-interpreting the patents that they own to claim that a big company is violating the patents -- but they need to have some idea of how the processor is implemented to make these claims.)
Some aspects of hardware performance counters probably need to be restricted to elevated privilege levels. For example, configuring the hardware performance counters to generate interrupts has the potential to severely impact system performance and usability. On the other hand, most of what the performance counters do is perfectly safe -- the vendors do a very good job of ensuring that programming random bits into the performance counter control registers is "safe" -- you may not be able to interpret the results, but the processor runs just fine.
I prefer that the hardware performance counters remain as low-level features, and not as registers that get saved and restored on context switches. (It would be hard to use the counters to measure things like context switch overhead if they were swapped in and out.) But leaving the counters as "raw" low-level features means that they cannot easily be shared, and it means that they provide a potentially high-bandwidth covert channel between processes.
In the high performance computing world where I work, systems are seldom time-shared, so we don't really need to worry about either sharing the counters or about covert channels. In the production environment a job is assigned a set of nodes and no other user is allowed access to those nodes for the duration of the job. The nodes are still shared between the OS (and all its subsidiary processes) and the user (and all the auxiliary processes that the user might cause to be started), but since this is the standard mode of operation, dealing with this sharing is part of the performance puzzle that we are trying to understand.
To use the hardware performance counters manually, a variety of tools are needed:
- For the hardware performance counters in the processor cores, I build the "rdmsr" and "wrmsr" command-line tools from "msr-tools-1.2".
- I use a script to configure the global configuration registers and the PERFEVTSEL* MSRs for the programmable core counters.
- For whole-program measurements, I read the counters using the "rdmsr" program before and after the execution (taking care that the run is short enough that the counters can't be incremented more than 2^48 times during the run). You can also use "perf stat" for these sorts of measurements.
- For interval measurements inside the code, I program the counters using the script, then use the RDPMC instruction to read them at the desired locations in the code.
- For the "uncore" counters there are three different interfaces used, depending on the processor model:
- Some "uncore" counters use MSRs and can be configured using "wrmsr" as above. Unfortunately these can only be read from inside the kernel (since the RDMSR instruction can only be executed at ring 0). If the program is being run by root (or is owned by root and has the setuid bit set), then the program can open the /dev/cpu/*/msr device files and read or write the counters using pread() or pwrite() calls. These are kernel calls so they cost a few thousand cycles each, but there is nothing that can be done about this. (One thing that could help is to build a kernel module that could return multiple MSR values with a single call.)
- Some "uncore" counters are in "PCI configuration space". The root user can read/write these counters using the "setpci" command-line program. As with the MSR-based counters, a root user can open the device driver files (in /proc/bus/pci, I think) and read/write the counters using pread() and pwrite() commands (limited to 32-bit transactions).
- Some processors include "uncore" counters in a different range of memory-mapped IO space. Working with these is an advanced topic....
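The /dev/cpu/*/msr access path mentioned above can be sketched in C roughly as follows. This is a minimal sketch, assuming the Linux msr driver is loaded and the caller has permission to open the device file; the function name read_msr is hypothetical and not part of any library.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* makes pread() visible under strict -std= modes */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read one 64-bit MSR on logical CPU `cpu` through
 * /dev/cpu/<cpu>/msr.  Returns 0 on success and -1 on any failure
 * (driver not loaded, no permission, CPU number out of range, ...). */
int read_msr(int cpu, uint32_t msr, uint64_t *value)
{
    char path[64];
    int fd;
    ssize_t n;

    assert(value != NULL);
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    /* The file offset selects the MSR address; each read is 8 bytes. */
    n = pread(fd, value, sizeof(*value), (off_t)msr);
    close(fd);
    return (n == (ssize_t)sizeof(*value)) ? 0 : -1;
}
```

For example, read_msr(0, 0x38f, &v) would fetch IA32_PERF_GLOBAL_CTRL on core 0. Each call is a system call, so this path has the same overhead of a few thousand cycles described above.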
Here is a fairly typical script that I use to set up the counters (edited for clarity):
#!/bin/bash

export NCORES=`cat /proc/cpuinfo | grep -c processor`
echo "Number of cores is $NCORES"
export MAXCORE=`expr $NCORES - 1`

# Enable all counters in IA32_PERF_GLOBAL_CTRL
#   bits 34:32 enable the three fixed-function counters
#   bits 7:0 enable the eight programmable counters
echo "Checking IA32_PERF_GLOBAL_CTRL on all cores"
echo "  (should be 00000007000000ff)"
for core in `seq 0 $MAXCORE`
do
    echo -n "$core "
    ~/bin/rdmsr -p $core -x -0 0x38f
    ~/bin/wrmsr -p $core 0x38f 0x00000007000000ff
done

# Core Performance Counter Event Select MSRs
#   Counter   MSR
#      0      0x186
#      1      0x187
#      2      0x188
#      3      0x189
#      4      0x18a
#      5      0x18b
#      6      0x18c
#      7      0x18d

# Dump all performance counter event select registers on all cores
if [ 0 == 1 ]
then
    echo "Printing out all performance counter event select registers"
    echo "MSR CORE CurrentValue"
    for PMC_MSR in 186 187 188 189 18a 18b 18c 18d
    do
        for CORE in `seq 0 $MAXCORE`
        do
            echo -n "$PMC_MSR $CORE "
            ~/bin/rdmsr -p $CORE -0 -x 0x${PMC_MSR}
        done
    done
fi

# Counter 0  Uops Dispatched on Port 0                     0x004301a1
# Counter 1  Uops Dispatched on Port 1                     0x004302a1
# Counter 2  Uops Dispatched on Port 2                     0x004304a1
# Counter 3  Uops Dispatched on Port 3                     0x004308a1
# Counter 4  actual core cycles unhalted                   0x0043003c
# Counter 5  Uops Dispatched on Port 5                     0x004320a1
# Counter 6  cycles with no uops delivered from front end
#            to back end & there is no back end stall      0x0143019c
# Counter 7  Uops issued from RAT to RS                    0x0043010e
echo "Programming counters 0-7"
for core in `seq 0 $MAXCORE`
do
    ~/bin/wrmsr -p $core 0x186 0x004301a1
    ~/bin/wrmsr -p $core 0x187 0x004302a1
    ~/bin/wrmsr -p $core 0x188 0x004304a1
    ~/bin/wrmsr -p $core 0x189 0x004308a1
    ~/bin/wrmsr -p $core 0x18a 0x0043003c
    ~/bin/wrmsr -p $core 0x18b 0x004320a1
    ~/bin/wrmsr -p $core 0x18c 0x0143019c
    ~/bin/wrmsr -p $core 0x18d 0x0043010e
done
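The hex values written to the PERFEVTSEL MSRs in the script pack several fields into one register. As a rough illustration, the sketch below rebuilds those values from an event number, a unit mask and a counter mask, using the architectural IA32_PERFEVTSELx layout (event in bits 7:0, umask in bits 15:8, USR in bit 16, OS in bit 17, EN in bit 22, cmask in bits 31:24). The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define EVTSEL_USR (UINT32_C(1) << 16)  /* count in user mode (ring 3) */
#define EVTSEL_OS  (UINT32_C(1) << 17)  /* count in kernel mode (ring 0) */
#define EVTSEL_EN  (UINT32_C(1) << 22)  /* enable the counter */

/* Hypothetical helper: build a PERFEVTSEL value that counts in both
 * user and kernel mode, as every value in the script does. */
uint32_t evtsel_encode(uint8_t event, uint8_t umask, uint8_t cmask)
{
    return (uint32_t)event
         | ((uint32_t)umask << 8)
         | EVTSEL_USR | EVTSEL_OS | EVTSEL_EN
         | ((uint32_t)cmask << 24);
}

/* Field extractors for reading a value back. */
uint8_t evtsel_event(uint32_t v) { return (uint8_t)(v & 0xff); }
uint8_t evtsel_umask(uint32_t v) { return (uint8_t)((v >> 8) & 0xff); }
uint8_t evtsel_cmask(uint32_t v) { return (uint8_t)((v >> 24) & 0xff); }
```

With this layout, evtsel_encode(0xa1, 0x01, 0) gives 0x004301a1 (counter 0 in the script) and evtsel_encode(0x9c, 0x01, 1) gives 0x0143019c (counter 6, the only entry with a non-zero counter mask).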
In recent Intel processors there are two ways to use the input argument for the RDPMC instruction.
Values of 0 to 3 (or 0 to 7) select one of the programmable performance counters.
Values of 2^30, 2^30+1, and 2^30+2 select one of the "fixed-function" performance counters. Documentation of this use is not very clear, and not particularly easy to find, so I usually just go back to my own code rather than trying to find it in the Intel documents.
The routines below provide access to each of the "fixed function" performance counter events with names that are easier to remember than the corresponding performance counter number.
Note that on some/many systems these fixed-function counters are either not enabled by default or they are enabled and in use by another process (sometimes the BIOS and sometimes the "NMI watchdog" process). If they are in use by another process they are probably configured to generate an interrupt on overflow, and the interrupt handler will reset the counter value every time. For example, the NMI watchdog on Linux systems often uses the "actual cycles" counter set up to overflow every 2 billion cycles (i.e., the counter is reset to (2^48-1 - 2^31) by the interrupt handler). In this case it is still perfectly safe to read the counter and it is still quite useful for measuring over short intervals (i.e., much less than 2 billion cycles) as long as you can do "sanity-checking" on the results and are able to discard the occasional results that are corrupted by the reset of the counter.
// rdpmc_instructions uses a "fixed-function" performance counter to return
// the count of retired instructions on the current core in the low-order
// 48 bits of an unsigned 64-bit integer.
unsigned long rdpmc_instructions()
{
    unsigned a, d, c;

    c = (1<<30);
    __asm__ volatile("rdpmc" : "=a" (a), "=d" (d) : "c" (c));

    return ((unsigned long)a) | (((unsigned long)d) << 32);
}

// rdpmc_actual_cycles uses a "fixed-function" performance counter to return
// the count of actual CPU core cycles executed by the current core. Core
// cycles are not accumulated while the processor is in the "HALT" state,
// which is used when the operating system has no task(s) to run on a
// processor core.
unsigned long rdpmc_actual_cycles()
{
    unsigned a, d, c;

    c = (1<<30)+1;
    __asm__ volatile("rdpmc" : "=a" (a), "=d" (d) : "c" (c));

    return ((unsigned long)a) | (((unsigned long)d) << 32);
}

// rdpmc_reference_cycles uses a "fixed-function" performance counter to
// return the count of "reference" (or "nominal") CPU core cycles executed
// by the current core. This counts at the same rate as the TSC, but does
// not count when the core is in the "HALT" state. If a timed section of
// code shows a larger change in TSC than in rdpmc_reference_cycles, the
// processor probably spent some time in a HALT state.
unsigned long rdpmc_reference_cycles()
{
    unsigned a, d, c;

    c = (1<<30)+2;
    __asm__ volatile("rdpmc" : "=a" (a), "=d" (d) : "c" (c));

    return ((unsigned long)a) | (((unsigned long)d) << 32);
}
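The "sanity-checking" mentioned above can be as simple as computing the interval modulo 2^48 and rejecting anything larger than the section could plausibly have taken. The following is only a sketch under that assumption; the names pmc_delta and pmc_delta_plausible, and the idea of a caller-supplied upper bound, are illustrative and not taken from my own measurement code.

```c
#include <assert.h>
#include <stdint.h>

#define PMC_WIDTH 48
#define PMC_MASK  ((UINT64_C(1) << PMC_WIDTH) - 1)

/* Difference between two raw counter readings, modulo 2^48, so that a
 * plain wrap from 2^48-1 to 0 between the readings is handled. */
uint64_t pmc_delta(uint64_t start, uint64_t stop)
{
    return (stop - start) & PMC_MASK;
}

/* Returns 1 if the interval is no larger than max_expected, 0 otherwise.
 * If an interrupt handler reloaded the counter between the two readings,
 * the modular difference comes out close to 2^48 and is rejected. */
int pmc_delta_plausible(uint64_t start, uint64_t stop, uint64_t max_expected)
{
    return pmc_delta(start, stop) <= max_expected;
}
```

A measurement loop would then keep a sample only when pmc_delta_plausible() returns 1, with max_expected chosen well below the reload period of whatever else is using the counter.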
John,
Thank you very much. I completely agree with you: the Intel documentation has no clear explanation of how to use the rdpmc instruction.
As I understand it, these counter numbers depend on the CPU family, which can be detected with the cpuid instruction. As described for the "linux_perf" interface, there are some "common" counters that are supported on many CPUs (not only Intel).
Could you please share these counters if you have such information?
To summarize the previous info:
enum CPUCounters {
    // Count of retired instructions on the current core in the low-order
    // 48 bits of an unsigned 64-bit integer.
    cpuCOUNT_HW_INSTRUCTIONS = 1<<30,

    // Count of actual CPU core cycles executed by the current core. Core
    // cycles are not accumulated while the processor is in the "HALT"
    // state, which is used when the operating system has no task(s) to
    // run on a processor core.
    cpuCOUNT_HW_CPU_CYCLES = (1<<30)+1,

    // Count of "reference" (or "nominal") CPU core cycles executed by the
    // current core. This counts at the same rate as the TSC, but does not
    // count when the core is in the "HALT" state. If a timed section of
    // code shows a larger change in TSC than in rdpmc_reference_cycles,
    // the processor probably spent some time in a HALT state.
    cpuCOUNT_HW_REF_CPU_CYCLES = (1<<30)+2,

    cpuSIZE
};
It would also be interesting to know how to detect whether a counter is being used by another program. In the case of the watchdog, I can detect it by reading the /proc/sys/kernel/nmi_watchdog file on Linux. Is there any general way to find out whether a particular counter is in use by some other process?
How can I clear (set to zero) these counters?
Modern Linux kernels allow rdpmc at user level. If this instruction is run on relatively old kernels, the program crashes.
How can I detect at runtime whether the rdpmc instruction can be executed? As I understand it, I have to read some special bit of a special register, but I have no example of how to do it.
Thank you very much
Sergey
(1) To see if the RDPMC instruction is allowed at runtime, just try to use it and build an exception handler to catch the signal if one is thrown. The test code I use is:
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <signal.h>
#include <sched.h>
#include <string.h>
#include <errno.h>

#define FATAL(fmt,args...) do { \
    ERROR(fmt, ##args); \
    exit(1); \
} while (0)

#define ERROR(fmt,args...) \
    fprintf(stderr, fmt, ##args)

#define rdpmc(counter,low,high) \
    __asm__ __volatile__("rdpmc" \
        : "=a" (low), "=d" (high) \
        : "c" (counter))

int cpu, nr_cpus;

void handle ( int sig )
{
    FATAL("cpu %d: caught %d\n", cpu, sig);
}

int main ( int argc, char *argv[] )
{
    nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
    for (cpu = 0; cpu < nr_cpus; cpu++) {

        pid_t pid = fork();
        if (pid == 0) {
            cpu_set_t cpu_set;
            CPU_ZERO(&cpu_set);
            CPU_SET(cpu, &cpu_set);
            if (sched_setaffinity(pid, sizeof(cpu_set), &cpu_set) < 0)
                FATAL("cannot set cpu affinity: %m\n");

            signal(SIGSEGV, &handle);

            unsigned int low, high;
            rdpmc(0, low, high);
            ERROR("cpu %d: low %u, high %u\n", cpu, low, high);
            break;
        }
    }

    return 0;
}
(2) The fixed-function performance counters are the same on all recent Intel processors. As part of the "Architectural Performance Monitoring" facility, they should not change (at least not very often!).
The fixed-function counters are described in Section 18.2 of Volume 3 of the Intel Architectures Software Developer's Manual (document 325384, revision 055). Recent processors typically support Architectural Performance Monitoring Version 3, which is described in Section 18.2.3, but looking through Chapter 19 of Volume 3 of the Intel Architectures Software Developer's Manual, it looks like these fixed events are the same all the way back to the Core processor and are also supported on the Atom processors (as well as all of the more recent processors).
The specific events counted by the three fixed-function architectural performance counters are described in Table 19-2 of Section 19.1 "Architectural Performance Monitoring Events". The assignment of function to the MSR address of the fixed-function counters is definitely fixed and the fixed-function counters are referred to as IA32_FIXED_CTR0, IA32_FIXED_CTR1, and IA32_FIXED_CTR2. It seems extremely unlikely that Intel would change the mapping of the RDPMC counter numbers to access these using anything other than the obvious approach of (1<<30), (1<<30)+1, and (1<<30)+2.
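When these selector values are written out in C, the parentheses matter because + binds more tightly than <<. A small sketch of the ECX encoding, with hypothetical helper names and assuming at most three fixed-function and eight programmable counters as on the processors discussed here:

```c
#include <assert.h>
#include <stdint.h>

/* Bit 30 of ECX selects the fixed-function counter set for RDPMC. */
#define RDPMC_FIXED_FLAG (UINT32_C(1) << 30)

/* ECX value for fixed-function counter n:
 * 0 = instructions retired, 1 = actual core cycles, 2 = reference cycles.
 * Note that an unparenthesized `1u << 30 + n` would parse as
 * `1u << (30 + n)`, which is a different value. */
uint32_t rdpmc_fixed(unsigned n)
{
    assert(n < 3);
    return RDPMC_FIXED_FLAG + n;
}

/* ECX value for programmable counter n (0..3, or 0..7 on some parts). */
uint32_t rdpmc_programmable(unsigned n)
{
    assert(n < 8);
    return n;
}
```

The returned value is what goes into the "c" constraint of the inline assembly in the routines posted earlier.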
(3) Control of the counters is through MSRs. The MSRs relating to performance counters are described in Chapters 18 and 35 of Volume 3 of the Intel Software Developer's Manual.
- Linux exposes a device driver for the MSRs via the /dev/cpu/*/msr interfaces.
- The command-line tools "rdmsr" and "wrmsr" from "msr-tools-1.2" provide an easy to use interface to read and write MSRs.
- By default, root access is required to read or write the /dev/cpu/*/msr device drivers.
- You can run "rdmsr" and "wrmsr" from a root account, or
- You can chgrp the /dev/cpu/*/msr files to a group that your user account belongs to and then chmod the /dev/cpu/*/msr files to give group read/write permissions, or
- You can change the ownership of the rdmsr and wrmsr binaries to root and mark them as "setuid", or
- You can write your own loadable kernel module to do exactly what you need.
- In the most recent versions of Linux you may also need to fiddle with "capabilities" to enable access -- I don't know how this works.
(4) There is no general "reservation" mechanism for the counters, but it is pretty easy to tell if the fixed-function counters are in use.
- First you need to look at the IA32_PERF_GLOBAL_CTRL MSR (0x38F) to see if the counters are globally enabled. This is described in each of the subsections of Section 18.2 (Architectural Performance Monitoring) of Volume 3 of the Software Developer's Guide, as well as in Chapter 35. There is one bit to enable each of the fixed-function counters and one bit to enable each of the programmable counters.
- Next you need to check the IA32_FIXED_CTR_CTRL MSR (0x38D) (described in the same places). This MSR determines whether the event counts in user mode or kernel mode or both, whether the event counts for only the logical processor that programmed it or for both logical processors that share a physical core (when HyperThreading is enabled), and whether the counter generates a Performance Monitor Interrupt (PMI) when it overflows its 48-bit range.
- A fixed-function counter is almost certainly in use if its PMI bit in the IA32_FIXED_CTR_CTRL MSR is set.
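The two checks above can be sketched as a decoder for the raw MSR values (obtained, for example, with rdmsr). This is an illustration only: the struct and function names are hypothetical, and the bit layout assumed is the architectural one, with a 4-bit field per counter in IA32_FIXED_CTR_CTRL (bit 0 ring-0 enable, bit 1 user enable, bit 2 AnyThread, bit 3 PMI) and enable bits 32..34 in IA32_PERF_GLOBAL_CTRL.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded state of one fixed-function counter. */
struct fixed_ctr_status {
    int globally_enabled;  /* IA32_PERF_GLOBAL_CTRL bit 32+n */
    int counts_os;         /* counts in ring 0 */
    int counts_usr;        /* counts in user mode */
    int any_thread;        /* counts for both logical processors */
    int pmi;               /* interrupt on overflow: likely "in use" */
};

/* Hypothetical helper: decode the state of fixed-function counter n (0..2)
 * from the raw values of MSR 0x38F and MSR 0x38D. */
struct fixed_ctr_status decode_fixed_ctr(uint64_t global_ctrl,
                                         uint64_t fixed_ctr_ctrl,
                                         unsigned n)
{
    struct fixed_ctr_status s;
    unsigned field;

    assert(n < 3);
    field = (unsigned)((fixed_ctr_ctrl >> (4 * n)) & 0xf);
    s.globally_enabled = (int)((global_ctrl >> (32 + n)) & 1);
    s.counts_os  = (int)(field & 1);
    s.counts_usr = (int)((field >> 1) & 1);
    s.any_thread = (int)((field >> 2) & 1);
    s.pmi        = (int)((field >> 3) & 1);
    return s;
}
```

For instance, with IA32_PERF_GLOBAL_CTRL = 0x00000007000000ff and IA32_FIXED_CTR_CTRL = 0xb0, counter 1 decodes as enabled for both ring 0 and user mode with its PMI bit set, which by the rule above suggests that something else is using it.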
Sorry for the delay; I have been overloaded with other tasks.
John,
I found a machine with rdpmc enabled: an HSW CPU (E5-2697 v3 @ 2.60GHz) running Red Hat Enterprise Linux Server release 6.5 (Santiago) with kernel 2.6.32-358.6.2.el6.x86_64.
The rdpmc instruction does not fail there. I ran your test code and got:
cpu 0: low 4085649949, high 65535
cpu 1: low 3651737778, high 65535
cpu 2: low 3404553720, high 65535
cpu 3: low 2885785273, high 65535
cpu 4: low 2163297754, high 65535
cpu 5: low 3387747633, high 65535
cpu 6: low 4036661582, high 65535
cpu 7: low 4254544390, high 65535
cpu 8: low 2344492980, high 65535
cpu 9: low 3150679521, high 65535
cpu 10: low 3459804814, high 65535
cpu 11: low 3361664909, high 65535
… etc
What do these numbers mean, given that the test uses rdpmc(0, low, high)? Why is "low" different on different CPUs?
Thomas,
The read() system call is highly intrusive. I integrated the perf_events code into my test to initialize the performance counter system.
I use the following main loop (some unimportant code, such as output, has been removed):
int perf_fds;

void init_instructions()
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(struct perf_event_attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(struct perf_event_attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.inherit = 1;
    attr.pinned = 1;
    attr.exclude_idle = 1;
    attr.exclude_kernel = 1;
    perf_fds = perf_event_open(&attr, 0, -1, -1, 0);
    ioctl(perf_fds, PERF_EVENT_IOC_RESET, 0);   // Resetting counter to zero
    ioctl(perf_fds, PERF_EVENT_IOC_ENABLE, 0);  // Start counters
}

int main ()
{
    init_instructions();
    for(int attempts = 0; attempts <= 20; ++attempts)
    {
        rdtscp(&chipOrig, &coreOrig);
        foo(loop, &tmp, &attStart, &attEnd, timerData.data(), &chip, &core, perf_fds);
    }
    close_instructions();
    return 0;
}
It calls foo() from a separate object file so that the compiler cannot optimize across the call.
void foo(int loop, long long *tmp, long long *attStart, long long *attEnd,
         long long *timerData, unsigned long *chip, unsigned long *core, int pId)
{
    *attStart = rdtsc();
    for(int i = 0; i < loop; i++)
    {
        long long start = read_perf_instructions(pId); //__builtin_ia32_rdpmc(INSTR_COUNT);
        *tmp += rdtsc();
        timerData[i] = read_perf_instructions(pId) - start;
    }
    *attEnd = rdtscp(chip, core);
}
In this loop I measure the number of instructions between the read_perf_instructions(pId) calls.
If I use the read() system call to read the perf_event counter, I get the following output:
Loop iterations 65536, result vector 524288 bytes
Iter  Average  Min  Max  Median  First 10 values
0     1445.3   20   21   20      20 20 20 20 20 20 20 20 20 20
1     1439.2   20   21   20      20 20 20 20 20 20 20 20 20 20
2     1442.4   20   21   20      20 20 20 20 20 20 20 20 20 20
3     1441.3   20   21   20      20 20 20 20 20 20 20 20 20 20
4     1441.4   20   21   20      20 20 20 20 20 20 20 20 20 20
…
16    1439.2   20   21   20      20 20 20 20 20 20 20 20 20 20
17    1438.3   20   21   20      20 20 20 20 20 20 20 20 20 20
18    1438.9   20   21   20      20 20 20 20 20 20 20 20 20 20
19    1439.7   20   21   20      20 20 20 20 20 20 20 20 20 20
20    1438.7   20   20   20      20 20 20 20 20 20 20 20 20 20
In the listing above, Average means "(attEnd-attStart) / loop".
Min, Max and Median are computed from the vector timerData[].
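For reference, the summary columns could be computed along these lines. This is only a sketch: the function and struct names are hypothetical, and taking the median as the middle element of the sorted samples (the upper middle for an even count) is an assumption, since the test code that produced the listing is not shown.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct sample_stats { long long min, max, median; };

/* Average column: TSC ticks per loop iteration. */
double average_ticks(long long att_start, long long att_end, int loop)
{
    assert(loop > 0);
    return (double)(att_end - att_start) / (double)loop;
}

/* Comparison callback for qsort() on long long values. */
static int cmp_ll(const void *a, const void *b)
{
    long long x = *(const long long *)a;
    long long y = *(const long long *)b;
    return (x > y) - (x < y);
}

/* Min/Max/Median columns, computed on a sorted copy so that the caller's
 * timerData[] order is left untouched.  Returns 0 on success, -1 on
 * empty input or allocation failure. */
int summarize(const long long *data, size_t n, struct sample_stats *out)
{
    long long *sorted;

    if (data == NULL || out == NULL || n == 0)
        return -1;
    sorted = malloc(n * sizeof *sorted);
    if (sorted == NULL)
        return -1;
    memcpy(sorted, data, n * sizeof *sorted);
    qsort(sorted, n, sizeof *sorted, cmp_ll);
    out->min = sorted[0];
    out->max = sorted[n - 1];
    out->median = sorted[n / 2];
    free(sorted);
    return 0;
}
```

Because the median ignores a handful of large outliers, it is usually the most useful of the three columns for judging the typical cost of one iteration.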
The loop in foo() looks like this in assembler:
401876:  4c 8d 7c 24 28         lea    0x28(%rsp),%r15
40187b:  4d 89 c5               mov    %r8,%r13
40187e:  45 31 e4               xor    %r12d,%r12d
401881:  0f 1f 80 00 00 00 00   nopl   0x0(%rax)
401888:  ba 08 00 00 00         mov    $0x8,%edx
40188d:  4c 89 fe               mov    %r15,%rsi
401890:  44 89 f7               mov    %r14d,%edi
401893:  48 c7 44 24 28 00 00   movq   $0x0,0x28(%rsp)
40189a:  00 00
40189c:  e8 57 ef ff ff         callq  4007f8 <read@plt>
4018a1:  48 8b 44 24 28         mov    0x28(%rsp),%rax
4018a6:  48 89 44 24 08         mov    %rax,0x8(%rsp)
4018ab:  0f 31                  rdtsc
4018ad:  48 c1 e2 20            shl    $0x20,%rdx
4018b1:  4c 89 fe               mov    %r15,%rsi
4018b4:  44 89 f7               mov    %r14d,%edi
4018b7:  48 8d 04 02            lea    (%rdx,%rax,1),%rax
4018bb:  48 01 03               add    %rax,(%rbx)
4018be:  ba 08 00 00 00         mov    $0x8,%edx
4018c3:  48 c7 44 24 28 00 00   movq   $0x0,0x28(%rsp)
4018ca:  00 00
4018cc:  41 83 c4 01            add    $0x1,%r12d
4018d0:  e8 23 ef ff ff         callq  4007f8 <read@plt>
4018d5:  48 8b 44 24 28         mov    0x28(%rsp),%rax
4018da:  48 2b 44 24 08         sub    0x8(%rsp),%rax
4018df:  49 89 45 00            mov    %rax,0x0(%r13)
4018e3:  49 83 c5 08            add    $0x8,%r13
4018e7:  44 39 e5               cmp    %r12d,%ebp
4018ea:  7f 9c                  jg     401888 <foo+0x48>
4018ec:  0f 01 f9               rdtscp
As you can see, there are 11 instructions between the read() calls, but 20 are reported in the listing above.
As I understand it, I can initialize the counters through the perf_event system and then use the rdpmc instruction to read the counter. (BTW, did you use the __builtin_ia32_rdpmc GCC intrinsic? I can't compile it with gcc.)
I just replaced read() with rdpmc(1<<30), as John suggested, and got:
Loop iterations 65536, result vector 524288 bytes
Iter  Average  Min  Max  Median  First 10 values
0     86.6     0    0    0       0 0 0 0 0 0 0 0 0 0
1     86.3     0    0    0       0 0 0 0 0 0 0 0 0 0
2     87.2     0    0    0       0 0 0 0 0 0 0 0 0 0
3     86.3     0    0    0       0 0 0 0 0 0 0 0 0 0
…
17    86.3     0    0    0       0 0 0 0 0 0 0 0 0 0
18    86.3     0    0    0       0 0 0 0 0 0 0 0 0 0
19    86.9     0    0    0       0 0 0 0 0 0 0 0 0 0
20    86.2     0    0    0       0 0 0 0 0 0 0 0 0 0
In assembler it looks like this:
4017ec:  b9 00 00 00 40         mov    $0x40000000,%ecx
4017f1:  48 8d 2c fd 08 00 00   lea    0x8(,%rdi,8),%rbp
4017f8:  00
4017f9:  31 ff                  xor    %edi,%edi
4017fb:  0f 1f 44 00 00         nopl   0x0(%rax,%rax,1)
401800:  0f 33                  rdpmc
401802:  48 89 d3               mov    %rdx,%rbx
401805:  49 89 c3               mov    %rax,%r11
401808:  0f 31                  rdtsc
40180a:  48 c1 e2 20            shl    $0x20,%rdx
40180e:  48 8d 04 02            lea    (%rdx,%rax,1),%rax
401812:  48 01 06               add    %rax,(%rsi)
401815:  0f 33                  rdpmc
401817:  48 c1 e2 20            shl    $0x20,%rdx
40181b:  48 c1 e3 20            shl    $0x20,%rbx
40181f:  48 8d 04 02            lea    (%rdx,%rax,1),%rax
401823:  4e 8d 1c 1b            lea    (%rbx,%r11,1),%r11
401827:  4c 29 d8               sub    %r11,%rax
40182a:  49 89 04 38            mov    %rax,(%r8,%rdi,1)
40182e:  48 83 c7 08            add    $0x8,%rdi
401832:  48 39 ef               cmp    %rbp,%rdi
401835:  75 c9                  jne    401800 <foo+0x30>
401837:  0f 01 f9               rdtscp
What did I do wrong? Why do I get zero instead of the pre-initialized HW_INSTRUCTION counter?
I need to measure quite short events in the program, so I need a low-intrusiveness method, at least for reading the PMU counters.
Sergey
- The test_rdpmc program just reads the current values in PMC 0 and prints out the lower 32-bits and upper 32-bits instead of combining them into a 64-bit value. There is no deep meaning here -- as long as the program does not have an illegal instruction fault then the counters are readable. The results are different because each core has accumulated different counts (and the code does not attempt to read them at the same time anyway -- it uses sched_setaffinity() to bind to one core at a time and then reads counter 0 on that core using a simple inline assembly macro.)
- In your results the high-order counts are all 65535, which means that this counter has been set to be very close to the overflow threshold. The smallest of the low-order counts is just over 2^31, which is consistent with the counters being set to the overflow value minus 2^31, so they will overflow and generate an interrupt every 2 billion events. This is a typical use model for sampling-based performance monitoring, but it does make it trickier to use the counters for interval analysis, since they are getting reset frequently (and since the CPU is receiving interrupts to process these overflows fairly frequently).
- There is definitely timing overhead in reading the counters, though it varies slightly across processor models. The RDTSC, RDTSCP, and RDPMC calls all take cycles -- somewhere in the range of 25 cycles to 42 cycles.
- When counting instructions, I have seen exactly the results I expected from inline RDPMC calls using the fixed-function counter 0. For an unrolled loop that shows 6 instructions for each RDPMC call, I see the counter increment by 6 each time until the end of the loop where the loop control instructions increase the number of instructions to 10 -- also correctly reported. More details below....
- It is not clear that you verified that the fixed-function counters were enabled. MSR 0x38d IA32_FIXED_CTR_CTRL and MSR 0x38F IA32_PERF_GLOBAL_CTRL must both be set up correctly to enable the fixed-function counters to operate. This is described in Section 18.2 of Volume 3 of the Intel Architecture Software Developer's Manual.
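The arithmetic behind the second point can be made concrete with a short sketch that combines the two halves printed by the test program and computes how many events remain before the 48-bit counter overflows. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PMC_BITS 48

/* Combine the EAX (low) and EDX (high) halves returned by RDPMC. */
uint64_t pmc_combine(uint32_t low, uint32_t high)
{
    return ((uint64_t)high << 32) | low;
}

/* Number of events left before a 48-bit counter wraps to zero. */
uint64_t pmc_events_until_overflow(uint64_t value)
{
    const uint64_t limit = UINT64_C(1) << PMC_BITS;
    return limit - (value & (limit - 1));
}
```

For the reported "cpu 0: low 4085649949, high 65535" this gives 209317347 events until overflow, and for the smallest reported low value (cpu 4, 2163297754) it gives 2131669542, just under 2^31, which is what one would expect if the counters are reloaded to the overflow value minus 2^31.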
Example of testing overhead with the RDPMC Fixed Counter 0 (Retired Instructions) event:
The code simply reads the counter repeatedly and saves the values in an array. I keep the number of iterations short so that the array will stay in cache.
One example looks like:
#define rdpmc(counter,low,high) \
    __asm__ __volatile__("rdpmc" \
        : "=a" (low), "=d" (high) \
        : "c" (counter))

for (j=0; j<NSAMPLES; j++) values64[j] = 0;   // make sure array is in cache
for (j=0; j<NSAMPLES; j+=8) {
    rdpmc(fixed0, low, high);
    values64[j] = ((unsigned long) high << 32) + (unsigned long) low;
    rdpmc(fixed0, low, high);
    values64[j+1] = ((unsigned long) high << 32) + (unsigned long) low;
    rdpmc(fixed0, low, high);
    values64[j+2] = ((unsigned long) high << 32) + (unsigned long) low;
    rdpmc(fixed0, low, high);
    values64[j+3] = ((unsigned long) high << 32) + (unsigned long) low;
    rdpmc(fixed0, low, high);
    values64[j+4] = ((unsigned long) high << 32) + (unsigned long) low;
    rdpmc(fixed0, low, high);
    values64[j+5] = ((unsigned long) high << 32) + (unsigned long) low;
    rdpmc(fixed0, low, high);
    values64[j+6] = ((unsigned long) high << 32) + (unsigned long) low;
    rdpmc(fixed0, low, high);
    values64[j+7] = ((unsigned long) high << 32) + (unsigned long) low;
}
With the Intel compiler the first rdpmc+store groups are compiled to:
rdpmc                                   #211.0
movl      %edx, %edx                    #212.38
movl      %eax, %eax                    #212.68
shlq      $32, %rdx                     #212.46
addq      %rdx, %rax                    #212.68
movq      %rax, 8+values64(,%rbx,8)     #212.5
with gcc I see slightly different code -- 5 instructions per invocation:
rdpmc
salq    $32, %rdx
mov     %eax, %eax
leaq    (%rdx,%rax), %rax
movq    %rax, values64(%rsi)
with a little bit of thought this can be reduced to 3 instructions for the repeated iterations (and 6 instructions for the final iteration in the unrolled loop that requires the extra increment/compare/branch). At one point I ran across a version of gcc that did this automagically, but now I find that I have to write the code with explicit 32-bit stores to get this result:
rdpmc
movl    %edx, (%rsi)
movl    %eax, 4(%rsi)
John,
Thank you for your comments. Regarding #5, I used the init_instructions() procedure from the listing above. That example does not explicitly check that the PMU control registers are configured properly.
init_instructions() uses the Linux_perf interface to configure the PMU counter. The idea is to use Linux_perf to configure the counter and the rdpmc instruction to read it.
How did you configure "Fixed Counter 0" so that it can be read by rdpmc in the last example?
Sergey
Update.
I found the issue in the code I used to measure the instruction count: it is the __builtin_ia32_rdpmc GCC intrinsic. GCC 5.0 generates rather odd code that uses only one rdpmc instruction and, as a consequence, the difference between the calls becomes zero.
I also found out how to use rdpmc directly after initialization through the linux_perf interface, but this approach raises more questions.
The standard linux_perf way to measure the instruction count is to get the counter value with the read() system call.
The C source loop is:
void foo(int loop, long long *tmp, long long *attStart, long long *attEnd,
         long long *timerData, unsigned long *chip, unsigned long *core, int pId)
{
    int i = 0;
    *attStart = rdtsc();
    for(i = 0; i < loop; i++)
    {
        long long start = read_perf_instructions(pId);
        *tmp += rdtsc();
        long long stop = read_perf_instructions(pId);
        timerData[i] = stop - start;
    }
    *attEnd = rdtscp(chip, core);
}
402140:  48 8d 74 24 18         lea    0x18(%rsp),%rsi
402145:  ba 08 00 00 00         mov    $0x8,%edx
40214a:  89 ef                  mov    %ebp,%edi
40214c:  48 c7 44 24 18 00 00   movq   $0x0,0x18(%rsp)
402153:  00 00
402155:  e8 46 ec ff ff         callq  400da0 <read@plt>
40215a:  4c 8b 64 24 18         mov    0x18(%rsp),%r12
40215f:  0f 31                  rdtsc
402161:  48 c1 e2 20            shl    $0x20,%rdx
402165:  48 8d 74 24 18         lea    0x18(%rsp),%rsi
40216a:  89 ef                  mov    %ebp,%edi
40216c:  48 01 d0               add    %rdx,%rax
40216f:  48 01 03               add    %rax,(%rbx)
402172:  ba 08 00 00 00         mov    $0x8,%edx
402177:  48 c7 44 24 18 00 00   movq   $0x0,0x18(%rsp)
40217e:  00 00
402180:  49 83 c6 08            add    $0x8,%r14
402184:  e8 17 ec ff ff         callq  400da0 <read@plt>
402189:  48 8b 44 24 18         mov    0x18(%rsp),%rax
40218e:  4c 29 e0               sub    %r12,%rax
402191:  49 89 46 f8            mov    %rax,-0x8(%r14)
402195:  4d 39 ee               cmp    %r13,%r14
402198:  75 a6                  jne    402140 <_Z3fooiPxS_S_S_PmS0_i+0x40>
40219a:  0f 01 f9               rdtscp
There are 11 instructions between the read() calls, but the call itself executes some unknown number of instructions. The test shows the number of instructions and the time (in the Average field) spent in one loop iteration.
Loop iterations 65536, result vector 524288 bytes
Iter  Average  Min  Max   Median  First 10 values
0     1548.4   790  6282  790     790 790 790 790 790 790 790 790 790 790
1     1547.2   790  3968  790     790 790 790 790 790 790 790 790 790 790
2     1546.2   790  3973  790     790 790 790 790 790 790 790 790 790 790
3     1546.6   790  3973  790     790 790 790 790 790 790 790 790 790 790
4     1546.0   790  3973  790     790 790 790 790 790 790 790 790 790 790
5     1544.9   790  3973  790     790 790 790 790 790 790 790 790 790 790
6     1544.4   790  5040  790     790 790 790 790 790 790 790 790 790 790
7     1544.4   790  3973  790     790 790 790 790 790 790 790 790 790 790
8     1544.2   790  3973  790     790 790 790 790 790 790 790 790 790 790
9     1544.6   790  3973  790     790 790 790 790 790 790 790 790 790 790
10    1544.4   790  3973  790     790 790 790 790 790 790 790 790 790 790
11    1540.1   790  6527  790     790 790 790 790 790 790 790 790 790 790
12    1544.8   790  4053  790     790 790 790 790 790 790 790 790 790 790
13    1555.3   790  3973  790     790 790 790 790 790 790 790 790 790 790
14    1556.5   782  3973  790     816 790 790 790 790 790 790 790 790 790
15    1555.6   790  3973  790     790 790 790 790 790 790 790 790 790 790
16    1555.9   790  3973  790     790 790 790 790 790 790 790 790 790 790
17    1553.7   790  3973  790     790 790 790 790 790 790 790 790 790 790
18    1554.7   790  3973  790     790 790 790 790 790 790 790 790 790 790
19    1554.8   790  3973  790     790 790 790 790 790 790 790 790 790 790
20    1554.7   790  4668  790     790 790 790 790 790 790 790 790 790 790
If we replace the read() system call with the rdpmc instruction (inside the read_perf_instructions(pId) function), we see a different picture:
4020e0:  0f 33                  rdpmc
4020e2:  49 03 42 10            add    0x10(%r10),%rax
4020e6:  48 c1 e2 20            shl    $0x20,%rdx
4020ea:  48 8d 3c 10            lea    (%rax,%rdx,1),%rdi
4020ee:  0f 31                  rdtsc
4020f0:  48 c1 e2 20            shl    $0x20,%rdx
4020f4:  48 01 d0               add    %rdx,%rax
4020f7:  48 01 06               add    %rax,(%rsi)
4020fa:  0f 33                  rdpmc
4020fc:  49 03 42 10            add    0x10(%r10),%rax
402100:  48 c1 e2 20            shl    $0x20,%rdx
402104:  49 83 c0 08            add    $0x8,%r8
402108:  48 01 c2               add    %rax,%rdx
40210b:  48 29 fa               sub    %rdi,%rdx
40210e:  49 89 50 f8            mov    %rdx,-0x8(%r8)
402112:  49 39 d8               cmp    %rbx,%r8
402115:  75 c9                  jne    4020e0 <_Z3fooiPxS_S_S_PmS0_i+0x30>
402117:  0f 01 f9               rdtscp
Loop iterations 65536, result vector 524288 bytes

    Iter  Average  Min  Max  Median  First 10 values
       0     85.1    8    8       8  8 8 8 8 8 8 8 8 8 8
       1     86.8    8    9       8  8 8 8 8 8 8 8 8 8 8
       2     84.3    8    8       8  8 8 8 8 8 8 8 8 8 8
       3     86.6    8    8       8  8 8 8 8 8 8 8 8 8 8
       4     85.3    8    9       8  8 8 8 8 8 8 8 8 8 8
       5     85.8    8    8       8  8 8 8 8 8 8 8 8 8 8
       6     86.4    8    9       8  8 8 8 8 8 8 8 8 8 8
       7     84.8    8    8       8  8 8 8 8 8 8 8 8 8 8
       8     86.2    8    9       8  8 8 8 8 8 8 8 8 8 8
       9     85.3    8    9       8  8 8 8 8 8 8 8 8 8 8
      10     86.2    8    9       8  8 8 8 8 8 8 8 8 8 8
      11     85.3    8    9       8  8 8 8 8 8 8 8 8 8 8
      12     85.6    8    9       8  8 8 8 8 8 8 8 8 8 8
      13     85.3    8    8       8  8 8 8 8 8 8 8 8 8 8
      14     84.6    8    8       8  8 8 8 8 8 8 8 8 8 8
      15     86.2    8    9       8  8 8 8 8 8 8 8 8 8 8
      16     86.0    8    9       8  8 8 8 8 8 8 8 8 8 8
      17     86.2    8    9       8  8 8 8 8 8 8 8 8 8 8
      18     85.3    8    9       8  8 8 8 8 8 8 8 8 8 8
      19     85.9    8    9       8  8 8 8 8 8 8 8 8 8 8
      20     86.2    8    9       8  8 8 8 8 8 8 8 8 8 8
This is the expected behavior, but linux_perf has a lot of configuration knobs, and I suspect it can behave in unexpected ways in some situations.
The documentation at http://web.eece.maine.edu/~vweaver/projects/perf_events/perf_event_open.html gives this usage example:
    do {
        seq = pc->lock;
        barrier();
        enabled = pc->time_enabled;
        running = pc->time_running;
        if (pc->cap_usr_time && enabled != running) {
            cyc = rdtsc();
            time_offset = pc->time_offset;
            time_mult   = pc->time_mult;
            time_shift  = pc->time_shift;
        }
        idx = pc->index;
        count = pc->offset;
        if (pc->cap_usr_rdpmc && idx) {
            width = pc->pmc_width;
            count += rdpmc(idx - 1);
        }
        barrier();
    } while (pc->lock != seq);
Why do we need this loop? I didn't use it in my test. Does that mean my test is wrong on an HSW-architecture CPU?
Why do we have to add pc->offset? I only need the difference between rdpmc measurements. May I use just the rdpmc return value as the counter value since the last counter read?
I don't understand what the time_offset, time_mult, etc. variables are for here.
So, I am just looking for an easy way to count simple CPU events from user space.
Sergey
I don't use the perf events subsystem to set up or manage the performance counters because it incurs a lot of overhead due to counter virtualization. The virtualization serves two purposes: (1) to count separately for each process, and (2) to expand the 48-bit hardware counters to 64 bits.
Virtualization by process can be overridden (e.g. "perf stat" has an option to count "globally"), but virtualization of the counters from 48-bit raw hardware registers to 64-bit virtualized registers cannot be overridden (as far as I know). To create a 64-bit virtual counter, the kernel reads the counters frequently (probably at the 1 millisecond Linux kernel scheduler interrupt, but it does not really matter), and adds the deltas to a 64-bit value that it keeps in memory. That is why you need a "read()" call -- this causes "perf" to read the counter again, compute the delta from the previous read, add the new delta to the 64-bit value in memory, and return the updated 64-bit count.
There is no way that this process can be fast. It is typically at least 500 cycles to get in and out of the kernel (with a very simple driver), and can be a lot more expensive. PAPI overheads are typically in excess of 2000 cycles to read a single counter. Part of that is in PAPI, but a large chunk is in the kernel access required by the underlying "perf events" substrate.
So I use a completely different approach. I program the counters explicitly using the "wrmsr.c" program from the "msrtools-1.2" package (either compiled as a standalone executable or with the important bits imported into my C program). This uses the /dev/cpu/*/msr interfaces which can be read/written by root, avoiding the need for yet another kernel module. These are high-overhead accesses (especially when run as a shell command), but they are only needed for setup, and allow me to use inline RDPMC calls to get raw hardware event counts without any overhead or confusion from virtualization.
There are a modest number of MSRs that must be set up correctly to use this approach, and care must be taken not to break any other process that might be using the performance counters. The MSRs that must be set up include MSR 0x38F IA32_PERF_GLOBAL_CTRL (in all cases), MSR 0x38D IA32_FIXED_CTR_CTRL if you want to use the fixed-function counters, and of course the IA32_PERFEVTSEL* MSRs (0x186, 0x187, 0x188, 0x189 on all recent platforms, and 0x18A, 0x18B, 0x18C, and 0x18D on processors that support 8 counters, which typically requires that HyperThreading be disabled). Some people recommend zeroing the counters, but I don't see any reason for that -- I just take differences between counts and add 2^48 if the counter has wrapped (once), i.e., if the final value is smaller than the initial value. If the timing interval becomes "large", you need to be aware of how fast the counter might increment so that you can compute the longest interval between reads that guarantees no more than 2^48 increments (so you can unambiguously detect and correct wraparounds).
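The wraparound correction and the "longest safe interval" calculation described above can be sketched in C roughly as follows. The names `pmc_delta48` and `pmc_max_safe_seconds` are illustrative only (they do not come from any library), and the sketch assumes 48-bit counters and at most one wrap between the two reads.

```c
#include <assert.h>
#include <stdint.h>

#define PMC_WIDTH   48                            /* assumed counter width in bits */
#define PMC_MODULUS (UINT64_C(1) << PMC_WIDTH)    /* 2^48 */

/* Difference between two raw counter reads, assuming the counter wrapped
 * at most once in between: if the final value is smaller than the initial
 * value, add 2^48 before subtracting. */
uint64_t pmc_delta48(uint64_t start, uint64_t end)
{
    assert(start < PMC_MODULUS && end < PMC_MODULUS);
    if (end < start)
        end += PMC_MODULUS;
    return end - start;
}

/* Longest interval (in seconds) between reads that stays within 2^48
 * increments, given an upper bound on the increment rate. */
double pmc_max_safe_seconds(double max_increments_per_second)
{
    assert(max_increments_per_second > 0.0);
    return (double)PMC_MODULUS / max_increments_per_second;
}
```

For an event that increments at most once per cycle on a 3 GHz core, `pmc_max_safe_seconds(3e9)` comes out at roughly 93,800 seconds, or about 26 hours.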
Sometimes you can get away with using RDPMC if "perf events" has programmed the counters, but only for very short intervals since you can't be completely sure when/if the counter has been reset by the kernel.
The performance counters are complicated largely because the hardware is complicated, and secondarily because Intel does not want to expose microarchitectural implementation details without good reason. (Patent trolls can be quite creative at re-interpreting the patents that they own to claim that a big company is violating the patents -- but they need to have some idea of how the processor is implemented to make these claims.)
Some aspects of hardware performance counters probably need to be restricted to elevated privilege levels. For example, configuring the hardware performance counters to generate interrupts has the potential to severely impact system performance and usability. On the other hand, most of what the performance counters do is perfectly safe -- the vendors do a very good job of ensuring that programming random bits into the performance counter control registers is "safe" -- you may not be able to interpret the results, but the processor runs just fine.
I prefer that the hardware performance counters remain as low-level features, and not as registers that get saved and restored on context switches. (It would be hard to use the counters to measure things like context switch overhead if they were swapped in and out.) But leaving the counters as "raw" low-level features means that they cannot easily be shared, and it means that they provide a potentially high-bandwidth covert channel between processes.
In the high performance computing world where I work, systems are seldom time-shared, so we don't really need to worry about either sharing the counters or about covert channels. In the production environment a job is assigned a set of nodes and no other user is allowed access to those nodes for the duration of the job. The nodes are still shared between the OS (and all its subsidiary processes) and the user (and all the auxiliary processes that the user might cause to be started), but since this is the standard mode of operation, dealing with this sharing is part of the performance puzzle that we are trying to understand.
To use the hardware performance counters manually, a variety of tools are needed:
- For the hardware performance counters in the processor cores, I build the "rdmsr" and "wrmsr" command-line tools from "msrtools-1.2".
- I use a script to configure the global configuration registers and the PERFEVTSEL* MSRs for the programmable core counters.
- For whole-program measurements, I read the counters using the "rdmsr" program before and after the execution (taking care that the run is short enough that the counters can't be incremented more than 2^48 times during the run). You can also use "perf stat" for these sorts of measurements.
- For interval measurements inside the code, I program the counters using the script, then use the RDPMC command to read them at the desired locations in the code.
- For the "uncore" counters there are three different interfaces used, depending on the processor model:
- Some "uncore" counters use MSRs and can be configured using "wrmsr" as above. Unfortunately these can only be read from inside the kernel (since the RDMSR instruction can only be executed at ring 0). If the program is being run by root (or is owned by root and has the setuid bit set), then the program can open the /dev/cpu/*/msr device files and read or write the counters using pread() or pwrite() calls. These are kernel calls so they cost a few thousand cycles each, but there is nothing that can be done about this. (One thing that could help is to build a kernel module that could return multiple MSR values with a single call.)
- Some "uncore" counters are in "PCI configuration space". The root user can read/write these counters using the "setpci" command-line program. As with the MSR-based counters, a root user can open the device driver files (in /proc/bus/pci, I think) and read/write the counters using pread() and pwrite() commands (limited to 32-bit transactions).
- Some processors include "uncore" counters in a different range of memory-mapped IO space. Working with these is an advanced topic....
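The pread()/pwrite() access to the /dev/cpu/*/msr device files mentioned in the list above can be sketched like this. The helper names `msr_read` and `msr_write` are made up for illustration; the sketch assumes the Linux msr driver is loaded and that the MSR number is passed as the file offset, which is how that driver addresses registers.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Read one 64-bit MSR on logical CPU `cpu` through /dev/cpu/<cpu>/msr.
 * Returns 0 on success, -1 on failure (e.g. not root, or the msr
 * module is not loaded). */
int msr_read(int cpu, uint32_t msr, uint64_t *value)
{
    char path[64];
    assert(value != NULL);
    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    ssize_t n = pread(fd, value, sizeof *value, (off_t)msr);
    close(fd);
    return n == (ssize_t)sizeof *value ? 0 : -1;
}

/* Write one 64-bit MSR on logical CPU `cpu`; same conventions as above. */
int msr_write(int cpu, uint32_t msr, uint64_t value)
{
    char path[64];
    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    ssize_t n = pwrite(fd, &value, sizeof value, (off_t)msr);
    close(fd);
    return n == (ssize_t)sizeof value ? 0 : -1;
}
```

In a real tool it would probably make sense to open each device file once and keep the descriptor, since every open/close pair adds more kernel crossings on top of the pread() itself.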
Here is a fairly typical script that I use to set up the counters (edited for clarity):
    #!/bin/bash

    export NCORES=`cat /proc/cpuinfo | grep -c processor`
    echo "Number of cores is $NCORES"
    export MAXCORE=`expr $NCORES - 1`

    # Enable all counters in IA32_PERF_GLOBAL_CTRL
    #   bits 34:32 enable the three fixed-function counters
    #   bits 7:0 enable the eight programmable counters
    echo "Checking IA32_PERF_GLOBAL_CTRL on all cores"
    echo "  (should be 00000007000000ff)"
    for core in `seq 0 $MAXCORE`
    do
        echo -n "$core "
        ~/bin/rdmsr -p $core -x -0 0x38f
        ~/bin/wrmsr -p $core 0x38f 0x00000007000000ff
    done

    # Core Performance Counter Event Select MSRs
    #   Counter   MSR
    #      0     0x186
    #      1     0x187
    #      2     0x188
    #      3     0x189
    #      4     0x18a
    #      5     0x18b
    #      6     0x18c
    #      7     0x18d

    # Dump all performance counter event select registers on all cores
    # (disabled by default; change the test below to enable it)
    if [ 0 == 1 ]
    then
        echo "Printing out all performance counter event select registers"
        echo "MSR CORE CurrentValue"
        for PMC_MSR in 186 187 188 189 18a 18b 18c 18d
        do
            for CORE in `seq 0 $MAXCORE`
            do
                echo -n "$PMC_MSR $CORE "
                # the original used $core here, which is stale inside this loop
                ~/bin/rdmsr -p $CORE -0 -x 0x${PMC_MSR}
            done
        done
    fi

    # Counter 0  Uops Dispatched on Port 0      0x004301a1
    # Counter 1  Uops Dispatched on Port 1      0x004302a1
    # Counter 2  Uops Dispatched on Port 2      0x004304a1
    # Counter 3  Uops Dispatched on Port 3      0x004308a1
    # Counter 4  actual core cycles unhalted    0x0043003c
    # Counter 5  Uops Dispatched on Port 5      0x004320a1
    # Counter 6  cycles with no uops delivered from back end to
    #            front end & there is no back end stall  0x0143019c
    # Counter 7  Uops issued from RAT to RS     0x0043010e
    echo "Programming counters 0-7"
    for core in `seq 0 $MAXCORE`
    do
        ~/bin/wrmsr -p $core 0x186 0x004301a1
        ~/bin/wrmsr -p $core 0x187 0x004302a1
        ~/bin/wrmsr -p $core 0x188 0x004304a1
        ~/bin/wrmsr -p $core 0x189 0x004308a1
        ~/bin/wrmsr -p $core 0x18a 0x0043003c
        ~/bin/wrmsr -p $core 0x18b 0x004320a1
        ~/bin/wrmsr -p $core 0x18c 0x0143019c
        ~/bin/wrmsr -p $core 0x18d 0x0043010e
    done
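The hexadecimal values written to the IA32_PERFEVTSEL* MSRs in the script are easier to read once they are split into fields. The sketch below rebuilds a few of them from their parts, using the architectural layout from the Intel SDM (event select in bits 7:0, unit mask in bits 15:8, USR in bit 16, OS in bit 17, EN in bit 22, counter mask in bits 31:24). The macro and function names are invented for this example, and only the fields the script actually uses are covered.

```c
#include <assert.h>
#include <stdint.h>

#define EVTSEL_USR (UINT32_C(1) << 16)   /* count in user mode   */
#define EVTSEL_OS  (UINT32_C(1) << 17)   /* count in kernel mode */
#define EVTSEL_EN  (UINT32_C(1) << 22)   /* enable the counter   */

/* Assemble an IA32_PERFEVTSELx value from its fields.
 * `flags` may only contain bits in the 23:16 range. */
uint32_t evtsel_encode(uint8_t event, uint8_t umask, uint8_t cmask,
                       uint32_t flags)
{
    assert((flags & ~UINT32_C(0x00FF0000)) == 0);
    return (uint32_t)event
         | ((uint32_t)umask << 8)
         | flags
         | ((uint32_t)cmask << 24);
}
```

With `EVTSEL_USR | EVTSEL_OS | EVTSEL_EN` as the flags, event 0xA1 with unit mask 0x01 gives 0x004301a1 (counter 0 in the script), and event 0x9C with unit mask 0x01 and counter mask 1 gives 0x0143019c (counter 6).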
John, thank you for your answer. It is what I need. However, I ran into some trouble when using your code. rdpmc_actual_cycles() works, but rdpmc_reference_cycles() and rdpmc_instructions() always return zero. You mentioned that these counters may not be enabled by default; is that the reason why I get zeros? How do I enable the counters?
John McCalpin wrote:
In recent Intel processors there are two ways to use the input argument for the RDPMC instruction.
Values of 0 to 3 (or 0 to 7) select one of the programmable performance counters.
Values of 2^30, 2^30+1, and 2^30+2 select one of the "fixed-function" performance counters. Documentation of this use is not very clear, and not particularly easy to find, so I usually just go back to my own code rather than trying to find it in the Intel documents.
The routines below provide access to each of the "fixed function" performance counter events with names that are easier to remember than the corresponding performance counter number.
Note that on some/many systems these fixed-function counters are either not enabled by default or they are enabled and in use by another process (sometimes the BIOS and sometimes the "NMI watchdog" process). If they are in use by another process they are probably configured to generate an interrupt on overflow, and the interrupt handler will reset the counter value every time. For example, the NMI watchdog on Linux systems often uses the "actual cycles" counter set up to overflow every 2 billion cycles (i.e., the counter is reset to (2^48-1 - 2^32) by the interrupt handler). In this case it is still perfectly safe to read the counter and it is still quite useful for measuring over short intervals (i.e., much less than 2 billion cycles) as long as you can do "sanity-checking" on the results and are able to discard the occasional results that are corrupted by the reset of the counter.
    // rdpmc_instructions uses a "fixed-function" performance counter to return
    // the count of retired instructions on the current core in the low-order
    // 48 bits of an unsigned 64-bit integer.
    unsigned long rdpmc_instructions()
    {
        unsigned a, d, c;

        c = (1<<30);
        __asm__ volatile("rdpmc" : "=a" (a), "=d" (d) : "c" (c));

        return ((unsigned long)a) | (((unsigned long)d) << 32);
    }

    // rdpmc_actual_cycles uses a "fixed-function" performance counter to return
    // the count of actual CPU core cycles executed by the current core. Core
    // cycles are not accumulated while the processor is in the "HALT" state,
    // which is used when the operating system has no task(s) to run on a
    // processor core.
    unsigned long rdpmc_actual_cycles()
    {
        unsigned a, d, c;

        c = (1<<30)+1;
        __asm__ volatile("rdpmc" : "=a" (a), "=d" (d) : "c" (c));

        return ((unsigned long)a) | (((unsigned long)d) << 32);
    }

    // rdpmc_reference_cycles uses a "fixed-function" performance counter to
    // return the count of "reference" (or "nominal") CPU core cycles executed
    // by the current core. This counts at the same rate as the TSC, but does
    // not count when the core is in the "HALT" state. If a timed section of
    // code shows a larger change in TSC than in rdpmc_reference_cycles, the
    // processor probably spent some time in a HALT state.
    unsigned long rdpmc_reference_cycles()
    {
        unsigned a, d, c;

        c = (1<<30)+2;
        __asm__ volatile("rdpmc" : "=a" (a), "=d" (d) : "c" (c));

        return ((unsigned long)a) | (((unsigned long)d) << 32);
    }
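The two ways of using the RDPMC input argument described in the quoted post can be captured in a pair of small helpers. This is only a sketch: the names `rdpmc_select_programmable`, `rdpmc_select_fixed`, and `rdpmc_read` are hypothetical, and the limits of 8 programmable and 3 fixed-function counters are taken from the discussion in this thread rather than queried from CPUID.

```c
#include <assert.h>
#include <stdint.h>

#define RDPMC_FIXED_FLAG (UINT32_C(1) << 30)   /* bit 30 selects the fixed-function set */

/* ECX value for programmable counter n (0..7 assumed). */
uint32_t rdpmc_select_programmable(unsigned n)
{
    assert(n < 8);
    return n;
}

/* ECX value for fixed-function counter n:
 * 0 = instructions retired, 1 = actual cycles, 2 = reference cycles. */
uint32_t rdpmc_select_fixed(unsigned n)
{
    assert(n < 3);
    return RDPMC_FIXED_FLAG | n;
}

/* Execute RDPMC with the given selector. This faults unless user-mode
 * RDPMC is permitted and returns zero if the counter is not enabled. */
static inline uint64_t rdpmc_read(uint32_t selector)
{
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(selector));
    return ((uint64_t)hi << 32) | lo;
}
```

So `rdpmc_read(rdpmc_select_fixed(1))` would be the equivalent of rdpmc_actual_cycles() above, and `rdpmc_read(rdpmc_select_programmable(0))` would read whatever event was programmed into IA32_PERFEVTSEL0.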
The fixed function counters are controlled by two MSRs.
- IA32_PERF_GLOBAL_CTRL (MSR 0x38F): bits 34:32 must be set to enable the three fixed-function counters.
  - These are usually set by default, but it is a good idea to check anyway.
- IA32_FIXED_CTR_CTRL (MSR 0x38D) has three 4-bit fields to control the fixed-function counters.
  - For each counter, the bits are:
    - Bit 0 enables counting in kernel mode
    - Bit 1 enables counting in user mode
    - Bit 2 enables counting for any thread running on the core in a system supporting more than one logical processor per physical core
    - Bit 3 enables interrupts on overflow of this counter
  - It is very common for the NMI watchdog to use one of these counters.
    - If this is the case then one of the counters will have the "interrupt on overflow" bit enabled.
    - A typical setting for this register is 0x0b0:
      - Fixed Function Counter 2 is disabled (the high-order 4 bits are 0)
      - Fixed Function Counter 1 is enabled for user and kernel counts, and has the interrupt on overflow bit set
      - Fixed Function Counter 0 is disabled (the low-order 4 bits are 0)
    - Disabling the NMI watchdog will typically clear the high-order bit of all three fields.
      - Then you can write 0x333 to enable user+kernel mode for all three counters.
    - You can still read the counter if the NMI watchdog is using it, but you need to be aware that the counter value will be reset after it overflows. A typical configuration is to set it to overflow every 2 billion cycles, so if your measurements are short, then this won't happen very often.
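The 4-bit fields of IA32_FIXED_CTR_CTRL listed above can be decoded with a few shifts and masks. The following is a minimal sketch with made-up names (`fixed_ctr_field`, `fixed_ctr_decode`, `fixed_ctr_any_pmi`); it takes a value that has already been read from MSR 0x38D and does not touch the hardware itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct fixed_ctr_field {
    bool os;          /* bit 0: count in kernel mode            */
    bool usr;         /* bit 1: count in user mode              */
    bool any_thread;  /* bit 2: count for any thread on the core */
    bool pmi;         /* bit 3: interrupt on overflow           */
};

/* Decode the control field for fixed-function counter 0, 1 or 2. */
struct fixed_ctr_field fixed_ctr_decode(uint64_t ctrl, unsigned counter)
{
    assert(counter < 3);
    unsigned f = (unsigned)(ctrl >> (4 * counter)) & 0xFu;
    struct fixed_ctr_field r = {
        (f & 1u) != 0, (f & 2u) != 0, (f & 4u) != 0, (f & 8u) != 0
    };
    return r;
}

/* True if any of the three counters has "interrupt on overflow" set,
 * which usually means something else (e.g. the NMI watchdog) owns it. */
bool fixed_ctr_any_pmi(uint64_t ctrl)
{
    return (ctrl & UINT64_C(0x888)) != 0;
}
```

Applied to the typical value 0x0b0, this reports counter 1 as enabled for user and kernel mode with the interrupt bit set, and counters 0 and 2 as disabled; for 0x333 all three counters are enabled and no interrupt bit is set.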
A very irritating feature of the Linux kernel is that the "perf stat" command (or similar) will sometimes use these fixed function counters and will disable them on exit. A rational piece of software would check the initial state and restore that state on exit -- but the Linux "perf events" subsystem is nothing like rational....
John, thank you again. BTW, I found that I can enable the counters with the perf_event interface, as shown by Thomas's code.
RDPMC works after this initialization code:
    // Assumes <stdio.h>, <string.h>, <sys/ioctl.h>, <linux/perf_event.h> and a
    // perf_event_open() wrapper around syscall(__NR_perf_event_open, ...).
    void pmc_enable()
    {
        struct perf_event_attr attr_inst, attr_rcyc;
        int perf_hw_inst, perf_hw_refcyc;

        // Configure the events
        memset(&attr_inst, 0, sizeof(struct perf_event_attr));
        attr_inst.type    = PERF_TYPE_HARDWARE;
        attr_inst.size    = sizeof(struct perf_event_attr);
        attr_inst.config  = PERF_COUNT_HW_INSTRUCTIONS;
        attr_inst.inherit = 1;

        memset(&attr_rcyc, 0, sizeof(struct perf_event_attr));
        attr_rcyc.type    = PERF_TYPE_HARDWARE;
        attr_rcyc.size    = sizeof(struct perf_event_attr);
        attr_rcyc.config  = PERF_COUNT_HW_REF_CPU_CYCLES;
        attr_rcyc.inherit = 1;

        // Due to the setting of attr.inherit, it will also count all children
        perf_hw_inst = perf_event_open(&attr_inst, 0, -1, -1, 0);
        if (perf_hw_inst < 0)
            fprintf(stderr, "Failed to start HW_INSTRUCTIONS\n");

        perf_hw_refcyc = perf_event_open(&attr_rcyc, 0, -1, -1, 0);
        if (perf_hw_refcyc < 0)
            fprintf(stderr, "Failed to start HW_REF_CPU_CYCLES\n");

        // Reset the counters to zero
        ioctl(perf_hw_inst, PERF_EVENT_IOC_RESET, 0);
        ioctl(perf_hw_refcyc, PERF_EVENT_IOC_RESET, 0);

        // Start the counters
        ioctl(perf_hw_inst, PERF_EVENT_IOC_ENABLE, 0);
        ioctl(perf_hw_refcyc, PERF_EVENT_IOC_ENABLE, 0);
    }
You need to be careful using RDPMC together with perf events, because the OS maintains its own idea of the correct count. To do this properly you need to mmap some space used by the kernel to store counter information and then follow a specific RDPMC code pattern. Andi Kleen's jevents code does this: https://github.com/andikleen/pmu-tools/tree/master/jevents
-- Larry
Hi All,
I am using Dr. John McCalpin's RDPMC code to read the instruction and cycle counts on an Intel Knights Landing core.
    #include <stdio.h>

    #define rdpmc(counter,low,high) \
        __asm__ __volatile__("rdpmc" \
            : "=a" (low), "=d" (high) \
            : "c" (counter))

    #define NSAMPLES 1000

    int main()
    {
        int j;
        unsigned long values64[NSAMPLES];
        unsigned int fixed0, low, high;

        fixed0 = (1<<30)+2;

        for (j=0; j<NSAMPLES; j++) values64[j] = 0;   // make sure array is in cache

        for (j=0; j<NSAMPLES; j+=8) {
            rdpmc(fixed0, low, high);
            values64[j]   = ((unsigned long) high << 32) + (unsigned long) low;
            rdpmc(fixed0, low, high);
            values64[j+1] = ((unsigned long) high << 32) + (unsigned long) low;
            rdpmc(fixed0, low, high);
            values64[j+2] = ((unsigned long) high << 32) + (unsigned long) low;
            rdpmc(fixed0, low, high);
            values64[j+3] = ((unsigned long) high << 32) + (unsigned long) low;
            rdpmc(fixed0, low, high);
            values64[j+4] = ((unsigned long) high << 32) + (unsigned long) low;
            rdpmc(fixed0, low, high);
            values64[j+5] = ((unsigned long) high << 32) + (unsigned long) low;
            rdpmc(fixed0, low, high);
            values64[j+6] = ((unsigned long) high << 32) + (unsigned long) low;
            rdpmc(fixed0, low, high);
            values64[j+7] = ((unsigned long) high << 32) + (unsigned long) low;
        }

        for (j=0; j<NSAMPLES; j++) printf(" %d %lu\n", j, values64[j]);   // print the samples
    }
I have experimented with different fixed0 values in the above code: fixed0 = (1<<30)+1 and fixed0 = (1<<30). When fixed0=(1<<30), the output value (all the values of values64[]) is always 985; when fixed0=(1<<30)+1, the output value is always 6041; and when fixed0=(1<<30)+2, the output value is always 0.
I have run the above code repeatedly, and the numbers quoted above are consistently the same.
I am running the code at user level (without sudo access), and I doubt that the above numbers are correct. Is the actual register being read or not?
I expect the pmc_enable() approach using ioctl to incur system call overhead. Since I intend to count the instructions of simple code with, say, 100 instructions, this overhead is important to consider.
Please point out any issues with the above code. Thanks.
You need to read Chapter 18 of Volume 3 of the Intel Architectures Software Developer's Manual to understand the rest of the infrastructure that is required to use the performance counters.
The fixed-function performance counters are the easiest, but even they have two different MSRs that need to be set properly before they will increment.
- On most Linux systems the default setting of the IA32_PERF_GLOBAL_CTRL register (MSR 0x38f) enables the fixed-function counters (bits 32, 33, and 34 are each set to 1), but this may be overridden, so it is important to check.
- On most Linux systems the default setting of the IA32_FIXED_CTR_CTRL register (MSR 0x38d) does *not* enable the fixed-function counters, or enables only one of them for use by the NMI Watchdog function.
- The bits in this register are described in Sections 18.2.2 and 18.2.3 of Volume 3 of the Intel Architectures Software Developer's Manual.
- If none of the counters are enabled by default, then they can all be enabled to increment in both user and kernel space by writing 0x333 to MSR 0x38d.
- Be aware that the "perf stat" (or "perf record") facilities assume that they can use the fixed-function counters (without checking to see if they are already in use), and the code stupidly disables the counters after it uses them.
- If bit 3, bit 7, or bit 11 of MSR 0x38d is set, then some process has set up the performance counters to generate an interrupt on overflow. This is usually the NMI Watchdog, but it could be used by other privileged processes (or by the BIOS). If the counter is enabled, but the "interrupt on overflow" bit is set, you can still use RDPMC to read the counters -- but you need to be aware that the counter will be reset every time it overflows. A commonly used approach is to re-set the counter to 2^48-2^32 so that it will overflow and generate an interrupt every 2^32 increments. If you are measuring over intervals that are "short" compared to 2^32 increments, then most of your differences will be OK, but if the counter is re-set during the interval you may get differences that look negative.
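The "differences that look negative" case above can be screened out with a simple sanity check before any statistics are computed. The sketch below is one possible policy, not something prescribed in this thread: it drops any sample whose end value is below its start value (the counter was reset or wrapped during the interval) and any sample whose difference exceeds a caller-supplied plausibility bound. The function name and the bound are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy the plausible counter differences from n (start, end) pairs into
 * `out` and return how many were kept. `out` must have room for n values. */
size_t keep_plausible_deltas(const uint64_t *start, const uint64_t *end,
                             size_t n, uint64_t max_plausible, uint64_t *out)
{
    size_t kept = 0;

    assert(start != NULL && end != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (end[i] < start[i])
            continue;                 /* counter was reset during the interval */
        uint64_t delta = end[i] - start[i];
        if (delta > max_plausible)
            continue;                 /* implausibly large: discard as well */
        out[kept++] = delta;
    }
    return kept;
}
```

Because the reset happens only once every 2^32 increments in the configuration described above, very few short-interval samples should be discarded this way.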
Thanks Dr. McCalpin.
The following is the code that seems to be working:
    #include <stdlib.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/perf_event.h>
    #include <asm/unistd.h>

    static long
    perf_event_open (struct perf_event_attr *hw_event, pid_t pid, int cpu,
                     int group_fd, unsigned long flags)
    {
        int ret;

        ret = syscall (__NR_perf_event_open, hw_event, pid, cpu, group_fd, flags);
        return ret;
    }

    #define rdpmc(counter,low,high) \
        __asm__ __volatile__("rdpmc" \
            : "=a" (low), "=d" (high) \
            : "c" (counter))

    int main ()
    {
        unsigned long values1, values2;
        unsigned int fixed0, low, high;
        struct perf_event_attr pe;
        int fd, i;

        fixed0 = (1 << 30);

        memset (&pe, 0, sizeof (struct perf_event_attr));
        pe.type = PERF_TYPE_HARDWARE;
        pe.size = sizeof (struct perf_event_attr);
        pe.config = PERF_COUNT_HW_INSTRUCTIONS;
        pe.disabled = 1;
        pe.exclude_kernel = 0;
        pe.exclude_hv = 0;
        pe.exclude_idle = 0;

        fd = perf_event_open (&pe, 0, -1, -1, 0);
        if (fd == -1) {
            fprintf (stderr, "Error opening leader %llx\n", pe.config);
            exit (EXIT_FAILURE);
        }

        for (i=1; i<=50; i++) {
            ioctl (fd, PERF_EVENT_IOC_RESET, 0);
            ioctl (fd, PERF_EVENT_IOC_ENABLE, 0);

            rdpmc (fixed0, low, high);
            values1 = ((unsigned long) high << 32) + (unsigned long) low;

            //test ()

            rdpmc (fixed0, low, high);
            values2 = ((unsigned long) high << 32) + (unsigned long) low;

            ioctl (fd, PERF_EVENT_IOC_DISABLE, 0);
            printf (" %lu\n", values2 );
        }

        close (fd);
    }
i. How can I measure two events (INSTRUCTIONS and REFERENCE_CYCLES) at the same time using rdpmc?
ii. If pe.exclude_kernel = 0; pe.exclude_hv = 0; pe.exclude_idle = 0; are used for the measurement, does it account for all instructions and cycles, including OS daemons, interrupts, etc.?
iii. I have used the above code inside an MPI program, and the measurements within the MPI processes appear reasonable. Do I need to pass any special input parameters to perf_event_open()?
I have never used the perf events interface, so I don't know any of the details of how it is configuring the HW or SW.... (That is the primary reason why I don't use it -- it takes longer to figure out what it is measuring than it takes for me to set up exactly what I want manually.)
A couple of issues that may be related to what you are working on:
- "perf events" has hooks in the OS scheduler code so that it can save and restore the counter programming and the counter counts at context switches.
- "perf events" code also executes periodically (even without context switches) to read the counters and accumulate the deltas into a 64-bit "virtual counter" that won't overflow. This is probably integrated into the scheduler interrupt handler code, but it is possible to implement with an independent timer-based interrupt.
- "perf events" does not appear to have a 1:1 mapping between event names (e.g., PERF_COUNT_HW_INSTRUCTIONS) and hardware configuration. My understanding is that when an event can be counted by either a fixed-function counter or by a programmable counter, perf_events will use the fixed-function counter if it is not currently in use (e.g., by the NMI Watchdog), and will use a programmable counter if the corresponding fixed-function counter is busy. The events should give the same results using either counter interface, but this behavior makes it harder to understand which control registers are being modified.
- Both the fixed-function and programmable performance counters have a bit to enable counting in user mode and a bit to enable counting in kernel mode. These clearly work as intended at the large scale, but interrupt & exception handlers will have some instructions in user mode and some in kernel mode, so detailed counts may differ from expectations.
The fixed-function counters are independent of each other, so as long as they are enabled you can read any or all of them. Again, I don't know how to do this with the perf_events interface.
For MPI code you should not need to do anything special. MPI functions will have a combination of user-space and kernel-space activity, just like any other IO. Pinning the thread under observation to a single logical processor is almost always a good idea when using performance counters.
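Pinning the calling thread to one logical processor can be done on Linux with `sched_setaffinity()`. The sketch below is a minimal example; `pin_to_cpu` is a name chosen for illustration, and the code assumes glibc, where the `CPU_*` macros require `_GNU_SOURCE` to be defined before any header is included.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Restrict the calling thread to logical processor `cpu`.
 * Returns 0 on success, -1 on failure or if `cpu` is out of range. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return -1;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);   /* pid 0 = calling thread */
}
```

Calling this once at the start of the measured thread, before the first counter read, keeps both reads of an interval on the same core's counters. For MPI jobs the launcher's own binding options may already take care of this.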
MPI_Barrier 1 1 0.0000105019
MPI_Barrier 2 1 0.0000945573
MPI_Barrier 3 1 0.0000133098
My aim is to understand why the high value of 0.00009455 occurred.
cycles:325
inst:196
cycles:2288
inst:201
cycles:416
inst:201
I can correlate the higher rdtscp value with the higher cycles value (2288); however, there is no increase in the instructions reported.
I am puzzled about what is happening.
I used perf events with the options
pe.exclude_kernel = 0;
pe.exclude_hv = 0;
pe.exclude_idle = 0;
so, ideally it should account for all the kernel events.
However, the perf_event_open man page says
"
PERF_COUNT_HW_INSTRUCTIONS
Retired instructions. Be careful, these can be affected by various issues, most notably hardware
interrupt counts."
Does that mean that this counter does not account for hardware interrupts?
If perf_events is not the right interface, can you please suggest which interface I should use?
My goal is to identify the source of the event that is causing the 2288-cycle latency.
Thank you!
Applications running on multiple cores are going to show significant variability on any system, and it is not always possible to understand why specific cases ran slowly.
I recommend getting a thorough understanding of the statistics before deciding whether the slow result is worth paying attention to. Good statistics require at least several hundred measurements.
Your two "fast" MPI barrier measurements are fairly close to the lower limit of what is theoretically possible on a 2-socket system. The "slow" result is not very slow -- if any of the cores involved in the barrier take a timer interrupt, it is likely to take at least 2000 cycles. This can be expected to happen once every millisecond with typical OS configurations.
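To get at the statistics, the per-call cycle counts can be collected into an array and summarized along the following lines. This is only a sketch: the struct and function names are invented here, and the outlier threshold is a parameter the caller picks (for example 2000 cycles, following the timer-interrupt estimate above).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct sample_stats {
    uint64_t min, median, max;
    double   mean;
    size_t   outliers;    /* samples at or above the threshold */
};

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Summarize n samples. The array is sorted in place. */
struct sample_stats summarize(uint64_t *v, size_t n, uint64_t outlier_threshold)
{
    assert(v != NULL && n > 0);
    qsort(v, n, sizeof *v, cmp_u64);

    struct sample_stats s = { v[0], v[n / 2], v[n - 1], 0.0, 0 };
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += (double)v[i];
        if (v[i] >= outlier_threshold)
            s.outliers++;
    }
    s.mean = sum / (double)n;
    return s;
}
```

With several hundred barrier timings, comparing the median with the mean and counting the outliers gives a quick indication of whether a value like 2288 cycles is a rare interrupt-sized disturbance or part of the normal distribution.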
