Modern CPUs have several special registers that capture ongoing events while a program or a piece of code is running, in order to help optimize software performance. Utilizing these registers requires knowledge of low-level programming such as assembly language. However, existing libraries such as PAPI have been developed to abstract these details.
I tried to take samples (say, 10000 samples) from a piece of code using PAPI, and I noticed a significant delay in executing the same code when it is instrumented with PAPI. This degrades the results. The reason for using PAPI is its ability to capture multiple events simultaneously. Is it possible to capture three events (L1_DATA_CACHE_MISS, L2_MISS, and L3_MISS) by executing three consecutive RDPMC instructions before and after a targeted piece of code? Is there any other way of writing the following code to get accurate results? The code below demonstrates the case.
unsigned long get_rdpmc(int event)
{
    unsigned int a, d;   /* RDPMC returns the low 32 bits in EAX, high 32 bits in EDX;
                          * unsigned avoids sign-extension when widening below */
    __asm__ volatile("rdpmc" : "=a" (a), "=d" (d) : "c" (event));
    return ((unsigned long)a) | (((unsigned long)d) << 32);
}

int main(void)
{
    unsigned long before_L1_MISS, before_L2_MISS, before_L3_MISS;
    unsigned long after_L1_MISS,  after_L2_MISS,  after_L3_MISS;

    /* sampling */
    for (int i = 0; i < 100000; i++) {
        before_L1_MISS = get_rdpmc(L1_MISS);   /* L1_MISS etc. are the counter  */
        before_L2_MISS = get_rdpmc(L2_MISS);   /* indices that were programmed  */
        before_L3_MISS = get_rdpmc(L3_MISS);   /* with the desired events       */
        foo();
        after_L1_MISS = get_rdpmc(L1_MISS);
        after_L2_MISS = get_rdpmc(L2_MISS);
        after_L3_MISS = get_rdpmc(L3_MISS);
    }
    return 0;
}
The inline RDPMC instruction is the lowest-latency approach available. This is a microcoded instruction, so it is not instantaneous, but it is reasonably fast -- something in the range of 24-40 cycles, depending on the processor and the event selected. It has been a while since I looked at the overhead of repeated, back-to-back RDPMC instructions, but I don't think that there is a lot of overlap.
If you want the lowest possible overhead, it is a good idea to make sure that the variables that you are writing into are dirty in the L1 Data Cache before you execute the RDPMC instructions. Writing a zero to the 64-bit target address is enough, but you will want to check the assembly code to make sure that the compiler does not eliminate this store as "dead code".
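For example, a minimal sketch of that warm-up (illustrative names; the counter numbers assume the PERFEVTSELx registers have already been programmed, and user-space RDPMC requires CR4.PCE to be enabled, e.g. via /sys/bus/event_source/devices/cpu/rdpmc on Linux):

#include <stdint.h>

static inline uint64_t rdpmc64(uint32_t counter)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdpmc" : "=a" (lo), "=d" (hi) : "c" (counter));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    volatile uint64_t before[3], after[3];
    int i;

    /* Warm-up: dirty the destination lines in the L1 Data Cache so the
     * stores after each RDPMC do not miss; 'volatile' keeps the compiler
     * from eliminating these otherwise-dead stores. */
    for (i = 0; i < 3; i++) {
        before[i] = 0;
        after[i] = 0;
    }

    before[0] = rdpmc64(0);
    before[1] = rdpmc64(1);
    before[2] = rdpmc64(2);
    /* ... code under test ... */
    after[0] = rdpmc64(0);
    after[1] = rdpmc64(1);
    after[2] = rdpmc64(2);
    return 0;
}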
When I am feeling particularly cranky about overheads, I modify the macro so that it only saves the low-order 32-bits of the result. This saves a shift and OR, and it allows me to fit twice as many PMC values in the same fraction of the L1 Data Cache. I have also created separate inline assembly macros for counters 0-7 so there won't be a chance of a memory operation to fetch the counter number, but I doubt this makes any difference in most use cases.
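A sketch of that 32-bit-only variant, with the counter number hard-coded so there is no chance of a memory operand for it (EDX must still be declared clobbered, since RDPMC writes it even though the high half is discarded):

#define RDPMC0_LO(dest)              \
    __asm__ __volatile__("rdpmc"     \
        : "=a" (dest)                \
        : "c" (0)                    \
        : "%edx")  /* keep only the low-order 32 bits of counter 0 */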
Given the lack of ordering guarantees for the RDPMC instruction, it is not clear that having lower overhead would actually make much difference. See my discussion of the difference between the RDTSC and RDTSCP instructions (and the timing plot) at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/697093#comment-1886115 as an example of how tricky understanding an OOO processor can be at this level of granularity.
One place that I really enjoyed having a very low-latency RDTSC instruction was on Xeon Phi (first generation, Knights Corner). That was an in-order core with a 5-cycle RDTSC latency. This made it possible to measure the latency of individual load instructions (provided that you were very careful that when you stored the results you did not miss in the L1 Data Cache).
John,
Thank you for the explanation. The problem is that I am interested in events other than just cycle counts, such as cache (L1, L2, and L3) misses, TLB misses, L2_LINES_IN, L2_LINES_OUT, L2_LD, etc. I need to test many of them to find the most informative events for measuring the foo() function.
It is certainly common to want to review many core performance counter events -- there are at least hundreds of available events on most Intel processors if you include combinations of Umask values and cases where the Invert, Edge Detect, and/or CountMask features are relevant. Intel processors support 2, 4, or 8 programmable performance counters per core, so if you want more events than that, you have to change the counter programming.
Unfortunately the Intel architecture makes it impossible to change the performance counter event select programming from user space at low latency/overhead. Writing the PERFEVTSEL registers requires the WRMSR instruction, which can only be executed in kernel mode. There are many papers out there showing PAPI overheads for these operations, and I think that PAPI includes a test program to measure these overheads as well.
Both PAPI and the underlying "perf events" infrastructure support more events than counters by time-slicing/multiplexing. This is a great way to get lots of information about a long-running (i.e., minutes) code (unless you get very unlucky and have a strong correlation between the timing of changes in the characteristics of the application and the interval at which the performance counter event programming is changed), but multiplexing has increasing uncertainty and overhead for shorter measurement intervals. The lowest overhead for multiplexing comes from implementing the performance counter save/change/restore code in the existing scheduling timer interrupt (which occurs every millisecond on most recent Linux systems). I don't know if "perf events" piggybacks on this interrupt or if it schedules its own separate timer interrupts, but even in the best case you are looking at overheads in the range of 5000-10000 cycles (my recollection of the overhead of the scheduler interrupt) and millisecond granularity for multiplexing the counters.
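This multiplexing is easy to observe with the perf front end. A hedged example (these generic event names are an assumption and vary by CPU and kernel):

perf stat -e cycles,instructions,L1-dcache-load-misses,LLC-load-misses,dTLB-load-misses,branch-misses ./foo

When perf has to time-slice the requested events, it scales each reported count and prints the percentage of the run during which that event was actually being counted.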
If you have a very short section of code that does not have to be run in user space, the quickest way to test it with a wide variety of counters is to put it inside a loadable kernel module so that you can use WRMSR instructions directly. You need to make sure that the kernel thread is pinned to a particular core so that you can use the "native_write_msr()" function (or inline assembly) instead of one of the cross-processor msr write calls, since those set up expensive interprocessor interrupts to make sure that the WRMSR instruction is run on the desired core.
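A minimal sketch of such a module (untested; it uses the 3.10-era native_write_msr(msr, low, high) signature, assumes the kernel thread is pinned to one core, and borrows the 0x004301D1 PERFEVTSEL encoding that appears later in this thread -- check Chapter 19 for the event codes on your specific model):

#include <linux/module.h>
#include <linux/kernel.h>
#include <asm/msr.h>

#define IA32_PERFEVTSEL0      0x186
#define IA32_PMC0             0x0c1
#define IA32_PERF_GLOBAL_CTRL 0x38f

static int __init pmc_demo_init(void)
{
    u64 before, after;

    /* Event 0xD1, Umask 0x01, USR+OS+EN (bits 16, 17, 22):
     * MEM_LOAD_UOPS_RETIRED.L1_HIT on many recent Intel cores. */
    native_write_msr(IA32_PERFEVTSEL0, 0x004301D1, 0);
    native_write_msr(IA32_PERF_GLOBAL_CTRL, 0x1, 0);  /* enable PMC0 */

    before = native_read_msr(IA32_PMC0);
    /* ... very short code under test ... */
    after = native_read_msr(IA32_PMC0);

    pr_info("PMC0 delta: %llu\n", (unsigned long long)(after - before));
    return 0;
}

static void __exit pmc_demo_exit(void)
{
    native_write_msr(IA32_PERF_GLOBAL_CTRL, 0, 0);  /* disable on unload */
}

module_init(pmc_demo_init);
module_exit(pmc_demo_exit);
MODULE_LICENSE("GPL");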
Thank you, John, for the valuable information. I have used MSRs (the wrmsr and rdmsr instructions) to read specific events, but I am not sure which events require setting Invert, Edge Detect, and/or CMask. I have not found a proper document describing these bits and their relation to the existing events. Is there any source that demonstrates how to use them, please?
Regarding the monitoring procedure: if I want to monitor a user-space process A from a kernel module, how can I pin to the processor/core that process A is currently using, so that I can monitor its activity through events such as MEM_LOAD_UOPS_RETIRED.L1_HIT, MEM_LOAD_UOPS_RETIRED.L1_MISS, MEM_UOPS_RETIRED.ALL_LOADS, etc.? In the sample code you posted in the "How to read performance counters by rdpmc instruction?" thread, you used sched_setaffinity(pid, sizeof(cpu_set), &cpu_set) together with the RDPMC instruction. Is the same possible when using MSRs (wrmsr and rdmsr) from a kernel module to monitor an independent user-space process?
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <signal.h>
#include <sched.h>
#include <string.h>
#include <errno.h>

#define FATAL(fmt,args...) do { \
        ERROR(fmt, ##args);     \
        exit(1);                \
    } while (0)

#define ERROR(fmt,args...) \
    fprintf(stderr, fmt, ##args)

/* read performance counter 'counter' into low/high */
#define rdpmc(counter,low,high) \
    __asm__ __volatile__("rdpmc" \
        : "=a" (low), "=d" (high) \
        : "c" (counter))

int cpu, nr_cpus;

void handle(int sig)
{
    FATAL("cpu %d: caught %d\n", cpu, sig);
}

int main(int argc, char *argv[])
{
    nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
    for (cpu = 0; cpu < nr_cpus; cpu++) {
        pid_t pid = fork();
        if (pid == 0) {
            /* child: fork() returned 0, so sched_setaffinity(pid, ...)
             * pins the calling (child) process to CPU 'cpu' */
            cpu_set_t cpu_set;
            CPU_ZERO(&cpu_set);
            CPU_SET(cpu, &cpu_set);
            if (sched_setaffinity(pid, sizeof(cpu_set), &cpu_set) < 0)
                FATAL("cannot set cpu affinity: %m\n");
            /* RDPMC faults if user-space counter access is disabled */
            signal(SIGSEGV, &handle);
            unsigned int low, high;
            rdpmc(0, low, high);
            ERROR("cpu %d: low %u, high %u\n", cpu, low, high);
            break;
        }
    }
    return 0;
}
The "Invert", "Edge Detect", and "Count Mask" bits are all sub-fields of the PERFEVTSEL registers, and are described in detail in Section 18.2 of Volume 3 of the Intel Architectures Software Developer's Manual (document 325384). There are not any lists of events that specifically benefit from using these features, though a number of the event descriptions in the various machine-specific sections of Chapter 19 of Volume 3 of the SWDM make reference to using these features.
- Any of the events that can increment more than once per cycle can make use of the Count Mask feature, which can be set so that the counter increments if the number of increments for the base event is greater than or equal to the Count Mask value (which can be set from 1..255, with a Count Mask of 0 interpreted as "don't use this feature").
- Setting the "Invert" bit changes the "greater than or equal to" comparison to a "less than" comparison.
- The "Invert" bit can also be used to reverse the sense of events that normally only increment by 1 per cycle, by setting the Invert bit and setting the Count Mask to 1.
- Any of the events that count cycles in which a condition is true can use the Edge detect bit to count how often the condition transitions from "false" to "true".
As a rule of thumb, if Intel documents the use of Count Mask, Edge Detect, and Invert in the comments in Chapter 19, then the feature probably works as expected in that case. For other cases where the features seem like they should work (but are not mentioned), some directed testing is probably a good idea.
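To make the field positions concrete, here is one way to pack a PERFEVTSEL value (a sketch; the helper name is mine, and the layout is the architectural one from Section 18.2):

#include <stdint.h>

/* IA32_PERFEVTSELx layout: bits 0-7 event select, 8-15 umask,
 * 16 USR, 17 OS, 18 Edge Detect, 19 pin control, 20 APIC interrupt,
 * 21 AnyThread, 22 enable, 23 Invert, 24-31 Count Mask. */
static inline uint64_t perfevtsel(uint8_t event, uint8_t umask,
                                  int usr, int os, int edge,
                                  int inv, uint8_t cmask)
{
    return (uint64_t)event
         | ((uint64_t)umask << 8)
         | ((uint64_t)!!usr  << 16)
         | ((uint64_t)!!os   << 17)
         | ((uint64_t)!!edge << 18)
         | (1ULL << 22)             /* enable */
         | ((uint64_t)!!inv  << 23)
         | ((uint64_t)cmask  << 24);
}

/* Example: UOPS_ISSUED.ANY (event 0x0E, umask 0x01) with USR+OS,
 * Invert=1, Count Mask=1 counts "stall cycles" (cycles with no uops issued):
 * perfevtsel(0x0E, 0x01, 1, 1, 0, 1, 1) == 0x01C3010E */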
There is one event that I can think of that uses the Counter Mask in a very non-intuitive way to allow the Umask values to select the "logical AND" of two conditions, rather than the "logical OR" that is the standard way that the Umask bits work. This is Event 0xA3 "CYCLE_ACTIVITY.*", and I discuss the encoding at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/514733#comment-1790739
To monitor the performance counters for a specific user-mode process from the kernel, the easiest thing to do is bind the user-mode process to a specific logical processor and then use an interprocessor interrupt to run the RDMSR or WRMSR command(s) on the target logical processor. In Linux this functionality is implemented in the "rdmsr_safe_on_cpu()" and "wrmsr_safe_on_cpu()" functions. For my 3.10 kernel, these functions are defined in $KERNEL/source/arch/x86/lib/msr-smp.c, but it can take a while to track down the source code for all of the layers that get used to implement these functions.
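As a sketch of how those are called from a module (untested; MSR 0xC1 is IA32_PMC0, and logical processor 3 is an arbitrary choice):

#include <linux/module.h>
#include <linux/kernel.h>
#include <asm/msr.h>

static int __init remote_pmc_init(void)
{
    u32 lo, hi;
    int err;

    /* Read IA32_PMC0 (0xC1) on logical processor 3; this sends an
     * interprocessor interrupt and runs RDMSR on that processor. */
    err = rdmsr_safe_on_cpu(3, 0xc1, &lo, &hi);
    if (err)
        return err;
    pr_info("cpu 3: PMC0 = 0x%08x%08x\n", hi, lo);
    return 0;
}

static void __exit remote_pmc_exit(void) { }

module_init(remote_pmc_init);
module_exit(remote_pmc_exit);
MODULE_LICENSE("GPL");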
John,
Thank you for the explanation.
John,
Thank you in advance,
When I run the following code in a kernel module at ring 0, it gives me the results below; each sample is 20 rows, and in total I have 11 runs. From run to run it gives different results, as shown. My question is: are the MSR instructions reading a different core's PMU with every new run, or are there mistakes in the code? I am also confused about which numbers reflect the real measurement.
for (j = 0; j < 20; j++) {
    /* disable and zero the counters */
    write_msr(0x38f, 0x00, 0x00);   /* IA32_PERF_GLOBAL_CTRL: disable all */
    write_msr(0xc1,  0x00, 0x00);   /* IA32_PMC0..PMC3 */
    write_msr(0xc2,  0x00, 0x00);
    write_msr(0xc3,  0x00, 0x00);
    write_msr(0xc4,  0x00, 0x00);
    write_msr(0x309, 0x00, 0x00);   /* IA32_FIXED_CTR0..CTR2 */
    write_msr(0x30a, 0x00, 0x00);
    write_msr(0x30b, 0x00, 0x00);

    /* program the programmable counters */
    write_msr(0x186, 0x004301D1, 0x00);    /* IA32_PERFEVTSEL0 */
    write_msr(0x187, 0x01c3010e, 0x00);    /* IA32_PERFEVTSEL1 */
    //write_msr(0x188, 0x054305a3, 0x00);  /* IA32_PERFEVTSEL2 */
    write_msr(0x189, 0x01c302b1, 0x00);    /* IA32_PERFEVTSEL3 */
    write_msr(0x38d, 0x222, 0x00);         /* IA32_FIXED_CTR_CTRL */
    write_msr(0x38f, 0x0f, 0x07);          /* enable PMC0-3 and fixed 0-2 */

    for (i = 0; i < 100000000; i++)
        sum += i;

    /* stop and read the counters */
    write_msr(0x38f, 0x00, 0x00);
    write_msr(0x38d, 0x00, 0x00);
    val1 = read_msr(0xc1);
    val2 = read_msr(0xc2);
    //val3 = read_msr(0xc3);
    val4 = read_msr(0xc4);
    val5 = read_msr(0x309);
    val6 = read_msr(0x30a);
    val7 = read_msr(0x30b);
    printk(KERN_ALERT "AAAAA: %7lld\t%7lld\t%7lld\t%7lld\t%7lld\t%7lld\n",
           val1, val2, val4, val5, val6, val7);
}
Results: the first three columns are MEM_LOAD_UOPS_RETIRED_L1_HIT, UOPS_ISSUED_ANY, and STALL_CYCLES_CORE (the remaining three columns are the fixed counters, MSRs 0x309-0x30B).
RUN #1
477 113 126 0 0 0
477 113 126 0 0 0
477 113 126 0 0 0
477 113 126 0 0 0
477 113 126 0 0 0
477 113 126 0 0 0
477 113 126 0 0 0
477 113 126 0 0 0
477 113 126 0 0 0
477 113 126 0 0 0
477 113 126 0 0 0
477 113 126 0 0 0
477 113 126 0 0 0
477 113 126 0 0 0
477 113 126 0 0 0
477 113 126 0 0 0
477 113 126 0 0 0
477 113 126 0 0 0
477 113 126 0 0 0
477 113 126 0 0 0
RUN #2
384 139 131 0 0 0
495 298 299 0 0 0
444 124 130 0 0 0
396 130 172 0 0 0
428 294 345 0 0 0
422 133 127 0 0 0
429 133 122 0 0 0
409 155 137 0 0 0
444 131 139 0 0 0
459 115 130 0 0 0
399 127 138 0 0 0
459 124 138 0 0 0
459 119 132 0 0 0
474 126 136 0 0 0
437 119 124 0 0 0
474 115 127 0 0 0
459 115 130 0 0 0
459 116 131 0 0 0
459 117 130 0 0 0
459 121 134 0 0 0
RUN #3
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
RUN #4
459 116 129 0 0 0
459 116 129 0 0 0
459 116 129 0 0 0
459 116 129 0 0 0
459 116 129 0 0 0
459 116 129 0 0 0
459 116 129 0 0 0
459 134 145 0 0 0
414 127 147 0 0 0
414 126 135 0 0 0
459 116 129 0 0 0
402 141 137 0 0 0
459 117 130 0 0 0
459 117 130 0 0 0
459 116 127 0 0 0
459 119 126 0 0 0
459 119 130 0 0 0
459 116 129 0 0 0
459 116 129 0 0 0
459 120 133 0 0 0
RUN #5
459 116 163 0 0 0
462 129 173 0 0 0
462 123 168 0 0 0
487 137 182 0 0 0
459 117 164 0 0 0
459 116 163 0 0 0
469 260 305 0 0 0
472 235 275 0 0 0
462 121 166 0 0 0
472 152 194 0 0 0
472 162 207 0 0 0
459 120 165 0 0 0
459 114 163 0 0 0
459 117 164 0 0 0
462 144 185 0 0 0
459 116 162 0 0 0
462 220 265 0 0 0
459 116 163 0 0 0
402 396 436 0 0 0
477 130 175 0 0 0
RUN #6
402 160 203 0 0 0
422 150 192 0 0 0
534 151 197 0 0 0
384 152 192 0 0 0
392 153 193 0 0 0
399 156 196 0 0 0
414 161 204 0 0 0
502 167 210 0 0 0
447 162 132 0 0 0
377 149 190 0 0 0
402 155 197 0 0 0
402 162 137 0 0 0
492 164 57 0 0 0
417 164 206 0 0 0
409 160 201 0 0 0
429 154 196 0 0 0
402 162 205 0 0 0
517 161 203 0 0 0
399 152 140 0 0 0
514 163 205 0 0 0
RUN #7
459 116 163 0 0 0
459 115 129 0 0 0
444 121 132 0 0 0
474 138 146 0 0 0
437 120 165 0 0 0
459 116 163 0 0 0
489 117 132 0 0 0
399 133 176 0 0 0
459 117 164 0 0 0
459 116 129 0 0 0
459 116 163 0 0 0
429 126 123 0 0 0
459 119 133 0 0 0
459 118 132 0 0 0
459 119 165 0 0 0
459 116 129 0 0 0
474 123 138 0 0 0
504 133 147 0 0 0
462 127 172 0 0 0
474 144 189 0 0 0
RUN #8
459 116 129 0 0 0
459 116 129 0 0 0
459 116 129 0 0 0
459 116 129 0 0 0
399 143 166 0 0 0
444 124 133 0 0 0
384 147 129 0 0 0
459 121 126 0 0 0
462 118 131 0 0 0
459 120 134 0 0 0
459 119 133 0 0 0
444 137 126 0 0 0
459 116 129 0 0 0
459 119 127 0 0 0
414 127 132 0 0 0
459 122 137 0 0 0
459 126 173 0 0 0
474 136 155 0 0 0
392 138 132 0 0 0
474 143 159 0 0 0
RUN #9
457 170 179 0 0 0
459 119 127 0 0 0
472 161 174 0 0 0
459 117 130 0 0 0
459 118 131 0 0 0
487 175 181 0 0 0
472 215 220 0 0 0
459 117 130 0 0 0
462 129 142 0 0 0
462 125 135 0 0 0
472 161 172 0 0 0
459 120 129 0 0 0
454 173 174 0 0 0
457 190 193 0 0 0
459 116 131 0 0 0
459 122 131 0 0 0
462 146 154 0 0 0
437 124 131 0 0 0
459 116 128 0 0 0
529 266 281 0 0 0
RUN #10
459 120 128 0 0 0
472 133 139 0 0 0
484 137 147 0 0 0
444 128 136 0 0 0
462 128 140 0 0 0
462 130 142 0 0 0
459 118 130 0 0 0
459 116 129 0 0 0
469 156 162 0 0 0
459 123 133 0 0 0
462 135 145 0 0 0
477 159 166 0 0 0
459 127 140 0 0 0
459 123 133 0 0 0
479 294 298 0 0 0
447 129 138 0 0 0
477 218 219 0 0 0
472 159 171 0 0 0
469 182 194 0 0 0
472 125 137 0 0 0
RUN #11
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
477 113 160 0 0 0
I don't know how the kernel handles thread affinity for itself, so I don't know if a kernel process can be spontaneously moved in the middle of a run, or if the different runs are going to be executed on uncontrolled logical processors. The Linux kernel typically uses interfaces like these prototyped in arch/x86/include/asm/msr.h to set up an interprocessor interrupt to ensure that the MSR is read on the desired target logical processor. If you know that you have pinned the kernel thread to a single logical processor (assuming that is possible), then these calls should not be necessary.
#ifdef CONFIG_SMP
int rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
int wrmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 l, u32 h);
int rdmsrl_on_cpu(unsigned int cpu, u32 msr_no, u64 *q);
int wrmsrl_on_cpu(unsigned int cpu, u32 msr_no, u64 q);
void rdmsr_on_cpus(const struct cpumask *mask, u32 msr_no, struct msr *msrs);
void wrmsr_on_cpus(const struct cpumask *mask, u32 msr_no, struct msr *msrs);
int rdmsr_safe_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
int wrmsr_safe_on_cpu(unsigned int cpu, u32 msr_no, u32 l, u32 h);
int rdmsrl_safe_on_cpu(unsigned int cpu, u32 msr_no, u64 *q);
int wrmsrl_safe_on_cpu(unsigned int cpu, u32 msr_no, u64 q);
int rdmsr_safe_regs_on_cpu(unsigned int cpu, u32 regs[8]);
int wrmsr_safe_regs_on_cpu(unsigned int cpu, u32 regs[8]);
#else
If you are running any Linux kernel that supports the RDTSCP instruction, you can use the low-order 12 bits of the %ecx register returned by RDTSCP to see which logical processor the process was running on when the instruction was executed. Since this is entirely a user-space instruction, there is no reason for it to trigger a scheduling event (which might happen if you were to ask the OS where your process was running, for example). In my (3.10) Linux kernels, this auxiliary register is set up using the "write_rdtscp_aux()" function in the file arch/x86/kernel/vsyscall_64.c. The low-order 12 bits of the aux register contain the (global) logical processor number, with the NUMA node number in the bits directly above the bottom 12.
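A user-space sketch of that check (the cpu/node split assumes the Linux encoding described above):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t lo, hi, aux;

    /* RDTSCP returns the TSC in EDX:EAX and IA32_TSC_AUX in ECX */
    __asm__ __volatile__("rdtscp" : "=a" (lo), "=d" (hi), "=c" (aux));

    printf("tsc = %llu, cpu = %u, node = %u\n",
           ((unsigned long long)hi << 32) | lo,
           aux & 0xfff,   /* low 12 bits: logical processor number */
           aux >> 12);    /* bits above: NUMA node number */
    return 0;
}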
John,
Thank you for your comments and suggestions.
I have tried to monitor the specific logical processor (as you explained in your last post) to which the targeted user-space code is assigned, say logical processor #3 (as shown in the following code). But the code in the kernel module, which monitors the targeted user-space code, never gets a chance to run on the same logical processor (3), because that processor is busy running the targeted code. I understand that an interprocessor interrupt (IPI) is used to communicate between two or more processors. But in practice I am confused about how to target a logical processor with an IPI without context switching between the two processes (the targeted code in user space and the monitoring code in the kernel module), because two processes cannot use the same logical processor simultaneously unless one of them is switched to another logical processor. I have tried to find existing code for this purpose, but unfortunately could not. Could you refer me to, or provide, sample code that uses the "rdmsr_safe_on_cpu()" and "wrmsr_safe_on_cpu()" functions to direct the kernel module at the targeted processor, please?
/* kernel module: use the RDTSCP instruction to identify the logical
 * processor number that the module code is currently running on */
#define get_rdtscp(lo, hi, pro) \
    __asm__ __volatile__("rdtscp" : "=a" (lo), "=d" (hi), "=c" (pro))

/* targeted code in user space: use sched_setaffinity() to bind itself
 * to a particular logical processor */
int cpu = 3;            /* logical processor number (renamed from 'pid' for clarity) */
cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(cpu, &mask);
sched_setaffinity(0, sizeof(mask), &mask);
When you pin a user process to a logical processor, that does not prevent the OS from also running on that logical processor -- it simply prevents the user process from being run anywhere else when the kernel interrupts it. When the kernel sends an interprocessor interrupt to the target core, it temporarily displaces the target task; the interrupt handler can then read the performance counters locally, and when the IPI handler finishes, the scheduler restarts the target code on that logical processor.
This should work as long as the context-switching code does not include any fiddling with the performance counter registers. If the context-switching code does save the target process's performance counter values as part of the context switch, then you don't need to read them again -- you just need to find the kernel structure where the context-switching code has saved them. Some context-switching implementations save the current process's counter values and fill the PMC registers with something else (possibly zeros). In this case you must find the values in the kernel data structure for that process, since once the interrupt handler starts, the counters no longer have useful information. In other implementations the context-switch code reads the counter values and "freezes" the counters, but does not modify the count values. In this case you can either look for the values in the kernel data structure for the process, or re-read the counter values.
I have seen lots of implementations of this infrastructure over the years in IRIX, AIX, and Linux, but the current "perf events" infrastructure in Linux has more layers than my brain is able to follow, so I don't really understand how it is implemented. The source is all available -- I use the kernel source browser at http://lxr.free-electrons.com/source/kernel/ when I need to look at how things change across many kernel generations, but I prefer to work with a local copy of the Centos 7.2 kernel source that we use on our newer systems. Understanding the kernel is very labor- and time-intensive, and I often give up before I am able to find the answers I am looking for....
Two quick notes:
One option when you want to use more than the number of counters provided by the hardware is simply to run your code under test multiple times, rotating through a new set of counters each time until you have done them all. It takes longer than multiplexing, but avoids many of the issues multiplexing involves. Of course, this only works well if your load is highly repeatable (which usually means very small and deterministic).
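A sketch of that rotation (program_counters() and run_and_record() are hypothetical stand-ins for the WRMSR-in-a-module path discussed earlier, and the PERFEVTSEL encodings in the second pass are illustrative, not verified):

#include <stdint.h>
#include <stdio.h>

/* four programmable counters per pass, rotated across repeated runs */
static const uint32_t passes[][4] = {
    { 0x004301D1, 0x01C3010E, 0x01C302B1, 0x0043010E },  /* pass 1 */
    { 0x004302D1, 0x004308D1, 0x004381D0, 0x004382D0 },  /* pass 2 (illustrative) */
};

static void program_counters(const uint32_t sel[4])
{
    /* stub: in practice, WRMSR to IA32_PERFEVTSEL0..3 from a kernel
     * module or through the msr driver, as discussed above */
    (void)sel;
}

static void run_and_record(int pass)
{
    /* stub: run the deterministic code under test and save the
     * before/after counter deltas for this pass */
    printf("pass %d done\n", pass);
}

int main(void)
{
    for (int pass = 0; pass < 2; pass++) {
        program_counters(passes[pass]);
        run_and_record(pass);
    }
    return 0;
}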
Since you are interested in reading counters with as low latency as possible, you might consider libpfc, which was made to do exactly that. You need to load an (included) kernel module that gives user mode access to read/write the performance counters, but after that you can read the PMU from user space with low latency (the 30-ish cycles John mentioned above). It is open source and liberally licensed. As a tradeoff for the low latency and simplicity, it doesn't have any of the event multiplexing, counter virtualization, etc., that perf_events offers.
Thank you very much Travis for the interesting notes.
Regarding the first note: I agree, but what about when you want to monitor five related events simultaneously? In your approach you would take four events together in a first test, then take the fifth one separately in a second test, and then combine the data. In some cases you cannot guarantee getting the same values across multiple tests, so the combination may not give accurate results either.
Regarding the second note: yes, I had found this too; it is really interesting.
Again thank you for your notes.
Travis,
This libpfc tool is useful if you want to pin to any core of the CPU and have it monitored by libpfc's kernel module.
Zirak wrote:
Thank you very much Travis for the interesting notes.
Regarding the first note: I agree, but what about when you want to monitor five related events simultaneously? In your approach you would take four events together in a first test, then take the fifth one separately in a second test, and then combine the data. In some cases you cannot guarantee getting the same values across multiple tests, so the combination may not give accurate results either.
Yes, as I mentioned, your load has to be deterministic (i.e., if it has some random behavior, make sure you seed the generators with the same seed, etc.) for this to work. Still, this is one of the two best approaches, and both have their downsides: multiplexing may fail dramatically if the behavior of the code under test changes, in which case it may "sample" the run inaccurately, while repeated runs take longer and may end up combining data from runs that behaved differently. A nice aspect of the latter, however, is that you can measure this: you can collect the same counters over several runs of your application and observe their stability, so you know whether the approach is appropriate. With multiplexing you mostly have to hope.