Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Capturing multiple events simultaneously using RDPMC instruction

Zirak
Beginner

Modern CPUs provide several special-purpose registers that capture hardware events while a program or a piece of code is running, which can be used to optimize software performance. Utilizing these registers requires knowledge of low-level programming such as assembly language. However, existing libraries such as PAPI have been developed to abstract these details.

I tried to take samples (say 10,000 samples) from a piece of code using PAPI, and I noticed that instrumenting the code with PAPI adds a significant delay to its execution, which degrades the results. The reason for using PAPI is its ability to capture multiple events simultaneously. Is it possible to capture three events (L1_DATA_CACHE_MISS, L2_MISS and L3_MISS) by executing three consecutive RDPMC instructions before and after the targeted code? Is there any other way of writing the following code to get accurate results? The code below demonstrates the case.

 

unsigned long get_rdpmc(int event)
{
  // Note: the value passed in ECX is the counter number (0..N-1); the counter
  // must already have been programmed with the desired event before RDPMC is used.
  unsigned int a = 0, d = 0;   // unsigned, so the 64-bit widening below does not sign-extend

  __asm__ volatile("rdpmc" : "=a" (a), "=d" (d) : "c" (event));

  return ((unsigned long)a) | (((unsigned long)d) << 32);
}

int main(void)
{
  unsigned long before_L1_MISS, before_L2_MISS, before_L3_MISS;
  unsigned long after_L1_MISS, after_L2_MISS, after_L3_MISS;
  
  //sampling
  for(int i=0; i<100000; i++)
  {
    before_L1_MISS = get_rdpmc(L1_MISS);
    before_L2_MISS = get_rdpmc(L2_MISS);
    before_L3_MISS = get_rdpmc(L3_MISS);

    foo();

    after_L1_MISS = get_rdpmc(L1_MISS);
    after_L2_MISS = get_rdpmc(L2_MISS);
    after_L3_MISS = get_rdpmc(L3_MISS);
  }
  return 0;
}

 

McCalpinJohn
Honored Contributor III

The inline RDPMC instruction is the lowest-latency approach available.  This is a microcoded instruction, so it is not instantaneous, but it is reasonably fast -- something in the range of 24-40 cycles, depending on the processor and the event selected.  It has been a while since I looked at the overhead of repeated, back-to-back RDPMC instructions, but I don't think that there is a lot of overlap.  

If you want the lowest possible overhead, it is a good idea to make sure that the variables that you are writing into are dirty in the L1 Data Cache before you execute the RDPMC instructions.   Writing a zero to the 64-bit target address is enough, but you will want to check the assembly code to make sure that the compiler does not eliminate this store as "dead code".
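A minimal sketch of that idea, reusing the get_rdpmc() helper from the original post (the buffer name and the counter number 0 are illustrative, not from this thread):

unsigned long pmc_val[8];

  pmc_val[0] = 0;               /* warm-up store: the line is now dirty in the L1 Data Cache   */
  pmc_val[0] = get_rdpmc(0);    /* the store of the counter value hits an L1-resident line     */
  /* Check the generated assembly: the compiler may try to drop the first store as dead code. */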

When I am feeling particularly cranky about overheads, I modify the macro so that it only saves the low-order 32-bits of the result.   This saves a shift and OR, and it allows me to fit twice as many PMC values in the same fraction of the L1 Data Cache.  I have also created separate inline assembly macros for counters 0-7 so there won't be a chance of a memory operation to fetch the counter number, but I doubt this makes any difference in most use cases.
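For illustration, a sketch of such a macro for counter 0 (the macro and variable names are mine); it keeps only the low 32 bits returned in EAX and marks EDX as clobbered:

// Hedged sketch of a "cheap" reader: hard-coded counter number, low 32 bits only.
#define RDPMC32_0(dest)              \
  __asm__ volatile("rdpmc"           \
      : "=a" (dest)                  \
      : "c" (0)                      \
      : "edx")

/* usage: unsigned int t0; RDPMC32_0(t0); */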

Given the lack of ordering guarantees for the RDPMC instruction, it is not clear that having lower overhead would actually make much difference.  See my discussion of the difference between the RDTSC and RDTSCP instructions (and the timing plot) at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/697093#comment-1886115 as an example of how tricky understanding an OOO processor can be at this level of granularity.
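If stronger ordering around the measured region is wanted, one technique sometimes used (a sketch, not something from this post, and it adds its own overhead) is to bracket the RDPMC with LFENCE:

static inline unsigned long rdpmc_ordered(unsigned int counter)
{
  unsigned int lo, hi;
  __asm__ volatile("lfence\n\t"      /* wait for earlier instructions to complete locally */
                   "rdpmc\n\t"
                   "lfence"          /* keep later instructions from starting early        */
                   : "=a" (lo), "=d" (hi)
                   : "c" (counter));
  return ((unsigned long)lo) | (((unsigned long)hi) << 32);
}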

One place that I really enjoyed having a very low-latency RDTSC instruction was on Xeon Phi (first generation, Knights Corner).  That was an in-order core with a 5-cycle RDTSC latency.  This made it possible to measure the latency of individual load instructions (provided that you were very careful that when you stored the results you did not miss in the L1 Data Cache).

Zirak
Beginner

John,

Thank you for the explanation. The problem is that I am interested in events other than just the cycle count, such as cache misses (L1, L2 and L3), TLB misses, L2_LINES_IN, L2_LINES_OUT, L2_LD, etc. I need to test many of them to find the most useful events for measuring the foo() function.

McCalpinJohn
Honored Contributor III

It is certainly common to want to review many core performance counter events -- there are at least hundreds of available events on most Intel processors if you include combinations of Umask values and cases where the Invert, Edge Detect, and/or CountMask features are relevant.  Intel processors support 2, 4, or 8 programmable performance counters per core, so if you want more events than that, you have to change the counter programming.

Unfortunately the Intel architecture makes it impossible to change the performance counter event select programming from user space at low latency/overhead.  Writing the PERFEVTSEL registers requires the WRMSR instruction, which can only be executed in kernel mode.   There are many papers out there showing PAPI overheads for these operations, and I think that PAPI includes a test program to measure these overheads as well.

Both PAPI and the underlying "perf events" infrastructure support more events than counters by time-slicing/multiplexing.   This is a great way to get lots of information about a long-running (i.e., minutes) code (unless you get very unlucky and have a strong correlation between the timing of changes in the characteristics of the application and the interval at which the performance counter event programming is changed), but multiplexing has increasing uncertainty and overhead for shorter measurement intervals.  The lowest overhead for multiplexing comes from implementing the performance counter save/change/restore code in the existing scheduling timer interrupt (which occurs every millisecond on most recent Linux systems).  I don't know if "perf events" piggybacks on this interrupt or if it schedules its own separate timer interrupts, but even in the best case you are looking at overheads in the range of 5000-10000 cycles (my recollection of the overhead of the scheduler interrupt) and millisecond granularity for multiplexing the counters.

If you have a very short section of code that does not have to be run in user space, the quickest way to test it with a wide variety of counters is to put it inside a loadable kernel module so that you can use WRMSR instructions directly.   You need to make sure that the kernel thread is pinned to a particular core so that you can use the "native_write_msr()" function (or inline assembly) instead of one of the cross-processor msr write calls, since those set up expensive interprocessor interrupts to make sure that the WRMSR instruction is run on the desired core.
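As a rough sketch of that approach (the event encoding, MSR addresses, and all names here are examples for illustration, not a drop-in implementation), a kernel thread bound with kthread_bind() can program and read a counter with local MSR accesses:

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/err.h>
#include <asm/msr.h>

#define IA32_PERFEVTSEL0      0x186
#define IA32_PMC0             0x0c1
#define IA32_PERF_GLOBAL_CTRL 0x38f

static struct task_struct *pmu_task;

static int pmu_test_thread(void *data)
{
    u64 before, after;
    volatile long sum = 0;
    long i;

    /* Program PMC0 with an example event: LONGEST_LAT_CACHE.MISS
     * (event 0x2E, umask 0x41), USR+OS+EN -> 0x0043412e.
     * This assumes nothing else (perf, the NMI watchdog) owns the counter. */
    native_write_msr(IA32_PERFEVTSEL0, 0x0043412e, 0);
    native_write_msr(IA32_PERF_GLOBAL_CTRL, 0x1, 0);   /* enable PMC0 */

    rdmsrl(IA32_PMC0, before);
    for (i = 0; i < 1000000; i++)      /* code under test (placeholder) */
        sum += i;
    rdmsrl(IA32_PMC0, after);

    pr_info("PMC0 delta: %llu\n", (unsigned long long)(after - before));
    return 0;
}

static int __init pmu_test_init(void)
{
    pmu_task = kthread_create(pmu_test_thread, NULL, "pmu_test");
    if (IS_ERR(pmu_task))
        return PTR_ERR(pmu_task);
    kthread_bind(pmu_task, 3);         /* pin to logical processor 3 (example) */
    wake_up_process(pmu_task);
    return 0;
}

static void __exit pmu_test_exit(void) { }

module_init(pmu_test_init);
module_exit(pmu_test_exit);
MODULE_LICENSE("GPL");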

Zirak
Beginner

Thank you, John, for the valuable information. I have used the MSR interface (the WRMSR and RDMSR instructions) to read specific events, but I am not sure which events require setting Invert, Edge Detect, and/or CMask. I have not found a document that describes these bits and their relation to the existing events. Is there any source that explains when and how to use them?

Regarding the monitoring procedure: if I want to monitor a user-space process A from a kernel module, how can I pin the module to the processor/core that process A is currently running on, so that I can monitor its activity with events such as MEM_LOAD_UOPS_RETIRED.L1_HIT, MEM_LOAD_UOPS_RETIRED.L1_MISS, MEM_UOPS_RETIRED.ALL_LOADS, etc.? In your sample code from the "How to read performance counters by rdpmc instruction?" post, you used sched_setaffinity(pid, sizeof(cpu_set), &cpu_set) together with the RDPMC instruction. Is the same thing possible when using the MSR interface (WRMSR and RDMSR) from a kernel module to monitor an independent user-space process?

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <signal.h>
#include <sched.h>
#include <string.h>
#include <errno.h>

#define FATAL(fmt,args...) do {                \
    ERROR(fmt, ##args);                        \
    exit(1);                                   \
  } while (0)

#define ERROR(fmt,args...) \
    fprintf(stderr, fmt, ##args)

#define rdpmc(counter,low,high) \
     __asm__ __volatile__("rdpmc" \
        : "=a" (low), "=d" (high) \
        : "c" (counter))

int cpu, nr_cpus;

void handle ( int sig )
{
  FATAL("cpu %d: caught %d\n", cpu, sig);
}

int main ( int argc, char *argv[] )
{
  nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
  for (cpu = 0; cpu < nr_cpus; cpu++) {

    pid_t pid = fork();
    if (pid == 0) {
      cpu_set_t cpu_set;
      CPU_ZERO(&cpu_set);
      CPU_SET(cpu, &cpu_set);
      if (sched_setaffinity(pid, sizeof(cpu_set), &cpu_set) < 0)
        FATAL("cannot set cpu affinity: %m\n");

      signal(SIGSEGV, &handle);

      unsigned int low, high;
      rdpmc(0, low, high);

      ERROR("cpu %d: low %u, high %u\n", cpu, low, high);
      break;
    }
  }

  return 0;
}

 

McCalpinJohn
Honored Contributor III

The "Invert", "Edge Detect", and "Count Mask" bits are all sub-fields of the PERFEVTSEL registers, and are described in detail in Section 18.2 of Volume 3 of the Intel Architectures Software Developer's Manual (document 325384).   There are not any lists of events that specifically benefit from using these features, though a number of the event descriptions in the various machine-specific sections of Chapter 19 of Volume 3 of the SWDM make reference to using these features. 

  • Any of the events that can increment more than once per cycle can make use of the Count Mask feature, which can be set so that the counter increments if the number of increments for the base event is greater than or equal to the Count Mask value (which can be set from 1..255, with a Count Mask of 0 interpreted as "don't use this feature").
    • Setting the "Invert" bit changes the "greater than or equal to" comparison to a "less than" comparison.
  • The "Invert" bit can also be used to reverse the sense of events that normally only increment by 1 per cycle, by setting the Invert bit and setting the Count Mask to 1.
  • Any of the events that count cycles in which a condition is true can use the Edge detect bit to count how often the condition transitions from "false" to "true".

As a rule of thumb, if Intel documents the use of Count Mask, Edge Detect, and Invert in the comments in Chapter 19, then the feature probably works as expected in that case.   For other cases where the features seem like they should work (but are not mentioned), some directed testing is probably a good idea.
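To make the field layout concrete, here is a small sketch of how a PERFEVTSEL value can be assembled from these sub-fields (the helper name is mine; bit positions are as documented in Section 18.2, with some fields such as PC and the interrupt-enable bit omitted):

#include <stdint.h>

/* Assemble an IA32_PERFEVTSELx value (selected fields):
 *   [7:0] event select, [15:8] umask, [16] USR, [17] OS, [18] edge detect,
 *   [22] enable, [23] invert, [31:24] counter mask.                        */
static inline uint64_t perfevtsel(uint8_t event, uint8_t umask,
                                  int usr, int os, int edge,
                                  int inv, uint8_t cmask)
{
    return (uint64_t)event
         | ((uint64_t)umask << 8)
         | ((uint64_t)(usr  ? 1 : 0) << 16)
         | ((uint64_t)(os   ? 1 : 0) << 17)
         | ((uint64_t)(edge ? 1 : 0) << 18)
         | (1ULL << 22)                      /* enable */
         | ((uint64_t)(inv  ? 1 : 0) << 23)
         | ((uint64_t)cmask << 24);
}

/* Example: UOPS_ISSUED.ANY (event 0x0E, umask 0x01) with CMASK=1 and INV=1
 * counts cycles in which fewer than one uop was issued (a stall-cycle proxy):
 *   perfevtsel(0x0e, 0x01, 1, 1, 0, 1, 1) == 0x01c3010e                     */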

There is one event that I can think of that uses the Counter Mask in a very non-intuitive way to allow the Umask values to select the "logical AND" of two conditions, rather than the "logical OR" that is the standard way that the Umask bits work.  This is Event 0xA3 "CYCLE_ACTIVITY.*", and I discuss the encoding at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/514733#comment-1790739

To monitor the performance counters for a specific user-mode process from the kernel, the easiest thing to do is bind the user-mode process to a specific logical processor and then use an interprocessor interrupt to run the RDMSR or WRMSR command(s) on the target logical processor.  In Linux this functionality is implemented in the "rdmsr_safe_on_cpu()" and "wrmsr_safe_on_cpu()" functions.  For my 3.10 kernel, these functions are defined in $KERNEL/source/arch/x86/lib/msr-smp.c, but it can take a while to track down the source code for all of the layers that get used to implement these functions.
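A hedged sketch of what that looks like inside a module (the MSR addresses and the event encoding are examples only, and the return values should be checked in real code):

#include <asm/msr.h>

#define IA32_PERFEVTSEL0      0x186
#define IA32_PMC0             0x0c1
#define IA32_PERF_GLOBAL_CTRL 0x38f

static u64 read_pmc0_on(unsigned int cpu)
{
    u32 lo, hi;

    /* Program PMC0 on the target core: MEM_LOAD_UOPS_RETIRED.L1_HIT
     * (event 0xD1, umask 0x01, USR+OS+EN) -- example encoding only.      */
    wrmsr_safe_on_cpu(cpu, IA32_PERFEVTSEL0, 0x004301d1, 0);
    wrmsr_safe_on_cpu(cpu, IA32_PERF_GLOBAL_CTRL, 0x1, 0);  /* enable PMC0 */

    /* Later: read the counter back.  Each call generates an IPI, so it
     * costs thousands of cycles and briefly interrupts the target task.   */
    rdmsr_safe_on_cpu(cpu, IA32_PMC0, &lo, &hi);
    return ((u64)hi << 32) | lo;
}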

Zirak
Beginner

John,

Thank you for the explanation. 

Zirak
Beginner

John,

Thank you in advance,

When I run the following code in a kernel module (ring 0), I get the results shown below; each run produces 20 rows, and in total I have 11 runs. From run to run the results differ, as shown. My question is: are the MSR accesses reading a different core's PMU on each new run, or is there a mistake in the code? I am also confused about which of these results reflect the real measurement.

  for (j = 0; j < 20; j++)
  {
    // Prepare and reset the counters
    write_msr(0x38f, 0x00, 0x00);        // IA32_PERF_GLOBAL_CTRL: disable all counters
    write_msr(0xc1,  0x00, 0x00);        // IA32_PMC0..PMC3: zero the programmable counters
    write_msr(0xc2,  0x00, 0x00);
    write_msr(0xc3,  0x00, 0x00);
    write_msr(0xc4,  0x00, 0x00);
    write_msr(0x309, 0x00, 0x00);        // IA32_FIXED_CTR0..2: zero the fixed counters
    write_msr(0x30a, 0x00, 0x00);
    write_msr(0x30b, 0x00, 0x00);
    write_msr(0x186, 0x004301D1, 0x00);  // IA32_PERFEVTSEL0: event 0xD1, umask 0x01 (MEM_LOAD_UOPS_RETIRED.L1_HIT)
    write_msr(0x187, 0x01c3010e, 0x00);  // IA32_PERFEVTSEL1: event 0x0E, umask 0x01, CMASK=1, INV=1
    //write_msr(0x188, 0x054305a3, 0x00);
    write_msr(0x189, 0x01c302b1, 0x00);  // IA32_PERFEVTSEL3: event 0xB1, umask 0x02, CMASK=1, INV=1

    write_msr(0x38d, 0x222, 0x00);       // IA32_FIXED_CTR_CTRL: USR-only enable for fixed counters 0-2
    write_msr(0x38f, 0x0f, 0x07);        // IA32_PERF_GLOBAL_CTRL: enable PMC0-3 and fixed counters 0-2

    for (i = 0; i < 100000000; i++)
      sum += i;

    // Stop and read the counters
    write_msr(0x38f, 0x00, 0x00);        // disable all counters
    write_msr(0x38d, 0x00, 0x00);        // disable the fixed counters
    val1 = read_msr(0xc1);               // PMC0
    val2 = read_msr(0xc2);               // PMC1
    //val3 = read_msr(0xc3);
    val4 = read_msr(0xc4);               // PMC3
    val5 = read_msr(0x309);              // fixed counter 0 (INST_RETIRED.ANY)
    val6 = read_msr(0x30a);              // fixed counter 1 (CPU_CLK_UNHALTED.THREAD)
    val7 = read_msr(0x30b);              // fixed counter 2 (CPU_CLK_UNHALTED.REF_TSC)

    printk(KERN_ALERT "AAAAA:    %7lld\t%7lld\t%7lld\t%7lld\t%7lld\t%7lld\n",
           val1, val2, val4, val5, val6, val7);
  }
  

Results: the first three columns are MEM_LOAD_UOPS_RETIRED_L1_HIT, UOPS_ISSUED_ANY and STALL_CYCLES_CORE; the last three columns are the fixed counters.

 

RUN #1
477           113     126       0       0       0
477           113     126       0       0       0
477           113     126       0       0       0
477           113     126       0       0       0
477           113     126       0       0       0
477           113     126       0       0       0
477           113     126       0       0       0
477           113     126       0       0       0
477           113     126       0       0       0
477           113     126       0       0       0
477           113     126       0       0       0
477           113     126       0       0       0
477           113     126       0       0       0
477           113     126       0       0       0
477           113     126       0       0       0
477           113     126       0       0       0
477           113     126       0       0       0
477           113     126       0       0       0
477           113     126       0       0       0
477           113     126       0       0       0

RUN #2
384           139     131       0       0       0
495           298     299       0       0       0
444           124     130       0       0       0
396           130     172       0       0       0
428           294     345       0       0       0
422           133     127       0       0       0
429           133     122       0       0       0
409           155     137       0       0       0
444           131     139       0       0       0
459           115     130       0       0       0
399           127     138       0       0       0
459           124     138       0       0       0
459           119     132       0       0       0
474           126     136       0       0       0
437           119     124       0       0       0
474           115     127       0       0       0
459           115     130       0       0       0
459           116     131       0       0       0
459           117     130       0       0       0
459           121     134       0       0       0

RUN #3
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0

RUN #4
459           116     129       0       0       0
459           116     129       0       0       0
459           116     129       0       0       0
459           116     129       0       0       0
459           116     129       0       0       0
459           116     129       0       0       0
459           116     129       0       0       0
459           134     145       0       0       0
414           127     147       0       0       0
414           126     135       0       0       0
459           116     129       0       0       0
402           141     137       0       0       0
459           117     130       0       0       0
459           117     130       0       0       0
459           116     127       0       0       0
459           119     126       0       0       0
459           119     130       0       0       0
459           116     129       0       0       0
459           116     129       0       0       0
459           120     133       0       0       0

RUN #5
459           116     163       0       0       0
462           129     173       0       0       0
462           123     168       0       0       0
487           137     182       0       0       0
459           117     164       0       0       0
459           116     163       0       0       0
469           260     305       0       0       0
472           235     275       0       0       0
462           121     166       0       0       0
472           152     194       0       0       0
472           162     207       0       0       0
459           120     165       0       0       0
459           114     163       0       0       0
459           117     164       0       0       0
462           144     185       0       0       0
459           116     162       0       0       0
462           220     265       0       0       0
459           116     163       0       0       0
402           396     436       0       0       0
477           130     175       0       0       0

RUN #6
402           160     203       0       0       0
422           150     192       0       0       0
534           151     197       0       0       0
384           152     192       0       0       0
392           153     193       0       0       0
399           156     196       0       0       0
414           161     204       0       0       0
502           167     210       0       0       0
447           162     132       0       0       0
377           149     190       0       0       0
402           155     197       0       0       0
402           162     137       0       0       0
492           164      57       0       0       0
417           164     206       0       0       0
409           160     201       0       0       0
429           154     196       0       0       0
402           162     205       0       0       0
517           161     203       0       0       0
399           152     140       0       0       0
514           163     205       0       0       0

RUN #7
459           116     163       0       0       0
459           115     129       0       0       0
444           121     132       0       0       0
474           138     146       0       0       0
437           120     165       0       0       0
459           116     163       0       0       0
489           117     132       0       0       0
399           133     176       0       0       0
459           117     164       0       0       0
459           116     129       0       0       0
459           116     163       0       0       0
429           126     123       0       0       0
459           119     133       0       0       0
459           118     132       0       0       0
459           119     165       0       0       0
459           116     129       0       0       0
474           123     138       0       0       0
504           133     147       0       0       0
462           127     172       0       0       0
474           144     189       0       0       0

RUN #8
459           116     129       0       0       0
459           116     129       0       0       0
459           116     129       0       0       0
459           116     129       0       0       0
399           143     166       0       0       0
444           124     133       0       0       0
384           147     129       0       0       0
459           121     126       0       0       0
462           118     131       0       0       0
459           120     134       0       0       0
459           119     133       0       0       0
444           137     126       0       0       0
459           116     129       0       0       0
459           119     127       0       0       0
414           127     132       0       0       0
459           122     137       0       0       0
459           126     173       0       0       0
474           136     155       0       0       0
392           138     132       0       0       0
474           143     159       0       0       0

RUN #9
457           170     179       0       0       0
459           119     127       0       0       0
472           161     174       0       0       0
459           117     130       0       0       0
459           118     131       0       0       0
487           175     181       0       0       0
472           215     220       0       0       0
459           117     130       0       0       0
462           129     142       0       0       0
462           125     135       0       0       0
472           161     172       0       0       0
459           120     129       0       0       0
454           173     174       0       0       0
457           190     193       0       0       0
459           116     131       0       0       0
459           122     131       0       0       0
462           146     154       0       0       0
437           124     131       0       0       0
459           116     128       0       0       0
529           266     281       0       0       0

RUN #10
459           120     128       0       0       0
472           133     139       0       0       0
484           137     147       0       0       0
444           128     136       0       0       0
462           128     140       0       0       0
462           130     142       0       0       0
459           118     130       0       0       0
459           116     129       0       0       0
469           156     162       0       0       0
459           123     133       0       0       0
462           135     145       0       0       0
477           159     166       0       0       0
459           127     140       0       0       0
459           123     133       0       0       0
479           294     298       0       0       0
447           129     138       0       0       0
477           218     219       0       0       0
472           159     171       0       0       0
469           182     194       0       0       0
472           125     137       0       0       0

RUN #11
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0
477           113     160       0       0       0

 

McCalpinJohn
Honored Contributor III

I don't know how the kernel handles thread affinity for itself, so I don't know if a kernel process can be spontaneously moved in the middle of a run, or if the different runs are going to be executed on uncontrolled logical processors.   The Linux kernel typically uses interfaces like these prototyped in arch/x86/include/asm/msr.h to set up an interprocessor interrupt to ensure that the MSR is read on the desired target logical processor.  If you know that you have pinned the kernel thread to a single logical processor (assuming that is possible), then these calls should not be necessary.

#ifdef CONFIG_SMP
int rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
int wrmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 l, u32 h);
int rdmsrl_on_cpu(unsigned int cpu, u32 msr_no, u64 *q);
int wrmsrl_on_cpu(unsigned int cpu, u32 msr_no, u64 q);
void rdmsr_on_cpus(const struct cpumask *mask, u32 msr_no, struct msr *msrs);
void wrmsr_on_cpus(const struct cpumask *mask, u32 msr_no, struct msr *msrs);
int rdmsr_safe_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
int wrmsr_safe_on_cpu(unsigned int cpu, u32 msr_no, u32 l, u32 h);
int rdmsrl_safe_on_cpu(unsigned int cpu, u32 msr_no, u64 *q);
int wrmsrl_safe_on_cpu(unsigned int cpu, u32 msr_no, u64 q);
int rdmsr_safe_regs_on_cpu(unsigned int cpu, u32 regs[8]);
int wrmsr_safe_regs_on_cpu(unsigned int cpu, u32 regs[8]);
#else 

If you are running any Linux kernel that supports the RDTSCP instruction, you can use the low-order 12 bits of the %ecx register returned by the RDTSCP instruction to see what logical processor the process was running on when the RDTSCP instruction was executed.   Since this is entirely a user-space instruction, there is no reason for it to trigger a scheduling event (which might happen if you were to ask the OS where your process was running, for example).  In my (3.10) Linux kernels, this auxiliary register is set up using the "write_rdtscp_aux()" function in the file arch/x86/kernel/vsyscall_64.c.  The low-order 12 bits of the aux register contain the (global) logical processor number, with the NUMA node number in the bits directly above the bottom 12.
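For illustration, a small user-space sketch that recovers those fields with the compiler's __rdtscp() intrinsic (the field layout is as described above; all names here are mine):

#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
  unsigned int aux;
  unsigned long long tsc = __rdtscp(&aux);   /* aux <- IA32_TSC_AUX */

  unsigned int cpu  = aux & 0xfff;           /* logical processor number          */
  unsigned int node = aux >> 12;             /* NUMA node (bits above the low 12) */

  printf("tsc=%llu cpu=%u node=%u\n", tsc, cpu, node);
  return 0;
}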

Zirak
Beginner

John,

Thank you for your comments and suggestions.

I have tried to monitor the specific logical processor (as you explained in your last post) that the targeted user-space code is assigned to, say logical processor #3 (as shown in the following code). But the kernel-module code that monitors the targeted user-space code never gets a chance to run on that same logical processor (3), because it is busy running the targeted code. I understand that an interprocessor interrupt (IPI) is used for communication between processors. In practice, though, I am confused about how to target a logical processor with an IPI without context switching between the two pieces of code (the targeted user-space code and the monitoring code in the kernel module), since two processes cannot use the same logical processor simultaneously unless one of them is moved to another logical processor. I have tried to find existing code for this purpose but could not. Could you point me to, or provide, sample code that uses the "rdmsr_safe_on_cpu()" and "wrmsr_safe_on_cpu()" functions to target the logical processor that the kernel module should monitor?

//kernel module uses rdtscp instruction to identify the logical processor # that the module code is running on
#define get_rdtscp(lo, hi, pro) \
  __asm__ __volatile__("rdtscp" : "=a" (lo), "=d" (hi), "=c" (pro));


//targeted code in user space uses sched_setaffinity() to assign itself to a certain logical processor

  int cpu = 3;                 // logical processor number
  cpu_set_t mask;
  CPU_ZERO(&mask);
  CPU_SET(cpu, &mask);
  sched_setaffinity(0, sizeof(mask), &mask);

 

McCalpinJohn
Honored Contributor III

When you pin a user process to a logical processor, that does not prevent the OS from also running on that logical processor -- it simply prevents the user process from being run anywhere else when the kernel interrupts it.   When the kernel sends an interprocessor interrupt to the target core, it temporarily displaces the target task, and the interrupt handler is able to read the performance counters locally, then when the IPI is finished the scheduler will restart the target code on that logical processor.

This should work as long as the context-switching code does not include any fiddling with the performance counter registers.  If the context-switching code does save the target process's performance counter values as part of the context switch, then you don't need to read them again -- you just need to find the kernel structure where the context-switching code has saved them.  Some context-switching implementations save the current process's counter values and fill the PMC registers with something else (possibly zeros).  In this case you must find the values in the kernel data structure for that process, since once the interrupt handler starts, the counters no longer have useful information.  In other implementations the context-switch code reads the counter values and "freezes" the counters, but does not modify the count values.  In this case you can either look for the values in the kernel data structure for the process, or re-read the counter values.

I have seen lots of implementations of this infrastructure over the years in IRIX, AIX, and Linux, but the current "perf events" infrastructure in Linux has more layers than my brain is able to follow, so I don't really understand how it is implemented.  The source is all available -- I use the kernel source browser at http://lxr.free-electrons.com/source/kernel/ when I need to look at how things change across many kernel generations, but I prefer to work with a local copy of the Centos 7.2 kernel source that we use on our newer systems.    Understanding the kernel is very labor- and time-intensive, and I often give up before I am able to find the answers I am looking for....  

 

Zirak
Beginner

Thank you very much, this (http://lxr.free-electrons.com/source/kernel/) was really useful 

Travis_D_
New Contributor II

Two quick notes:

One option when you want to use more than the number of counters provided by the hardware is simply to run your code under test multiple times, rotating through a new set of counters each time until you have done them all. It takes longer than multiplexing, but avoids many of the issues multiplexing involves. Of course, this only works well if your load is highly repeatable (which usually means very small and deterministic).
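A rough sketch of that rotation using PAPI presets (the event list and the workload are placeholders, and real code should check every PAPI return value):

#include <stdio.h>
#include <papi.h>

static volatile double sink;
static void foo(void)                       /* deterministic placeholder workload */
{
    double s = 0.0;
    for (int i = 0; i < 1000000; i++)
        s += i * 0.5;
    sink = s;
}

int main(void)
{
    int all_events[] = { PAPI_L1_DCM, PAPI_L2_TCM, PAPI_L3_TCM,
                         PAPI_TLB_DM, PAPI_TOT_CYC, PAPI_TOT_INS };
    int n = sizeof(all_events) / sizeof(all_events[0]);
    int per_run = 2;                        /* events measured per pass */

    PAPI_library_init(PAPI_VER_CURRENT);

    for (int start = 0; start < n; start += per_run) {
        int set = PAPI_NULL;
        long long values[8] = {0};
        int count = (n - start < per_run) ? n - start : per_run;

        PAPI_create_eventset(&set);
        for (int i = 0; i < count; i++)
            PAPI_add_event(set, all_events[start + i]);

        PAPI_start(set);
        foo();                              /* same workload every pass */
        PAPI_stop(set, values);

        for (int i = 0; i < count; i++)
            printf("event %d: %lld\n", all_events[start + i], values[i]);

        PAPI_cleanup_eventset(set);
        PAPI_destroy_eventset(&set);
    }
    return 0;
}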

Since you are interested in measuring counters with as low latency as possible, you might consider libpfc, which was made to do exactly that. You need to load an (included) kernel module which gives user-mode access to read/write the perf counters, but after that you can read the PMU from user space with low latency (the 30-ish cycles John mentioned above). It's open source and liberally licensed. As a tradeoff for the low latency and simplicity, it doesn't have any of the event multiplexing, counter virtualization, etc., that perf_events offers.

Zirak
Beginner

Thank you very much Travis for the interesting notes.

Regarding the first note: I agree, but what if you want to monitor five related events simultaneously? In that case you would take four events together in the first test, take the fifth one separately in a second test, and then combine the data. In some cases you cannot guarantee the same values across multiple tests, so the combination may not give accurate results either.

Regarding the second note: yes, I found this too; it looks really interesting.

Again thank you for your notes.

Zirak
Beginner

Travis 

This libpfc tool is useful if you want to pin to any core in a CPU and have it monitored through the included kernel module.

Travis_D_
New Contributor II

Zirak wrote:

Thank you very much Travis for the interesting notes.

Regarding the first note: I agree, but what if you want to monitor five related events simultaneously? In that case you would take four events together in the first test, take the fifth one separately in a second test, and then combine the data. In some cases you cannot guarantee the same values across multiple tests, so the combination may not give accurate results either.

Yes, as I mentioned, your load has to be deterministic (i.e., if it has some random behavior, make sure you seed the generators with the same seed, etc.) for this to work. Still, this is one of the two best approaches, and both have their downsides: multiplexing may fail dramatically if the behavior of the code under test changes, in which case it may "sample" that behavior inaccurately; repeated runs take longer and may end up combining data from runs that had different behavior. A nice aspect of the latter approach, however, is that you can measure this: you can run the same counters on your application several times and observe their stability, so you know whether the approach is appropriate. With multiplexing you mostly have to hope.
