3,265 Views

How to read performance counters by rdpmc instruction?


Modern CPUs have quite a lot of performance counters. How can I read them? I know of many performance monitoring and profiling programs and libraries (PAPI, VTune, Linux perf, etc.), but all of these methods require additional computation time (intrusiveness).

Also, modern Intel CPUs support the rdpmc instruction, and Linux currently supports executing it at user level.

I would like to understand how to use GCC intrinsics to read CPU cycles and instructions executed, in order to profile a function in my C code.

I understand that I have to pin program execution to a particular CPU core. Let's assume the CPU is Haswell.

I would appreciate a small example of rdpmc usage.

For example, the code might look like this:

#include <stdio.h>

extern void foo(void);   /* the function being profiled */

long long get_cycles()
{
    unsigned int a = 0, d = 0;
    int ecx = (1 << 30) + 1;   /* Which counter does this select? */
    __asm__ __volatile__("rdpmc" : "=a"(a), "=d"(d) : "c"(ecx));
    return ((long long)a) | (((long long)d) << 32);
}

int main(int argc, char *argv[])
{
    long long start, finish;
    int i;

    start = get_cycles();
    for (i = 0; i < 1000; i++) {
        foo();
    }
    finish = get_cycles();
    printf("Average cycles per iteration: %.1f\n",
           ((double)(finish - start)) / 1000.0);
    return 0;
}

What must the ecx variable in get_cycles() contain to select CPU cycles, and what must it contain to select instructions executed?

 

Thank you

31 Replies
Black Belt

Applications running on multiple cores are going to show significant variability on any system, and it is not always possible to understand why specific cases ran slowly.  

I recommend getting a thorough understanding of the statistics before deciding whether the slow result is worth paying attention to.   Good statistics require at least several hundred measurements. 

Your two "fast" MPI barrier measurements are fairly close to the lower limit of what is theoretically possible on a 2-socket system.  The "slow" result is not very slow -- if any of the cores involved in the barrier take a timer interrupt, it is likely to take at least 2000 cycles.   This can be expected to happen once every millisecond with typical OS configurations.

 

Beginner

>> Your two "fast" MPI barrier measurements are fairly close to the lower limit of what is theoretically possible on a 2-socket system.  The "slow" result is not very slow -- if any of the cores involved in the barrier take a timer interrupt, it is likely to take at least 2000 cycles.   This can be expected to happen once every millisecond with typical OS configurations.

The results I provided before are for a 1-process MPI run. So, assuming no process migration (I am binding the process to a core), maybe the "slow" result is due to a timer interrupt on that single core. Yes, I did run the experiment for 5000 iterations, and the timings I provided before were just a sample of 3 iterations where "slow" timing occurred. I measured the number of instructions and the cycles (reference cycles) for each iteration, and the statistics are below:

 

1-process MPI_Barrier

    Freq.   Instructions
      23        197
       7        198
     969        202
       1        203

    Freq.   Cycles
       3        273
     116        286
     557        299
     147        312
      53        325
      50        338
      16        351
       9        364
      12        377
      10        390
       5        403
       4        416
       1        429
       3        442
       4        455
       2        468
       1        494
       2        533
       1        611
       1        689
       1        845
       1        910
       1       2288

 

If a timer interrupt is indeed the reason, it would be nice to have a way to verify this. Thank you! Even if we assume that the 2288-cycle iteration is due to a timer interrupt, it is not clear why there is such variation.

Black Belt

I am not sure what an MPI_Barrier() call is supposed to do if there is only one MPI task.

The "number of instructions" range is very tight -- that is good news.

The "number of cycles" range is also fairly tight -- 92.3% of the results are in the range of 286 to 338 cycles, and >80% of the results are within 5% of 300 cycles, which is not bad at all. 

The sum of all of the times is about 313,000 cycles, which would be about 0.125 milliseconds with a 2.5 GHz TSC clock.   For this aggregate execution time you would not expect any timer interrupts with the standard 1 millisecond timer period, but seeing one is not particularly surprising.  Seeing only 1 slow (>2000 cycle) iteration out of the 1000 results shown is consistent with a random system interrupt.

The small (26 of 1000) number of results in the range of 500-1000 cycles are a bit harder to understand -- they look too fast to include an OS interrupt, but this can be very difficult to analyze in detail.   They are not a major contributor to the average latency -- excluding all of the results that took 400 cycles or more only reduces the average time by 2.2% (from 313 cycles to 306 cycles).   

 

Beginner

The run was on an Intel KNL core, which has a 1.3 GHz TSC clock.

The small set of 26 results out of 1000 is what puzzles me most; I am interested in understanding what is causing them. A similar variability was observed even with a simple matrix-matrix multiplication code.

 

Black Belt

If the slow iterations are repeatable (in a statistical sense), then performance counters are the primary tool to work with.  

I would start by trying to rule out OS interference, then I would look for differences in cache behavior between the "typical" and "slow" cases.  It is not at all clear that KNL has enough performance counter events in the core+L1+L2 to be useful, and the limit of 2 counters per logical processor will make it very hard to be sure that the counts you are getting from different iterations are slow for the same reason(s).

I considered trying to use 4 threads bound to the 4 logical processors of one physical core to get access to all 8 counters, but Volume 1 of the Xeon Phi x200 Performance Monitoring Reference Manual (332972-001, section 1.2.1) says that "AnyThread" support is limited to the three architectural performance monitoring events provided by the fixed-function counters, so this approach is (apparently) useless for cache monitoring.

I have gotten some good results using the core's offcore response counters, but am still a bit confused about their scope -- some of the results I have seen are consistent with the counts being per-core, but most of the results I have seen are consistent with the counts being per-tile.

For data and/or coherence traffic outside the tile there are tons of counters, but I have not yet had a chance to evaluate them in any systematic fashion.

Beginner

I am having difficulty running RDPMC. Could anyone provide code showing how to enable the CR4.PCE bit so that RDPMC can be used?

Beginner

I would appreciate any sample code showing how to serialize RDPMC.

 

Black Belt

Setting the CR4.PCE bit requires kernel code, which is going to be different for each operating system. In Linux kernels starting with 3.4, this bit is set using the "set_in_cr4()" function by code in $KERNEL/source/arch/x86/kernel/cpu/perf_event.c:

static int
x86_pmu_notifier(struct notifier_block *self, unsigned long action, void *hcpu)
{
    // ... //

    case CPU_STARTING:
        if (x86_pmu.attr_rdpmc)
            set_in_cr4(X86_CR4_PCE);
    // ... //
}

"Serializing" RDPMC can mean a number of different things, depending on whether you want to enforce ordering on all instructions or on a class of instructions (such as memory references).  A common technique to force full serialization is to execute the CPUID instruction before the RDPMC instruction.  (You will need to copy the target PMC number back into %ecx after the CPUID instruction.)  If I recall correctly, the CPUID instruction is the only user-mode instruction that fully serializes a core, but the overhead is quite high -- somewhere in the range of 100 to 300 cycles on most processors.
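A sketch of that CPUID-then-RDPMC pattern in GCC inline assembly (the function names are mine, not from any header; the RDPMC call requires CR4.PCE, otherwise it faults):

```c
#include <stdint.h>

/* CPUID leaf 0 is fully serializing.  It clobbers EAX/EBX/ECX/EDX, so
 * all four registers are declared as outputs; the max basic leaf is
 * returned so the fence can be exercised on its own. */
static inline uint32_t serialize(void)
{
    uint32_t a = 0, b, c, d;
    __asm__ __volatile__("cpuid"
                         : "+a"(a), "=b"(b), "=c"(c), "=d"(d)
                         : : "memory");
    return a;   /* highest basic CPUID leaf supported */
}

/* Read PMC `counter` after a full serialization.  The PMC number is
 * loaded into ECX only after CPUID (via the "c" constraint), since
 * CPUID destroyed the previous ECX contents. */
static inline uint64_t rdpmc_serialized(uint32_t counter)
{
    uint32_t lo, hi;
    serialize();
    __asm__ __volatile__("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
    return ((uint64_t)hi << 32) | lo;
}
```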

If you only want to order the execution of the RDPMC instruction with respect to certain instructions, then you can create a false dependency using the inputs and outputs. For example, if you want to read the performance counter after computing a value (or loading from memory), you can use the result as part of the computation of the value in %ecx. The processor recognizes many idioms for zeroing registers, so you need to make this at least a multi-step process. For example, if you have computed a value (or loaded a value) into %r8, then you can ensure that the RDPMC instruction will not execute until after this value is computed by doing something like:

1. Do whatever you need to do to compute the value in %r8.
2. Add the desired PMC number to the value in %r8 and save the sum in %r9.
3. Subtract the value in %r8 from the value in %r9 and save the result in %ecx.
4. Execute RDPMC.

This only adds about 2 cycles and 2 instructions between the computation of the value in %r8 and the execution of the RDPMC instruction. 

Note that this will not prevent other instructions following the RDPMC instruction from being executed early, either while %r8 is being computed, during the add/subtract cycles, or concurrently with the RDPMC instruction. It may be impossible to prevent this from happening without adding a fully serializing instruction like CPUID after the RDPMC.

Similar tricks can be used to force specific instructions to execute after the RDPMC -- just create a dependency between the output of the RDPMC (either %eax or %edx) and an input of the subsequent instruction(s).  In most cases you will need to perform something like the add/subtract trick to prevent the specific value in the RDPMC output from changing the results of your program.   Not all instructions have inputs (e.g., RDTSC), and it may not be practical to force a dependency on *all* subsequent instructions, but this approach can be a useful start.
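The add/subtract idiom above can be sketched in C with inline assembly (function names are illustrative; the RDPMC itself requires CR4.PCE). Note that a naive C version would be folded away, since the compiler simplifies (value + counter) - value back to a constant; the empty asm hides the intermediate value to keep the dependency alive:

```c
#include <stdint.h>

/* Returns the PMC number, but computed through `value`, so an RDPMC
 * consuming this result cannot issue until `value` is available. */
static inline uint32_t dep_chain(uint64_t value, uint32_t counter)
{
    uint64_t tmp = value + counter;        /* "%r9 = %r8 + PMC number" */
    __asm__ __volatile__("" : "+r"(tmp));  /* hide tmp's provenance    */
    return (uint32_t)(tmp - value);        /* "%ecx = %r9 - %r8"       */
}

/* Read PMC `counter`, ordered after the computation of `value`,
 * without a full serializing instruction. */
static inline uint64_t rdpmc_after(uint64_t value, uint32_t counter)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdpmc"
                         : "=a"(lo), "=d"(hi)
                         : "c"(dep_chain(value, counter)));
    return ((uint64_t)hi << 32) | lo;
}
```

It is worth inspecting the generated assembly to confirm the add/subtract pair survived optimization.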

My comments at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring... and the graph attached to that post may be helpful in understanding some of the issues.  That discussion was about RDTSC and RDTSCP, but many of the same ordering and overlap issues apply to RDPMC.

Beginner

The bug that caused __builtin_ia32_rdpmc (aka __rdpmc() in x86intrin.h) to be treated as a pure function is finally fixed in GCC 6.5, 7.4+, 8.3+, and 9.x.

The nightly build of gcc pre9.0 trunk on Godbolt shows it does work properly now: https://godbolt.org/z/FUgmGe

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87550

Black Belt

I should note that some of my comments above about ordering are incorrect.  Although LFENCE was originally defined as a "load fence", it is now architecturally defined as an execution fence as well, and its overhead is much lower than the traditional approach of using CPUID.

This is discussed in the comments starting at https://software.intel.com/en-us/forums/intel-isa-extensions/topic/785240#comment-1926549, and has also been included in my updated notes on timing short code sections http://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timing-short-code-sections-on-intel-processor...

Beginner

Dear Dr. Bandwidth,

 

I would like to use RDPMC to measure L1 cache misses.

My Coffee Lake machine provides the MEM_LOAD_RETIRED.L1_HIT event, and by reading a counter programmed with this event before and after an access I think I can determine whether the access hits or misses in the L1. I would like to use RDPMC directly in my code, but I don't know how to set the ecx register for the MEM_LOAD_RETIRED.L1_HIT event. Could you help me configure it?

Black Belt

Enabling and programming the core performance counters requires setting values in a number of MSRs.  The infrastructure is described in Chapter 18 of Volume 3 of the Intel Architectures SW Developer's Manual ("SWDM"), while Chapter 19 contains the tables of the specific "events" that each processor model can monitor.

On Linux systems, I use the "rdmsr" and "wrmsr" programs from msr-tools to read and write the MSRs.   This will require root privileges on any sane system.  It requires becoming very familiar with Chapter 18 of Vol3 of the SWDM, as well as exercising perfect control over process and thread placement during the measurement period.

A much better approach for getting started is to use something like the Linux "perf stat" command for whole-program measurements, or (if you need to read the counters from inside the program) the PAPI or LIKWID libraries.