The accuracy of the performance counter statistics - Page 2

Xin_X_1 · ‎08-07-2013

Hi ,

I am trying out the Intel Performance Counter Monitor (PCM) tool. I reuse some of its code in a kernel module that reads performance counter data. I basically follow the procedure in PCM::program() to set up the on-core counters, and then use rdmsr/wrmsr to read and write the performance counters. I found that the collected data are not accurate when the time between two reads is short. For example, here is my procedure:

/* routines to start the counter of # of branch instructions, mimic PCM:program() code*/

/* routines to read the counter, using rdmsr and wrmsr*/

for ( i =0; i < 1000; ++i) arr[i] = 1;

/* routines to read the counter again, using rdmsr and wrmsr*/

The number of branch instructions should be 1000, but the difference (after - before) consistently comes out at ~6500. I am aware that rdmsr has some latency, probably 100+ cycles, but 5500 extra branch instructions seems far too many for 100+ cycles. I am not sure whether this is caused by my setup, or whether performance counters should not be used this way. Can someone give me some suggestions? Thanks.

Sanjeev_D_ · ‎11-05-2015

Thank you for the kind reply.

I followed the advice above, but I got another error.

   __asm {
       mov eax, 0x80000000 // bit 32 is set
       xor edx, edx // edx = 0
       mov ecx, 0x38F // IA32_PERF_GLOBAL_CTRL   msr
       wrmsr

       mov eax, 0x00000001 // only bit 0 is set, as we count in kernel space
       xor edx, edx
       mov ecx, 0x38D // IA32_FIXED_CTR_CTRL  msr
       wrmsr

       mov ecx, 0x309
       rdmsr
       mov lowvalue, eax
       mov highvalue, edx
   }

I got the following error with the fault at wrmsr.

ExceptionCode: c0000005 (Access violation)

FAULTING_SOURCE_CODE:
48:    __asm {
49:        mov eax, 0x80000000
50:        xor edx, edx
51:        mov ecx, 0x38F
> 52:        wrmsr
53:
54:        mov eax, 0x00000001
55:        xor edx, edx
56:        mov ecx, 0x38D
57:        wrmsr

McCalpinJohn · ‎11-05-2015

Your code is trying to set bit 31, not bit 32. (Bit numbers start at zero, not one.) From the discussion in Chapter 18 of Volume 3 of the Intel Architectures SW Developer's Manual, only bits 0, 1, 2 and 32, 33, 34 are writable. The rest are reserved, and the hardware does actually track which bits are writable, so a WRMSR that sets a reserved bit faults.
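A minimal sketch of the arithmetic, in case it helps: WRMSR takes the upper 32 bits of the value in EDX and the lower 32 bits in EAX, so bit 32 ends up as bit 0 of EDX. The helper names `split_msr_value` and `global_ctrl_fixed_enable` are hypothetical, and the limit of three fixed-function counters is an assumption taken from the writable bits 32, 33, 34 mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the two register halves that WRMSR consumes. */
typedef struct {
    uint32_t eax;  /* bits 31:0 of the MSR value  */
    uint32_t edx;  /* bits 63:32 of the MSR value */
} msr_halves;

/* Split a 64-bit MSR value into its EDX:EAX halves. */
static msr_halves split_msr_value(uint64_t value)
{
    msr_halves h;
    h.eax = (uint32_t)(value & 0xFFFFFFFFu);
    h.edx = (uint32_t)(value >> 32);
    return h;
}

/* Enable bit for fixed-function counter n in IA32_PERF_GLOBAL_CTRL (0x38F).
   Assumes three fixed counters, controlled by bits 32, 33 and 34. */
static uint64_t global_ctrl_fixed_enable(unsigned n)
{
    assert(n < 3);
    return 1ULL << (32 + n);
}
```

With this, `split_msr_value(global_ctrl_fixed_enable(0))` gives eax = 0 and edx = 1, whereas the value 0x80000000 in EAX with EDX = 0 sets bit 31.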

Sanjeev_D_ · ‎01-05-2016

Dear John McCalpin,

Thank you for the reply. I have two queries.

1) I tried to configure the general performance counter to generate PMI interrupt after every N retired instructions.

For N=1001, I write -1000 to MSR 0xc1. After 1001 retired instructions, the PMI should occur due to the counter overflow.
However, my implementation shows that the PMI is generated only once. Could you please let me know whether my configuration is correct or whether I am missing something? Do I have to write -1000 to MSR 0xc1 again while handling the interrupt?

Following is my setup.

__asm {

   //IA32_PERF_GLOBAL_OVF_CTRL MSR
       xor edx, edx
       mov eax, 0x00000001
       mov ecx, 0x390
       wrmsr

   //IA32_PERF_GLOBAL_CTRL MSR address   0x38F
       xor edx, edx
       mov eax, 0x00000001
       mov ecx, 0x38F
       wrmsr

   //set -1000 as a overflow counter
       mov eax, -1000
       mov ecx, 0xc1
       wrmsr

       xor edx, edx
       mov eax, 0x005100C0
       mov ecx, 0x186
       wrmsr

       }

2) How can I configure the fixed performance counter to generate a PMI after N retired instructions?

Thank you in advance.

McCalpinJohn · ‎01-05-2016

I have never worked on the interrupt handlers for PMIs, but I do believe that they normally reset the counter to (MAXVAL - trip_count) before returning to the user code. Otherwise you would have to wait the full 2^48 increments before the next overflow. For retired instructions this will probably take longer than you are interested in waiting.
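A rough sketch of that reset value, under stated assumptions: the name `pmc_reload_value` is hypothetical, and the 48-bit width is taken from the 2^48 figure above (the actual width is reported by CPUID leaf 0x0A). This version uses 2^width - trip_count, which wraps to zero on exactly the trip_count-th increment; reading MAXVAL as 2^width - 1 would give one more increment before the wrap.

```c
#include <assert.h>
#include <stdint.h>

/* Value to load into a counter of the given bit width so that it wraps
   to zero (overflows) after trip_count further increments.
   width = 48 is an assumption for the processors discussed here. */
static uint64_t pmc_reload_value(uint64_t trip_count, unsigned width)
{
    assert(width > 0 && width < 64);
    uint64_t modulus = 1ULL << width;          /* number of counter states */
    assert(trip_count > 0 && trip_count < modulus);
    return modulus - trip_count;
}
```

For example, `pmc_reload_value(1000, 48)` is 0xFFFFFFFFFC18, which is -1000 truncated to 48 bits. A handler would presumably write this value back to the counter each time before returning.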

Many Intel processors have limitations in what you are allowed to write to the programmable counter MSRs, so newer processors provide a "full-width" alias for each of these. E.g. for Counter 0, the counter that you normally read is MSR 0xC1 (IA32_PMC0), but if you want to write more than the lower 32 bits you need to write to MSR 0x4C1 (IA32_A_PMC0). This is described in Section 18.2.5 of Volume 3 of the SW Developer's Guide.
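To illustrate when the alias matters, here is a small sketch. It assumes the documented behavior that a write to the legacy counter MSR takes the low 32 bits and sign-extends bit 31 into the upper counter bits; the function names and the 48-bit width used in the examples are hypothetical choices for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Counter contents after a legacy (non-alias) write of low32, assuming
   the bits above bit 31 are filled by sign-extending bit 31.
   width is the counter width in bits, assumed to be between 33 and 63. */
static uint64_t legacy_write_result(uint32_t low32, unsigned width)
{
    uint64_t mask = (1ULL << width) - 1;
    uint64_t extended = low32;
    if (low32 & 0x80000000u) {
        extended |= 0xFFFFFFFF00000000ULL;  /* replicate bit 31 upward */
    }
    return extended & mask;
}

/* True when the desired counter value cannot be produced by a legacy
   write, so the full-width alias would be needed. */
static bool needs_full_width_write(uint64_t desired, unsigned width)
{
    return legacy_write_result((uint32_t)desired, width) != desired;
}
```

Under these assumptions a small negative preload such as -1000 (0xFFFFFFFFFC18 in 48 bits) can be written through the legacy MSR, while a value like 0xFFFE00000000 (a trip count of 2^33) cannot.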

To use one of the fixed function performance counters the procedure is almost identical. You write (MAXVAL - trip_count) to the IA32_FIXED_CTR0 counter MSR and set up the IA32_FIXED_CTR_CTRL MSR to enable overflows on that counter.
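As a sketch of the control value, assuming the architectural layout of IA32_FIXED_CTR_CTRL (MSR 0x38D) in which each fixed counter owns a 4-bit field (bit 0 = count in ring 0, bit 1 = count in user mode, bit 2 = AnyThread, bit 3 = PMI on overflow). The enum and function names are hypothetical, and the limit of three fixed counters is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Per-counter flag bits inside one 4-bit field of IA32_FIXED_CTR_CTRL. */
enum {
    FIXED_EN_OS  = 1u << 0,  /* count while in ring 0            */
    FIXED_EN_USR = 1u << 1,  /* count while in rings 1 through 3 */
    FIXED_ANY    = 1u << 2,  /* AnyThread                        */
    FIXED_PMI    = 1u << 3   /* raise a PMI when it overflows    */
};

/* Build the IA32_FIXED_CTR_CTRL bits for fixed counter n.
   Assumes three fixed counters (n = 0, 1, 2). */
static uint64_t fixed_ctr_ctrl_field(unsigned n, unsigned flags)
{
    assert(n < 3);
    assert((flags & ~0xFu) == 0);
    return (uint64_t)flags << (4 * n);
}
```

For fixed counter 0 counting in kernel mode, `fixed_ctr_ctrl_field(0, FIXED_EN_OS)` is 0x1, and adding `FIXED_PMI` gives 0x9. The counter still has to be enabled in IA32_PERF_GLOBAL_CTRL as well.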

Zirak · ‎02-23-2017

Hello Patrick,

I would really appreciate it if you could provide sample code showing how to serialize the MSR instructions (RDMSR and WRMSR). I have found an article (How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures) in which the author explains how to use serializing instructions (CPUID, RDTSC and RDTSCP) to read more accurate cycle counts for long loops. Can this be applied to MSR accesses? However, the mfence and lfence instructions have also been used to reduce disturbances or noise. I am not sure which one gives more accurate results.

Patrick Fay (Intel) wrote:

Hello Illyapolak,

If one is really worried about miscounts resulting from out-of-order instruction flow, one can put a serializing instruction before the rdmsr (or rdpmc). Serializing instructions include cpuid and rdtscp. These instructions will wait until all other instructions have finished and then they will run. So, you will see lots of cycles wasted as you flush the pipeline but you eliminate the out-of-order worries. I've never really run into a situation where I needed to worry about it anyway.

Pat

McCalpinJohn · ‎02-23-2017

As I described at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/595214#comment-1898483, "serializing" is a complex subject, and requires that you state your requirements with a great deal of precision. In many cases the type of serialization that you think you want is simply not possible without huge (many hundreds of cycles) overheads.

Even when the definitions look fairly precise, further investigation often shows that there are cases that are not covered. For example, the RDTSCP instruction is not allowed to "execute" until all prior instructions in program order have "executed". One problem is that "execution" is not an instantaneous event. Every instruction is pipelined to some degree, and instructions that access memory are in the "executing" state for anywhere between ~4 cycles and >1000 cycles. A more precise definition would require distinguishing between the times that an instruction "begins execution" and "completes execution". Even this may not be precise enough, since the results of "completing execution" become visible at different times to different functional units, depending on register bypass and/or cache bypass implementations. Does the definition say that the RDTSCP instruction cannot "begin" execution (which takes ~36 cycles) until all prior instructions have "completed" execution? Or does the definition say that the RDTSCP instruction cannot begin execution until such a point that the TSC value returned is guaranteed to correspond to a time no earlier than the latest cycle in which any prior instruction "completed execution"? The definition of "execution" and the concepts of "before" and "after" become even fuzzier when you consider that the RDTSCP instruction is microcoded, executing about 22 uops, with a minimum repeat latency of 30 cycles or longer.

Fortunately for your current use case, the WRMSR instruction is listed as a serializing instruction in Section 8.3 of Volume 3 of the Intel Architectures Software Developer's Manual. RDMSR is not a serializing instruction, but that may or may not matter, depending on what your specific requirements are. I should add that, like CPUID, WRMSR is a very slow instruction. I don't have recent timings, but if I recall correctly this was taking 100-200 cycles on a Xeon E3 (Sandy Bridge) processor. Because the MSR interface is an abstraction to a communication network that spans the entire chip, it seems likely that the latency of MSR reads and writes will vary depending on the core making the request and the physical location of the register being accessed. For MSRs with thread scope or core scope the RDMSR and WRMSR instructions can only access the *local* copies. If you want to read/write MSRs associated with a different logical processor, you will need to set up an interprocessor interrupt, which probably has a cost of a few thousand cycles.