The accuracy of the performance counter statistics

Xin_X_1

Hi ,

I am experimenting with the Intel Performance Counter Monitor (PCM) tool. I reused some of its code to write a kernel module that reads performance counter data. I basically follow the procedure in PCM::program() to set up the on-core counters, and then use rdmsr/wrmsr to read and write the performance counters. I found that the collected data are not accurate when the time between two reads is small. For example, here is my procedure:

/* routines to start the counter for the number of branch instructions, mimicking the PCM::program() code */

/* routines to read the counter, using rdmsr and wrmsr */

for (i = 0; i < 1000; ++i) arr[i] = 1;

/* routines to read the counter again, using rdmsr and wrmsr */

The number of branch instructions should be 1000, but the reading (after - before) consistently shows about 6500. I am aware that rdmsr has some latency, probably 100+ cycles, but an extra 5500 branch instructions seems too large for 100+ cycles. I am not sure whether this is caused by my setup, or whether performance counters should not be used in this way. Can someone give me some suggestions? Thanks.
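One way to check how much of the ~6500 comes from the measurement itself is to take pairs of back-to-back readings with nothing in between and subtract the smallest difference from later measurements. The sketch below only shows that arithmetic; the helper names `baseline_from_samples` and `net_count` are hypothetical, and the deltas are assumed to come from whatever rdmsr-based read routine the module already has.

```c
#include <stddef.h>
#include <stdint.h>

/* Smallest (after - before) difference among readings taken with an empty
   region in between: an estimate of the cost of measuring "nothing". */
static inline uint64_t baseline_from_samples(const uint64_t *deltas, size_t n)
{
    uint64_t best = UINT64_MAX;
    for (size_t i = 0; i < n; ++i) {
        if (deltas[i] < best)
            best = deltas[i];
    }
    return best;
}

/* Events attributed to the measured region once the baseline is removed.
   Clamped at zero so that noise cannot produce a wrapped-around count. */
static inline uint64_t net_count(uint64_t before, uint64_t after, uint64_t baseline)
{
    uint64_t raw = after - before;
    return raw > baseline ? raw - baseline : 0;
}
```

If the empty-region baseline turns out to be close to 5500, the extra branches are being executed by the read routines themselves rather than by the loop.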

26 Replies
Sanjeev_D_

Thank you for the kind reply.

I followed the suggestion above, but now I get a different error.

    __asm {
        mov eax, 0x80000000       // bit 32 is set
        xor edx, edx              // edx = 0
        mov ecx, 0x38F            // IA32_PERF_GLOBAL_CTRL MSR
        wrmsr

        mov eax, 0x00000001       // only bit 0 is set, as we count in kernel space
        xor edx, edx
        mov ecx, 0x38D            // IA32_FIXED_CTR_CTRL MSR
        wrmsr

        mov ecx, 0x309            // IA32_FIXED_CTR0 MSR (instructions retired)
        rdmsr
        mov lowvalue, eax
        mov highvalue, edx
    }

I got the following error, with the fault at wrmsr:

  ExceptionCode: c0000005 (Access violation)

 FAULTING_SOURCE_CODE:
    48:     __asm {    
    49:         mov eax, 0x80000000                  
    50:         xor edx, edx                       
    51:         mov ecx, 0x38F                     
>   52:         wrmsr
    53:         
    54:         mov eax, 0x00000001                   
    55:         xor edx, edx                       
    56:         mov ecx, 0x38D                     
    57:         wrmsr

McCalpinJohn

Your code is trying to set bit 31, not bit 32.  (Bit addresses start with zero, not one).   From the discussion in Chapter 18 of Volume 3 of the Intel Architectures SW Developer's Manual, only bits 0,1,2 and 32, 33, 34 are writable, the rest are reserved and the hardware does actually track which bits are writable.
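To make the bit-31 versus bit-32 point concrete: WRMSR takes the low 32 bits of the value in EAX and the high 32 bits in EDX, so bit 32 lives in EDX, not EAX. A minimal sketch of the split (the helper names are hypothetical):

```c
#include <stdint.h>

/* Bit n of a 64-bit MSR value, with bits numbered from zero. */
static inline uint64_t msr_bit(unsigned n)
{
    return (uint64_t)1 << n;
}

/* Halves as WRMSR expects them: EAX holds bits 31:0, EDX holds bits 63:32. */
static inline uint32_t msr_eax(uint64_t value)
{
    return (uint32_t)value;
}

static inline uint32_t msr_edx(uint64_t value)
{
    return (uint32_t)(value >> 32);
}
```

With these, `msr_bit(32)` splits into EAX = 0 and EDX = 1, whereas the posted combination of EAX = 0x80000000 and EDX = 0 corresponds to `msr_bit(31)`, which is one of the reserved bits.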

Sanjeev_D_

Dear John McCalpin, 

Thank you for the reply. I have two questions.

1) I tried to configure a general-purpose performance counter to generate a PMI after every N retired instructions.

For N=1001, I write -1000 to MSR 0xc1. After 1001 retired instructions, the PMI should occur due to overflow.
However, in my implementation the PMI is generated only once. Could you please let me know whether my configuration is correct or whether I am missing something? Do I have to write -1000 to MSR 0xc1 again while handling the interrupt?

Following is my setup. 

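__asm {
    // IA32_PERF_GLOBAL_OVF_CTRL (MSR 0x390): clear the overflow flag for PMC0
    xor edx, edx
    mov eax, 0x00000001
    mov ecx, 0x390
    wrmsr

    // IA32_PERF_GLOBAL_CTRL (MSR 0x38F): enable PMC0
    xor edx, edx
    mov eax, 0x00000001
    mov ecx, 0x38F
    wrmsr

    // IA32_PMC0 (MSR 0xC1): set -1000 as the overflow counter
    // (edx is still 0 from the previous block)
    mov eax, -1000
    mov ecx, 0xc1
    wrmsr

    // IA32_PERFEVTSEL0 (MSR 0x186): event 0xC0, with the USR, INT and EN bits set
    xor edx, edx
    mov eax, 0x005100C0
    mov ecx, 0x186
    wrmsr
}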


2) How can I configure a fixed-function performance counter to generate a PMI after N retired instructions?

Thank you in advance.
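For readers following along, the value 0x005100C0 written to MSR 0x186 in the setup above can be decomposed into its fields. The sketch below assumes the architectural IA32_PERFEVTSELx layout from the Intel SDM (event select in bits 7:0, unit mask in bits 15:8, USR at bit 16, OS at bit 17, INT at bit 20, EN at bit 22); the macro and function names are hypothetical.

```c
#include <stdint.h>

/* Flag bits in IA32_PERFEVTSELx (architectural layout, assumed here). */
#define EVTSEL_USR (1u << 16)  /* count while running above ring 0 */
#define EVTSEL_OS  (1u << 17)  /* count while running in ring 0 */
#define EVTSEL_INT (1u << 20)  /* raise a PMI when the counter overflows */
#define EVTSEL_EN  (1u << 22)  /* enable the counter */

/* Assemble an event-select value from its event code, unit mask and flags. */
static inline uint32_t evtsel(uint8_t event, uint8_t umask, uint32_t flags)
{
    return (uint32_t)event | ((uint32_t)umask << 8) | flags;
}
```

`evtsel(0xC0, 0x00, EVTSEL_USR | EVTSEL_INT | EVTSEL_EN)` reproduces 0x005100C0. Note that the OS bit is clear in that value, so under this layout the counter would only count instructions retired above ring 0, which may or may not be what is intended for code running in a driver.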

McCalpinJohn

I have never worked on the interrupt handlers for PMIs, but I do believe that they normally reset the counter to (MAXVAL - trip_count) before returning to the user code. Otherwise you would have to wait the full 2^48 increments before the next overflow. For retired instructions this will probably take longer than you are interested in waiting.
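As a rough illustration of the reload arithmetic, here is a sketch that computes the value to write back in the handler. It assumes a 48-bit counter (the real width is reported by CPUID leaf 0x0A) and treats MAXVAL as 2^width, so that the overflow occurs on exactly the trip_count-th increment; the function name `pmc_reload` is hypothetical.

```c
#include <stdint.h>

/* Value to load so that a counter of `width` bits wraps to zero after
   exactly `trip_count` increments, i.e. (2^width - trip_count). */
static inline uint64_t pmc_reload(uint64_t trip_count, unsigned width)
{
    uint64_t mask = (width >= 64) ? UINT64_MAX
                                  : (((uint64_t)1 << width) - 1);
    /* Unsigned negation is two's complement; the mask trims it to the counter width. */
    return (0 - trip_count) & mask;
}
```

For a trip count of 1000 on a 48-bit counter this gives 0xFFFFFFFFFC18, which is the 48-bit representation of -1000.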

Many Intel processors have limitations in what you are allowed to write to the programmable counter MSRs, so newer processors provide a "full-width" alias for each of these.   E.g. for Counter 0, the counter that you normally read is MSR 0xC1 (IA32_PMC0), but if you want to write more than the lower 32 bits you need to write to MSR 0x4C1 (IA32_A_PMC0).   This is described in Section 18.2.5 of Volume 3 of the SW Developer's Manual.
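My understanding of the limitation, from the SDM's description of the legacy counter MSRs, is that a write to IA32_PMCx takes only the low 32 bits from EAX and fills the upper bits with copies of bit 31. The sketch below models that behaviour in plain C; it is an assumption-laden model for reasoning about values, not something that touches hardware, and `legacy_pmc_write` is a hypothetical name.

```c
#include <stdint.h>

/* Model of what a counter of `width` bits holds after a legacy write to
   IA32_PMCx: bits 31:0 come from EAX, higher bits are copies of bit 31. */
static inline uint64_t legacy_pmc_write(uint32_t eax, unsigned width)
{
    /* The uint32_t -> int32_t conversion is implementation-defined for values
       above INT32_MAX; mainstream compilers treat it as two's complement. */
    uint64_t extended = (uint64_t)(int64_t)(int32_t)eax;
    uint64_t mask = (width >= 64) ? UINT64_MAX
                                  : (((uint64_t)1 << width) - 1);
    return extended & mask;
}
```

Under this model, writing -1000 in EAX yields 0xFFFFFFFFFC18 in a 48-bit counter, which is why small negative preloads work through the legacy MSR, while trip counts of 2^31 or more would need the full-width alias.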

To use one of the fixed function performance counters the procedure is almost identical.  You write (MAXVAL - trip_count) to the IA32_FIXED_CTR0 counter MSR and set up the IA32_FIXED_CTR_CTRL MSR to enable overflows on that counter.
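A sketch of the control values involved, assuming the layout documented in the SDM: IA32_FIXED_CTR_CTRL (MSR 0x38D) has a 4-bit field per fixed counter (bit 0 enables counting in ring 0, bit 1 enables counting above ring 0, bit 3 enables the PMI on overflow), and IA32_PERF_GLOBAL_CTRL (MSR 0x38F) enables fixed counter i with bit 32 + i. The macro and function names are hypothetical.

```c
#include <stdint.h>

/* Flags within one fixed counter's 4-bit field of IA32_FIXED_CTR_CTRL. */
#define FIXED_OS   0x1u  /* count in ring 0 */
#define FIXED_USR  0x2u  /* count in rings above 0 */
#define FIXED_PMI  0x8u  /* raise a PMI when the counter overflows */

/* Control bits for fixed counter `idx`, shifted into its 4-bit field. */
static inline uint64_t fixed_ctr_ctrl(unsigned idx, uint32_t flags)
{
    return (uint64_t)(flags & 0xFu) << (4 * idx);
}

/* Enable bit for fixed counter `idx` in IA32_PERF_GLOBAL_CTRL. */
static inline uint64_t global_ctrl_fixed(unsigned idx)
{
    return (uint64_t)1 << (32 + idx);
}
```

For fixed counter 0 (instructions retired) counting in ring 0 with PMI enabled, `fixed_ctr_ctrl(0, FIXED_OS | FIXED_PMI)` is 0x9 and `global_ctrl_fixed(0)` is 0x100000000, i.e. EAX = 0 and EDX = 1 for the WRMSR to 0x38F.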

Zirak

Hello Patrick,

I would really appreciate it if you could provide sample code showing how to serialize the MSR instructions (RDMSR and WRMSR). I found an article, "How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures", in which the author explains how to use CPUID, RDTSC and RDTSCP to measure the cycle counts of long loops more accurately. Can this be applied to the MSR instructions? The mfence and lfence instructions have also been used to reduce disturbances or noise, but I am not sure which approach gives accurate results.

Patrick Fay (Intel) wrote:

Hello Illyapolak,

If one is really worried about miscounts resulting from out-of-order instruction flow, one can put a serializing instruction before the rdmsr (or rdpmc). Serializing instructions include cpuid and rdtscp. These instructions will wait until all other instructions have finished and then they will run. So, you will see lots of cycles wasted as you flush the pipeline but you eliminate the out-of-order worries. I've never really run into a situation where I needed to worry about it anyway.

Pat

McCalpinJohn

As I described at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/595214#comment-1898483, "serializing" is a complex subject, and requires that you state your requirements with a great deal of precision.   In many cases the type of serialization that you think you want is simply not possible without huge (many hundreds of cycles) overheads.

Even when the definitions look fairly precise, further investigation often shows that there are cases that are not covered.  For example, the RDTSCP instruction is not allowed to "execute" until all prior instructions in program order have "executed".  One problem is that "execution" is not an instantaneous event.  Every instruction is pipelined to some degree, and instructions that access memory are in the "executing" state for anywhere between ~4 cycles and >1000 cycles.  A more precise definition would require distinguishing between the times at which an instruction "begins execution" and "completes execution".   Even this may not be precise enough, since the results of "completing execution" become visible at different times to different functional units, depending on register bypass and/or cache bypass implementations.   Does the definition say that the RDTSCP instruction cannot "begin" execution (which takes ~36 cycles) until all prior instructions have "completed" execution?  Or does the definition say that the RDTSCP instruction cannot begin execution until such a point that the TSC value returned is guaranteed to point to a time no earlier than the latest cycle in which any prior instruction "completed execution"?   The definition of "execution" and the concepts of "before" and "after" become even fuzzier when you consider that the RDTSCP instruction is microcoded, executing about 22 uops, with a minimum repeat latency of 30 cycles or longer.
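To get a feel for the repeat latency mentioned above, one can time back-to-back RDTSCP instructions from user space. This is only a rough sketch: it assumes GCC or Clang on x86 (for the `__rdtscp` intrinsic from `<x86intrin.h>`), the helper name `min_rdtscp_delta` is hypothetical, and the result is in TSC ticks, which are not necessarily core cycles.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtscp (GCC/Clang, x86 only) */

/* Smallest TSC difference observed across `trials` pairs of back-to-back
   RDTSCP instructions. Taking the minimum discards samples that were
   inflated by interrupts or other disturbances. */
static inline uint64_t min_rdtscp_delta(int trials)
{
    unsigned aux;  /* receives IA32_TSC_AUX; not used here */
    uint64_t best = UINT64_MAX;
    for (int t = 0; t < trials; ++t) {
        uint64_t t0 = __rdtscp(&aux);
        uint64_t t1 = __rdtscp(&aux);
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Calling `min_rdtscp_delta(1000)` and printing the result gives a lower bound on the overhead that any RDTSCP-bracketed measurement on that machine will include.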

Fortunately for your current use case, the WRMSR instruction is listed as a serializing instruction in Section 8.3 of Volume 3 of the Intel Architectures Software Developer's Manual.  RDMSR is not a serializing instruction, but that may or may not matter, depending on what your specific requirements are.   I should add that, like CPUID, WRMSR is a very slow instruction.   I don't have recent timings, but if I recall correctly this was taking 100-200 cycles on a Xeon E3 (Sandy Bridge) processor.  Because the MSR interface is an abstraction to a communication network that spans the entire chip, it seems likely that the latency of MSR reads and writes will vary depending on the core making the request and the physical location of the register being accessed. For MSRs with thread scope or core scope the RDMSR and WRMSR instructions can only access the *local* copies.  If you want to read/write MSRs associated with a different logical processor, you will need to set up an interprocessor interrupt, which probably has a cost of a few thousand cycles.
