The accuracy of the performance counter statistics

Xin_X_1 · ‎08-07-2013

Hi ,

I am trying out the Intel Performance Counter Monitor (PCM) tool. I reused some of its code to write a kernel module that reads performance counter data. I basically follow the procedure in PCM::program() to set up the on-core counters, and then use rdmsr/wrmsr to read and write the performance counters. I found that the collected data are not accurate when the time between two reads is small. For example, here is my procedure:

/* routines to start the counter of # of branch instructions, mimic PCM:program() code*/

/* routines to read the counter, using rdmsr and wrmsr*/

for (i = 0; i < 1000; ++i) arr[i] = 1;

/* routines to read the counter again, using rdmsr and wrmsr*/

The number of branch instructions should be 1000, but the reading (after - before) consistently shows about 6500. I am aware that rdmsr has some latency, probably 100+ cycles, but 5500 extra branch instructions seems too many for 100+ cycles. I am not sure whether this is caused by my setup, or whether performance counters should not be used this way. Can someone give me some suggestions? Thanks.

Patrick_F_Intel1 · ‎08-07-2013

Hello Xin,

You are running into the overhead of calling the driver. Your user mode (ring 3) code has to do a call to the driver, which calls the kernel (switching from ring 3 to ring 0), does the rdmsr/wrmsr instruction, and returns. And there are probably multiple calls to the driver per call to PCM.

If you truly want to read the MSR with the minimum overhead, you can use the rdpmc instruction, but this is not easy. Usually rdpmc is not enabled for use from ring 3; a bit in CR4, the PCE bit, has to be set. On Linux there is a driver that enables rdpmc (https://github.com/andikleen/simple-pmu). It works on older versions of Linux. I don't know of a Windows driver that enables rdpmc. Even if you enable rdpmc from ring 3, you will only be able to read the core PMU counters (3 fixed and 3-8 variable core counters). You will still have to make a trip into ring 0 to do a wrmsr, or to rdmsr any MSR other than the core PMC counters.

I've used Andi's driver to do very low overhead measurements before. But it is not for the faint of heart.

Hope this helps,

Pat

Xin_X_1 · ‎08-07-2013

Hi Patrick,

Thank you for your quick response. I actually reuse only the code of PCM::program(), which I put in a kernel module, not the code for accessing MSRs. To read or write an MSR, I use assembly code like this

asm volatile ("\trdmsr\n" : "=a" (lo), "=d" (hi) : "c" (msr));

to access the MSR directly in the kernel module. This does not involve ring transitions.

Using rdpmc should be one solution to reduce the overhead. Based on your experience, what is the granularity that rdpmc/rdmsr can achieve? Can they measure 1000 or even 100 instructions/cycles accurately?

Thank you

Xin_X_1 · ‎08-07-2013

I figured out the problem. It was caused by a mistake in my code. Now the readings seem very accurate.

I am wondering where I can find a document that discusses the accuracy of performance counters in general. Can anyone give me some pointers? Thanks.

Bernard · ‎08-07-2013

What kind of accuracy do you mean? I think that only information about the accuracy of the rdtsc instruction is freely available.

Bernard · ‎08-07-2013

Performance counters do not have an option to count events as a function of the instruction pointer. In your case they will simply increment the counter by looking at the uops that constitute branch instructions. Moreover, when there is a high frequency of context switches, your thread will not be the only one measured.

Xin_X_1 · ‎08-07-2013

Hello iliyapolak,

Thank you for your reply. For example, if I have only one instruction in between two rdmsr instructions (set up to count the number of retired instructions), will the difference between the two readings be exactly 1? My test result is not exactly 1, but 3. In fact, this is accurate enough for me. But I am wondering whether someone has documented this more comprehensively, for example for different counters (core counters, uncore counters, etc.) or different instructions (rdpmc, rdmsr, etc.).

I quickly googled it and found this post about the latency of rdtsc. It suggests that there should be at least 1000 cycles between two readings to make the counting accurate. Should I make a similar assumption when using rdmsr?

http://software.intel.com/en-us/forums/topic/305287

Patrick_F_Intel1 · ‎08-07-2013

Hello Xin,

I'm confused. Are you saying you are putting code into the ring0 driver and trying to time it in ring0? Because rdmsr can only be executed in ring0.

The rdpmc and rdmsr instructions take about 100-200 cycles. Yes, the counts are accurate. But you are again sort of confusing me. rdmsr gets the value of the counter (if the MSR that rdmsr is reading is a PMU MSR). So if you programmed clockticks.ref into the counter, read the MSR, ran some code and then reread the MSR, you would get the unhalted clockticks, and the result would reflect the overhead of the rdmsr instruction. But if you programmed instructions.retired and reran your test, you would get the number of instructions plus 1 for the rdmsr.
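One common way to account for the read overhead described above is to measure an empty region a few times, take the smallest delta as a baseline, and subtract it from the real measurement. The following is only a sketch of that bookkeeping: the helper names baseline_overhead and net_count are made up for illustration, the deltas are assumed to have been collected already, and the numbers used with it are purely illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest delta observed when nothing sits between the two counter reads.
 * Hypothetical helper: the deltas are assumed to be collected elsewhere. */
uint64_t baseline_overhead(const uint64_t *empty_deltas, size_t n)
{
    assert(n > 0);
    uint64_t min = empty_deltas[0];
    for (size_t i = 1; i < n; ++i) {
        if (empty_deltas[i] < min)
            min = empty_deltas[i];
    }
    return min;
}

/* Raw delta minus the baseline, clamped at zero so that noise in the
 * baseline never produces a huge unsigned value. */
uint64_t net_count(uint64_t raw_delta, uint64_t baseline)
{
    return raw_delta > baseline ? raw_delta - baseline : 0;
}
```

For a cycle count the baseline is the cost of the read instruction itself; for an instruction count it is the handful of extra instructions between the two reads.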

Does that make sense?

Pat

Bernard · ‎08-07-2013

Hi Xin

What are you trying to count, the number of branches?

Bernard · ‎08-07-2013

I think that during any measurement based on CPU clock cycles, the measured code should run (or be looped) for longer than the total latency of the measuring instructions themselves, for example the rdtsc instructions.

Xin_X_1 · ‎08-07-2013

Hi Patrick,

Yes, I put the code into a kernel module, so all of this code runs in ring 0. Here is what my code looks like:

PCM1 is programmed to count the number of retired instructions (0xc0 for the event number and 0x00 for the umask, according to Table 19-1 in Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3: System Programming Guide).
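For readers following along, here is a minimal sketch of how an event number and umask such as 0xc0/0x00 are typically packed into the value written to an IA32_PERFEVTSELx register. The function name perfevtsel_encode is hypothetical and is not taken from PCM; the bit positions (event select in bits 0-7, umask in bits 8-15, USR in bit 16, OS in bit 17, EN in bit 22) follow the architectural layout in the SDM, and the remaining fields (edge detect, invert, counter mask and so on) are simply left at zero here.

```c
#include <assert.h>
#include <stdint.h>

#define EVTSEL_USR (1ULL << 16) /* count while in ring 3 */
#define EVTSEL_OS  (1ULL << 17) /* count while in ring 0 */
#define EVTSEL_EN  (1ULL << 22) /* enable the counter    */

/* Hypothetical helper: builds an IA32_PERFEVTSELx value for one event. */
uint64_t perfevtsel_encode(uint8_t event, uint8_t umask,
                           int count_user, int count_os)
{
    /* With neither privilege level selected the counter would never tick. */
    assert(count_user || count_os);

    uint64_t value = (uint64_t)event | ((uint64_t)umask << 8) | EVTSEL_EN;
    if (count_user)
        value |= EVTSEL_USR;
    if (count_os)
        value |= EVTSEL_OS;
    return value;
}
```

With event 0xc0, umask 0x00 and both privilege levels selected this yields 0x4300c0, which would then be written with wrmsr to the event select register paired with the counter being read.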

long long int lo_start, lo_end, hi_start, hi_end;

long msr = IA32_PCM1;

/* program performance counter routines */

asm volatile ("rdmsr" : "=a" (lo_start), "=d" (hi_start) : "c" (msr)); // first read

asm volatile ("mov $0, %%r10" : : : "r10"); // run one dummy instruction

asm volatile ("\trdmsr\n" : "=a" (lo_end), "=d" (hi_end) : "c" (msr)); // read again

After putting the low and high 32-bit values together, the difference between end and start is 3. Shouldn't I expect a difference of 1?
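A sketch of the "putting low and high together" step, plus a delta that tolerates the counter wrapping between the two reads. The helper names combine_hi_lo and counter_delta are made up for illustration. The counter width is an assumption passed in by the caller: it is often 48 bits on Intel cores, and CPUID leaf 0AH reports the actual width.

```c
#include <assert.h>
#include <stdint.h>

/* rdmsr and rdpmc return the low half in EAX and the high half in EDX. */
static inline uint64_t combine_hi_lo(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

/* end - start modulo 2^width_bits, so one wrap-around between the two
 * reads still produces the right difference. */
uint64_t counter_delta(uint64_t start, uint64_t end, unsigned width_bits)
{
    assert(width_bits >= 1 && width_bits <= 64);
    uint64_t mask = (width_bits == 64) ? UINT64_MAX
                                       : ((1ULL << width_bits) - 1);
    return (end - start) & mask;
}
```

For example, counter_delta(combine_hi_lo(lo_start, hi_start), combine_hi_lo(lo_end, hi_end), 48) would give the count between the two reads under the 48-bit assumption.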

Thank you

Patrick_F_Intel1 · ‎08-07-2013

You need to look at the assembly code, not the __asm() stuff. So you'll have to disassemble the compiled code. I think you'll see there are 2 instructions between the two rdmsr instructions.

Xin_X_1 · ‎08-07-2013

Hi Patrick,

You are right, I did see two extra mov instructions between the rdmsr instructions, which explains the reading. The reading seems very accurate even though rdmsr has some latency. Thank you very much.

McCalpinJohn · ‎08-07-2013

Inline RDPMC instructions (or the corresponding RDMSR instruction in the kernel) should count correctly for even very small code sections. I just tested a bunch of loops that did nothing but execute RDPMC instructions and when I used the RDPMC instructions to count branches (Event 0x88, Umask 0x81), the values incremented exactly when they were supposed to -- every iteration for the original loop, and once every 8 iterations when I unrolled the loop by 8.

On the other hand, the RDPMC instruction takes time, and that can distort several aspects of the code under test. The overhead of the RDPMC instruction is almost certain to vary across products. On my Xeon E5-2680 (Sandy Bridge EP) systems, repeated consecutive calls to the cycle counting event (Event 0x3C, Umask 0x00, or the corresponding fixed-function event accessed by executing RDPMC with counter number of (1<<30)+1) almost always show deltas of 39 cycles from one reading to the next when I save the low-order 32-bits into a cache-contained array. This increases by a few cycles if I combine the upper and lower 32-bit results into a 64-bit value and save it into a cache-contained array, with 43 cycles as the most common delta in cycle count values.
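To make the two details above concrete, here is a small sketch: one helper that builds the RDPMC counter number for a fixed-function counter (bit 30 set, plus the counter index), and one that finds the most common delta in an array of saved readings. Both names are hypothetical, the readings are assumed to have been stored already, and the quadratic search is only meant for small arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RDPMC counter number for fixed-function counter n, i.e. (1<<30)+n. */
static inline uint32_t rdpmc_fixed_selector(uint32_t n)
{
    return (1u << 30) + n;
}

/* Most frequent difference between consecutive readings.
 * O(n^2), which is fine for a short cache-resident array. */
uint64_t most_common_delta(const uint64_t *readings, size_t n)
{
    assert(n >= 2);
    uint64_t best = 0;
    size_t best_count = 0;
    for (size_t i = 1; i < n; ++i) {
        uint64_t delta = readings[i] - readings[i - 1];
        size_t count = 0;
        for (size_t j = 1; j < n; ++j) {
            if (readings[j] - readings[j - 1] == delta)
                ++count;
        }
        if (count > best_count) {
            best_count = count;
            best = delta;
        }
    }
    return best;
}
```

Looking at the most common delta rather than the mean keeps an occasional interrupt or cache miss from skewing the estimated overhead.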

Some care needs to be taken with measuring small code sections, since the RDPMC instruction is not guaranteed to be ordered with respect to surrounding instructions. I saw no deviations in my simple loop that was not doing anything except reading the PMC and saving the results, but more complex loops could result in out-of-order execution.

Bernard · ‎08-08-2013

I was thinking about the possibility of a polluted measurement when the latency of the instruction that triggers the measurement is greater than the latency of the profiled instructions. For example, the measured instruction(s) can execute out of order, or even in parallel with the profiling instruction, and because of the shorter latency of the profiled instructions even minuscule changes (in counted CPU cycles) cannot be measured effectively.

Patrick_F_Intel1 · ‎08-08-2013

Hello Illyapolak,

If one is really worried about miscounts resulting from out-of-order instruction flow, one can put a serializing instruction before the rdmsr (or rdpmc). Serializing instructions include cpuid and rdtscp. These instructions will wait until all other instructions have finished and then they will run. So, you will see lots of cycles wasted as you flush the pipeline but you eliminate the out-of-order worries. I've never really run into a situation where I needed to worry about it anyway.
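A rough sketch of that fencing pattern, with cpuid executed before each counter read. To keep it runnable from user space it reads the time-stamp counter with rdtsc instead of rdmsr or rdpmc, which is a substitution made purely for illustration. The names serialize_cpuid, read_tsc, fenced_elapsed and store_loop are all hypothetical, and the non-x86 branch is only a stub so the code still compiles elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
/* cpuid is a serializing instruction: earlier instructions complete first. */
static inline void serialize_cpuid(void)
{
    uint32_t eax = 0, ebx, ecx = 0, edx;
    __asm__ volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                     :
                     : "memory");
    (void)ebx;
    (void)edx;
}

/* Stand-in for the counter read; a kernel module would use rdmsr here. */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
#else
/* Stubs for other architectures: no fencing, no counter. */
static inline void serialize_cpuid(void) {}
static inline uint64_t read_tsc(void) { return 0; }
#endif

/* Runs fn(arg) with a serializing instruction before each counter read. */
uint64_t fenced_elapsed(void (*fn)(void *), void *arg)
{
    assert(fn != NULL);
    serialize_cpuid();
    uint64_t start = read_tsc();
    fn(arg);
    serialize_cpuid();
    uint64_t end = read_tsc();
    return end - start;
}

/* Example workload: 1000 stores, similar to the loop in the first post. */
void store_loop(void *arg)
{
    volatile int *arr = arg;
    for (int i = 0; i < 1000; ++i)
        arr[i] = 1;
}
```

The result includes the cost of the second cpuid, so for short regions it is worth calling fenced_elapsed on an empty function as well and comparing the two numbers.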

Pat

Xin_X_1 · ‎08-08-2013

Hello Dr. McCalpin

Thank you for the information. Your measurement results are similar to mine; the latency is about 30 cycles on Nehalem-EP processors.

Xin_X_1 · ‎08-08-2013

This may be a good idea. So I can just use two CPUID instructions to guard the measured code region, if I don't care about performance degradations.

thanks.

Patrick_F_Intel1 · ‎08-08-2013

And I've seen folks stick in mfence instructions sometimes if they are worried about exact counts of memory loads/stores.

Note that in ring 3 (user land) you may also get interrupts right in the middle of your code, which may mess up your counts.

Sanjeev_D_ · ‎11-04-2015

Hi All,

I tried to obtain hardware performance counter data on Windows (Windows 7, 32-bit, x86, Intel Xeon processor) using a kernel-mode driver, but I was not successful.
In my custom driver, I wrote the following assembly code to read the counter data:

NTSTATUS DriverEntry (...){

__asm {

mov ecx, 0x309; // fixed IA32_PERF_FIXED_CTR0 -- Inst_Retired.Any
rdmsr;
mov lowvalue, eax;
mov highvalue, edx;
}

DbgPrint("MSR output: %x \t %x \r\n", lowvalue, highvalue);

}

Could you please help me find out whether I am making a mistake here, and let me know how I can get this counter data? I also replaced the "rdmsr" instruction with "rdpmc", but that was not successful either.

Thanks in advance for the help.

McCalpinJohn · ‎11-04-2015

It would help to have some idea of what you mean by "not successful"....

The fixed-function performance counter accessed via MSR 0x309 has to be enabled by setting (1) bit 32 of the IA32_PERF_GLOBAL_CTRL MSR (0x38F), and (2) bit 0 (and also bit 1 if you want to count in user space as well as kernel space) of the IA32_FIXED_CTR_CTRL MSR (0x38D).
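As a sketch, the two values described above can be computed like this before being written with wrmsr. The macro and function names are made up for illustration. The code assumes the SDM layout in which each fixed counter owns a 4-bit field in IA32_FIXED_CTR_CTRL (bit 0 of the field for ring 0, bit 1 for ring 3), and it assumes three fixed counters, as mentioned earlier in the thread.

```c
#include <assert.h>
#include <stdint.h>

#define MSR_IA32_FIXED_CTR0       0x309u /* Inst_Retired.Any */
#define MSR_IA32_FIXED_CTR_CTRL   0x38Du
#define MSR_IA32_PERF_GLOBAL_CTRL 0x38Fu

/* Bit to set in IA32_PERF_GLOBAL_CTRL for fixed counter n (bit 32 + n). */
uint64_t global_ctrl_fixed_enable(unsigned n)
{
    assert(n < 3); /* assumption: three fixed-function counters */
    return 1ULL << (32 + n);
}

/* Bits to set in IA32_FIXED_CTR_CTRL for fixed counter n. */
uint64_t fixed_ctr_ctrl_field(unsigned n, int count_os, int count_user)
{
    assert(n < 3);
    uint64_t field = (count_os ? 1ULL : 0) | (count_user ? 2ULL : 0);
    return field << (4 * n);
}
```

A driver would normally read the current contents of each MSR, OR these bits in, and write the result back, so that counters already programmed by other code are not disturbed.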

You also need to be sure that you are reading the "before" and "after" values on the same core. I don't know how that is done in Windows, but in Linux device driver (kernel) code this is usually done by setting up an inter-processor interrupt targeting the desired core so that it will be the one reading the MSR. It is probably also possible to pin the kernel thread to the desired core for the duration of the test (?). Pinning the thread to a single core is also required for user-space code that uses the RDPMC instruction.