I was trying to compare the output of PCM vs PAPI for a software router tool (Click).
I find a non-trivial difference between the output of PAPI and PCM for simple metrics like total cycles and total L3 misses. Each tool is self-consistent (modulo some stochastic noise).
With PAPI: PAPI_TOT_CYC = 560548883, PAPI_L3_TCM = 993702
With PCM: getCycles = 1288193707, getL3CacheMisses = 746465
If I understand the semantics correctly, both are accessing the same hardware counters, and for the same workload the values should be in the same ballpark, so this huge discrepancy is really puzzling.
Are there known issues that result in different outputs with different performance tools?
There are different ways to measure L3 cache misses and they vary with architecture. Therefore, let's focus on the cycles first. Are you measuring cycles on 1 core or on the complete CPU? PCM reports the cycles including turbo mode (in contrast to "reference cycles"). Is PAPI doing the same?
What kind of CPU are you using and how many threads is your workload using?
getCycles returns the count for the CPU_CLK_UNHALTED.THREAD event (which also accounts for Turbo Boost, as Thomas mentioned). Which underlying hardware event does PAPI map PAPI_TOT_CYC to?
In PCM the mapping of getL3CacheMisses to a HW event depends on the processor/architecture type. What is your CPU model?
Are you measuring (aggregated) cycles for all cores, for a particular core, or for just your thread(s) when using PAPI?
PCM can measure cycles/cache misses for particular cores, for sockets, or for the whole system. The scope of the measurement depends on which PCM state object you are using: CoreCounterState, SocketCounterState or SystemCounterState.
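To make the scope question concrete, here is a small sketch of how core, socket and system totals relate. It is plain Python, not the PCM API: the per-core counts, the core-to-socket map and the three helper functions are all invented for illustration.

```python
# Hypothetical per-core unhalted cycle counts on a 4-core, 1-socket machine.
# Core 2 runs the single-threaded workload; the others run background work.
core_cycles = {0: 100_000_000, 1: 50_000_000, 2: 500_000_000, 3: 30_000_000}
core_to_socket = {0: 0, 1: 0, 2: 0, 3: 0}

def core_scope(core):
    """Cycles seen when only one core is measured."""
    return core_cycles[core]

def socket_scope(socket):
    """Cycles seen when all cores of one socket are aggregated."""
    return sum(c for core, c in core_cycles.items()
               if core_to_socket[core] == socket)

def system_scope():
    """Cycles seen when every core in the machine is aggregated."""
    return sum(core_cycles.values())

print(core_scope(2), socket_scope(0), system_scope())
```

With these made-up numbers a system-wide total is noticeably larger than the count for the one core that runs the workload, which is the kind of gap a scope mismatch between two tools would produce.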
Thanks for your inputs.
1. I am running a single-threaded program on an Intel Xeon CPU X5560 @ 2.80GHz (26)
2. I am running PCM in the default mode (I didn't set any of the Core/Socket or System settings)
3. PAPI seems to map PAPI_TOT_CYC to the same counter as well, as far as I can see.
Here's the output of
papi_avail -e PAPI_TOT_CYC
Available events and hardware information.
PAPI Version : 22.214.171.124
Vendor string and code : GenuineIntel (1)
Model string and code : Intel Xeon CPU X5560 @ 2.80GHz (26)
CPU Revision : 5.000000
CPUID Info : Family: 6 Model: 26 Stepping: 5
CPU Megahertz : 2793.259033
CPU Clock Megahertz : 2793
Hdw Threads per core : 1
Cores per Socket : 4
NUMA Nodes : 1
CPU's per Node : 4
Total CPU's : 4
Number Hardware Counters : 16
Max Multiplex Counters : 512
The following correspond to fields in the PAPI_event_info_t structure.
Event name: PAPI_TOT_CYC
Event Code: 0x8000003b
Number of Native Events: 1
Short Description: |Total cycles|
Long Description: |Total cycles|
Developer's Notes: ||
Derived Type: |NOT_DERIVED|
Postfix Processing String: ||
Native Code: 0x40000000 |UNHALTED_CORE_CYCLES|
Number of Register Values: 4
Register[ 0]: 0x0000003c |Event Code|
Register[ 1]: 0x0000003c |Event Code|
Register[ 2]: 0x0000003c |Event Code|
Register[ 3]: 0x0000003c |Event Code|
Native Event Description: |count core clock cycles whenever the clock signal on the specific core is running (not halted). Alias to event CPU_CLK_UNHALTED:THREAD
A few more questions:
How do you run PCM: are you using the command-line pcm.x and starting your program from pcm.x to measure and output the metrics? If you use the command-line pcm.x interface, you can post the output here to shed more light. Or did you instrument your program/function calls using the PCM API (retrieving SystemCounterState objects and calling getCycles on these objects)?
I have some questions about the results you are seeing.
The questions are kind of basic but need to be asked...
What workload are you running while you are taking the measurements?
The PAPI measurement (at 5.6e8 cycles) covers about 0.2 cpu seconds.
The PCM measurement (at 1.29e9 cycles) covers about 0.46 cpu seconds.
If you are measuring a basically idle system then the 'unhalted cycles' can vary quite a bit depending on what random process is running.
Or if the cpus are halted, then the unhalted clockticks won't increment.
I would expect that, if you ran a workload that kept all the CPUs busy for, say, 10 seconds, then PAPI and PCM would agree within a percent or so.
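The cycle-to-time conversion above can be reproduced with a few lines of Python. This is only a sketch: it assumes the 2793 MHz clock that papi_avail reports and ignores Turbo Boost, and the function name is made up for illustration.

```python
# Nominal clock from the papi_avail output (2793 MHz); Turbo Boost is ignored.
NOMINAL_HZ = 2793e6

def cycles_to_cpu_seconds(cycles, hz=NOMINAL_HZ):
    """Convert an unhalted-cycle count to CPU seconds at a fixed clock rate."""
    return cycles / hz

papi_cycles = 560548883    # PAPI_TOT_CYC from the first post
pcm_cycles = 1288193707    # PCM getCycles from the first post

papi_seconds = cycles_to_cpu_seconds(papi_cycles)
pcm_seconds = cycles_to_cpu_seconds(pcm_cycles)
ratio = pcm_cycles / papi_cycles

print(f"PAPI: {papi_seconds:.2f} s  PCM: {pcm_seconds:.2f} s  PCM/PAPI: {ratio:.2f}")
```

The PCM count is roughly 2.3 times the PAPI count, which is much more than run-to-run noise would explain.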
1. This is the "Click" modular router -- I am basically reading an offline packet trace and processing it with some modules within Click
2. I don't think the CPUs are halted on disk reads -- I also ran a similar workload where I load all packets into memory first and find that the numbers are similar
3. I also checked CPU utilization with atop etc., and it is usually close to 100% (and not really stalled while running)
Re: your point about random processes -- I ran several runs and the numbers within each library are self-consistent, and consistently different from the other library.
Does Click/your workload include processing in a Linux kernel module? You can tell if you observe some "system time" in the top or vmstat utility.
Does PAPI account for clock cycles spent outside of your user thread (in the kernel)? Intel PCM accounts for every clock tick on the cores, no matter whether it was system (ring 0) or user (ring 3) CPU time.
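For reference, the user/kernel split is selected by two bits in the event-select register of a programmable counter. The sketch below builds that value for event 0x3C (the code shown in the papi_avail output above); the helper name is made up, and it only illustrates the encoding, not how PAPI or PCM actually program the hardware. In PAPI the equivalent choice is made through the counting domain (PAPI_DOM_USER, PAPI_DOM_KERNEL, PAPI_DOM_ALL), so it is worth checking which one is in effect.

```python
# Bit positions in IA32_PERFEVTSELx (architectural performance monitoring).
USR = 1 << 16   # count while running at privilege levels 1-3 (user)
OS = 1 << 17    # count while running at privilege level 0 (kernel)
EN = 1 << 22    # enable the counter

def perfevtsel(event, umask=0x00, user=True, kernel=True):
    """Build an event-select value for a programmable counter."""
    value = (event & 0xFF) | ((umask & 0xFF) << 8) | EN
    if user:
        value |= USR
    if kernel:
        value |= OS
    return value

# Event 0x3C with umask 0x00 is the unhalted core cycles event.
user_only = perfevtsel(0x3C, user=True, kernel=False)
user_and_kernel = perfevtsel(0x3C, user=True, kernel=True)
print(hex(user_only), hex(user_and_kernel))
```

If one tool counts with only the USR bit set and the other with both USR and OS, the second will report more cycles for the same run whenever the workload spends time in the kernel.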
The rdtsc instruction returns the time stamp counter.
The time stamp counter (on the L5520 Nehalem-based processor) continues counting when the CPU is halted.
The getCycles() routine uses the CPU_CLK_UNHALTED.THREAD event, which stops counting when the CPU is halted.
This is probably the difference that you are seeing.
The getCycles() function returns the CPU_CLK_UNHALTED.THREAD event count. It is the number of core clock cycles during which the clock signal on a specific core is running (not halted).
The counter does not advance in the following conditions:
- an ACPI C-state is other than C0 for normal operation
- STPCLK# pin is asserted
- being throttled by TM1
- during the frequency switching phase of a performance state transition
The getRefCycles() function returns the CPU_CLK_UNHALTED.REF event count, which is the number of reference clock cycles while the clock signal on the core is running. The reference clock operates at a fixed frequency, irrespective of core frequency changes due to performance state transitions. Note that CPU_CLK_UNHALTED.THREAD can exceed the CPU_CLK_UNHALTED.REF event count if Turbo Boost kicks in.
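One practical use of the two counts is estimating the average core frequency while the core was unhalted. The sketch below assumes the reference counter ticks at the nominal (TSC) frequency; the function name and the cycle counts are invented for illustration and were not measured on the X5560.

```python
def average_core_frequency_hz(thread_cycles, ref_cycles, nominal_hz):
    """Estimate average frequency while unhalted: nominal * THREAD / REF."""
    if ref_cycles <= 0:
        raise ValueError("ref_cycles must be positive")
    return nominal_hz * thread_cycles / ref_cycles

# Hypothetical counts for illustration only.
thread_cycles = 3_300_000_000   # CPU_CLK_UNHALTED.THREAD
ref_cycles = 3_000_000_000      # CPU_CLK_UNHALTED.REF
nominal_hz = 2_793_000_000      # 2793 MHz, as reported by papi_avail

freq = average_core_frequency_hz(thread_cycles, ref_cycles, nominal_hz)
print(f"{freq / 1e9:.2f} GHz")
```

A THREAD/REF ratio above 1 means the core ran faster than nominal on average (Turbo Boost); a ratio below 1 means it ran in a lower performance state.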
One can find documentation for the PCM methods in Doxygen format in the cpucounters.h header. HTML documentation can easily be generated from it (the included Doxygen project file is called "Doxyfile").