I started using Vtune to analyze the performance of my application and found that many of the counters are not tallying. Below I just printed a few of them. To start with CPU_CLK_UNHALTED.THREAD is showing a number which doesn't match the CPU time. The CPU I'm using is a Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (Haswell). Based on this I was expecting the cycles spent on the thread to match the CPU time @2.5Ghz. Please note that for this run I just sampled a single logical core and that core was running a single thread exclusively (core is isolated). The above doesn't match by miles, based on the event counter the program only run for CPU_CLK_UNHALTED.THREAD / 2.5Ghz = 0.7 sec!
Besides this we can also see the events counts for the cache misses. They don't tally either. I was expecting L1_miss = L2_hit + L2_miss; L2_miss = L3_hit + L3_miss. There is also the CPI reported... as per the events the CPI would be of 0.013!! This can't be the case as the best theoretical CPI we can achieve is 0.25 (there are 4 execution pipes).
Am I missing something? Could it be that I have a missconfigured Vtunes: as in the event masks and Umask values are being incorrectly passed by vtune because it thinks we are on a different uarchitecture (say Haswell vs Ivy Bridge)? Any hints would be appreciated!
Elapsed Time: 60.059s
CPU Time: 42.334s
CPI Rate: 0.013
Total Thread Count: 1
Paused Time: 0s
Hardware Event Type
Hardware Event Count Hardware Event Sample Count Events Per Sample
CPU_CLK_UNHALTED.REF_TSC 105,834,158,751 52,917 2000003
CPU_CLK_UNHALTED.THREAD 1,770,002,655 885 2000003
INST_RETIRED.ANY 141,332,211,998 70,666 2000003
MEM_LOAD_UOPS_RETIRED.L1_HIT_PS 526,415,792 752 100003
MEM_LOAD_UOPS_RETIRED.L1_MISS_PS 52,654,078,981 3,761 2000003
MEM_LOAD_UOPS_RETIRED.L2_HIT_PS 247,807,434 354 100003
MEM_LOAD_UOPS_RETIRED.L2_MISS_PS 199,505,985 285 100003
MEM_LOAD_UOPS_RETIRED.L3_HIT_PS 47,970,139 137 50021
MEM_LOAD_UOPS_RETIRED.L3_MISS_PS 38,516,170 110 50021
The REF_TSC counter matches the reported CPU seconds perfectly...
The INST_RETIRED.ANY looks fine -- about 1.33 instructions per TSC cycle.
The CPU_CLK_UNHALTED.THREAD number is weird and obviously wrong. (The value implies 80 instructions per cycle, which would be really exciting if it were real....)
There could be some confusion with the user and kernel enable bits.
There could be interference from an NMI watchdog process (these often use CPU_CLK_UNHALTED.THREAD).
The only way I know how to debug this stuff is to use msrtools to dump the performance counter configuration registers during the run. At that point it is usually just as easy to do all the measurements myself and skip using VTune entirely.....
Thanks for the prompt reply!
Regarding NMI watchdog I disabled it (echo 0 > /proc/sys/kernel/nmi_watchdog)
Not only the clock ticks are not matching but also the L1/L2/L3 counts as well. I re run these counters using perf and getting the raw counters from Intel's User Manual Vol2 Chapter 19 for the above mentioned CPU, and these do add up (there is a glitch on L2_Miss = L3_Hit + L3_Miss) but the counters do make much more sense in general...
I'm particularly interested in Vtunes because of the event sampling feature... roughly matching the event counts to the instruction counter (and source line) where they are generated is very useful to pinpoint the code regions that need attention.
Meanwhile I'll take a look into msrtools!
root@test:/md> perf stat -e instructions,r01d1,r08d1,r02d1,r10d1,r04d1,r20d1,r01d3 -C 4 -B sleep 42
Performance counter stats for 'sleep 42':
136220124158 instructions # 0.00 insns per cycle [49.97%]
39301638501 r01d1 [49.92%] <= L1_Hit
136907962 r08d1 [49.95%] <= L1_Miss
92917335 r02d1 [49.98%] <= L2_Hit
44093038 r10d1 [50.00%] <= L2_Miss
19069173 r04d1 [40.06%] <= L3_Hit
1205756 r20d1 [40.07%] <= L3_Miss
1242890 r01d3 [40.04%] <= L3 Missed but serviced by Local Dram
42.000461652 seconds time elapsed\