I have been using VTune for a while and I would appreciate some advice about the metrics I'm trying to measure (I'm using a processor from Harpertown family - Core microarchitecture):
1) Stall time: The processor's documentation states that it can issue/retire up to 4 instructions per cycle. Assuming that the ideal CPI in this case is 0.25, may I compute the relative stall time as (Measured_CPI - 0.25)/Measured_CPI? E.g., assuming that the measured CPI is 1.25, is it correct to say that the total stall time is 80% (i.e., (1.25 - 0.25)/1.25)?
2) L2 miss penalty: How correct/accurate is it to compute the stall time due to L2 misses as L2_misses * avg_mem_latency? Btw, what is the most precise way to measure average memory latency? I've tried to use the counter BUS_REQUEST_OUTSTANDING, as suggested in the Intel 64 and IA-32 Optimization Reference Manual, but the results using this counter do not make sense (in some cases, VTune reports more BUS_REQUEST_OUTSTANDING events than CPU_CLK_UNHALTED.CORE events).
3) L2 cache miss rate: I was wondering whether the built-in "L2 Cache Miss Rate" ratio offered by VTune is inconsistent with what most of us consider the "miss rate" (number of misses in L2 divided by number of accesses to L2). Since "L2 Cache Miss Rate" is computed as L2_LINES_IN.SELF.ANY / INST_RETIRED.ANY, shouldn't it be called "misses per instruction"? Is it correct to compute the L2 miss rate as misses divided by accesses instead?
There are VTune counters for various categories of stall cycles.
CPI is over-rated by several of the documents; for example, the "best" CPI is generated during spin-wait loops.
Attempts have been made to improve measurability of latency in more recent CPU models; it would not surprise me if VTune over-estimated the impact of misses. The primary job is to find out where the misses are significant.
Cache misses retired per instruction might be a useful indicator; a more traditional measure might be misses retired per lines accessed, but that gives less indication of the severity of the workload. I don't know what you'd do with requests; usually there are several requests per line accessed.
It seems there is some scope for opinion here; you probably don't need more of mine.
Yes, these are approximate data; the accuracy depends on the particular user's code sequence.
I guess you might try to measure the latency of main memory on your particular machine by running a simple triad benchmark (a = b + d*c) once with an amount of data fitting into the L2 cache and once with twice the L2 cache size. The difference in clockticks between the two runs (divided by the number of memory accesses) would be an estimate of the latency.
Here is the answer. If you take a look at the VTune help for BUS_REQUEST_OUTSTANDING, it says: "The event counts only full-line cacheable read requests from either the L1 data cache or the L2 prefetchers." So, the big number in your case can be explained by the fact that the latencies caused by the prefetchers were also counted.
As for the statement regarding the BUS_REQUEST_OUTSTANDING event for the Core2 microarchitecture in the Intel 64 and IA-32 Optimization Reference Manual, it is not accurate. Use the MEM_LOAD_RETIRED.L2_LINE_MISS event instead.
Be aware that you can't disable all (four) HW prefetchers from the BIOS.
VTune does offer a direct counter for measuring average memory latency during execution of a real application: for the Core2 microarchitecture it's the MEM_LOAD_RETIRED.L2_LINE_MISS event. But the penalty values evaluated with this event are only approximations.