I have been using VTune for a while and I would appreciate some advice about the metrics I'm trying to measure (I'm using a processor from the Harpertown family, Core microarchitecture):
1) Stall time: The processor's documentation states that it can issue/retire up to 4 instructions per cycle. Assuming that the ideal CPI in this case is 0.25, may I compute the relative stall time as (Measured_CPI-0.25)/Measured_CPI? E.g., assuming that the measured CPI is 1.25, is it correct to say that the total stall time is 80% (1/1.25)?
2) L2 miss penalty: How accurate is it to compute the stall time due to L2 misses as L2_misses * avg_mem_latency? By the way, what is the most precise way to measure average memory latency? I've tried to use the counter "BUS_REQUEST_OUTSTANDING", as suggested in the Intel 64 and IA-32 Optimization Reference Manual, but the results using this counter do not make sense (in some cases, VTune reports more BUS_REQUEST_OUTSTANDING events than CPU_CLK_UNHALTED.CORE events).
3) L2 cache miss rate: I was wondering whether the built-in "L2 Cache Miss Rate" ratio provided by VTune is inconsistent with what most of us consider a "miss rate" (number of misses in L2 divided by number of accesses to L2). Since "L2 Cache Miss Rate" is computed as L2_LINES_IN.SELF.ANY / INST_RETIRED.ANY, shouldn't it be called "misses per instruction"? Is it correct to compute L2 miss rate as:
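The arithmetic in question 1 can be written out as a small sketch. It only restates the poster's own formula, (Measured_CPI - ideal CPI) / Measured_CPI, with the ideal CPI of 0.25 taken from the question; the function name `relative_stall_fraction` is made up for illustration, and whether this ratio is a meaningful measure of stall time is exactly what the thread goes on to discuss.

```python
def relative_stall_fraction(measured_cpi: float, ideal_cpi: float = 0.25) -> float:
    """Fraction of cycles in excess of what the assumed ideal CPI would need.

    ideal_cpi=0.25 is the assumption from the question (4 instructions
    retired per cycle); it is not a measured value.
    """
    if measured_cpi <= 0 or ideal_cpi <= 0:
        raise ValueError("CPI values must be positive")
    if measured_cpi < ideal_cpi:
        raise ValueError("measured CPI is below the assumed ideal CPI")
    return (measured_cpi - ideal_cpi) / measured_cpi


# The example from the question: a measured CPI of 1.25.
print(f"{relative_stall_fraction(1.25):.0%}")  # prints 80%
```

Note that the result depends entirely on the assumed ideal CPI, so it is better read as "cycles above the theoretical minimum" than as time the pipeline was actually stalled.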
There are VTune counters for various categories of stall cycles.
CPI is over-rated by several of the documents; for example, the "best" CPI is generated during spin-wait loops.
Attempts have been made to improve measurability of latency in more recent CPU models; it would not surprise me if VTune over-estimated impact of misses. The primary job is to find out where the misses are significant.
Cache misses retired per instruction might be a useful indicator; a more traditional measure might be misses retired per lines accessed, but that gives less indication of the severity of the workload. I don't know what you'd do with requests; usually there are several requests per line accessed.
It seems there is some scope for opinion here; you probably don't need more of mine.
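To make the distinction between the two indicators concrete, here is a minimal sketch that computes both from raw event counts. The events L2_LINES_IN.SELF.ANY and INST_RETIRED.ANY are the ones named in the question; the L2 access count is passed in as a plain number because the thread does not say which event should supply it, and all sample counts below are invented for illustration.

```python
def l2_misses_per_instruction(l2_lines_in: int, inst_retired: int) -> float:
    """What VTune labels 'L2 Cache Miss Rate' according to the question:
    L2_LINES_IN.SELF.ANY / INST_RETIRED.ANY."""
    if inst_retired <= 0:
        raise ValueError("inst_retired must be positive")
    return l2_lines_in / inst_retired


def l2_miss_ratio(l2_lines_in: int, l2_accesses: int) -> float:
    """The traditional miss rate: misses divided by accesses.

    l2_accesses is an assumption: the thread does not name the event
    that should be used as the denominator.
    """
    if l2_accesses <= 0:
        raise ValueError("l2_accesses must be positive")
    return l2_lines_in / l2_accesses


# Invented sample counts, not measurements.
lines_in = 2_000_000
instructions = 1_000_000_000
accesses = 40_000_000

print("misses per instruction:", l2_misses_per_instruction(lines_in, instructions))
print("misses per access:     ", l2_miss_ratio(lines_in, accesses))
```

The two numbers answer different questions: the first scales with how memory-intensive the code is, while the second only describes how well the accesses that do reach L2 are served.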
Harpertown is the code name of a quad-core server processor, Intel Core 2 family.
Please read the article http://assets.devx.com/goparallel/18027.pdf by Dr. David Levinthal.
-Peter
Yes, these are approximate values; they depend on the particular code sequence in the user's application.
Regards, Peter
I guess you might try to measure the latency of the main memory for your particular machine by running a simple triad benchmark (a = b + d*c) twice: once with an amount of data that fits into the L2 cache, and once with twice the L2 cache size. The difference in clockticks between the two runs, divided by the number of memory accesses, would be an estimate of the latency.
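The calculation suggested above can be sketched as follows. This only covers the final arithmetic, not the benchmark itself; the function name and the sample numbers are invented, and the sketch assumes both runs perform the same number of memory accesses so that the clocktick difference can be attributed to the accesses that no longer hit in L2.

```python
def estimate_memory_latency(clockticks_in_l2: int,
                            clockticks_beyond_l2: int,
                            memory_accesses: int) -> float:
    """Extra clockticks per memory access when the data no longer fits in L2.

    Assumes both triad runs execute the same number of memory accesses;
    that is an assumption of this sketch, not something the thread states.
    """
    if memory_accesses <= 0:
        raise ValueError("memory_accesses must be positive")
    return (clockticks_beyond_l2 - clockticks_in_l2) / memory_accesses


# Invented sample numbers, not measurements.
extra = estimate_memory_latency(
    clockticks_in_l2=50_000_000,
    clockticks_beyond_l2=130_000_000,
    memory_accesses=10_000_000,
)
print(f"about {extra:.1f} extra clockticks per access")
```

Because a triad loop streams through memory, hardware prefetching and overlapping misses will probably hide part of the latency, so the result is best treated as an effective per-access cost rather than the raw memory latency.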
Yes, it's Core microarchitecture (processor is a Xeon 5420).
Here is the answer. If you take a look at the VTune help for BUS_REQUEST_OUTSTANDING, it says: "The event counts only full-line cacheable read requests from either the L1 data cache or the L2 prefetchers." So the large number in your case can be explained by the fact that latencies caused by the prefetchers were also counted.
As for the statement regarding the BUS_REQUEST_OUTSTANDING event for the Core 2 microarchitecture in the Intel 64 and IA-32 Optimization Reference Manual, it is not accurate. Use the MEM_LOAD_RETIRED.L2_LINE_MISS event instead.
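As a rough illustration of the estimate from question 2 using the event recommended here, the sketch below multiplies the MEM_LOAD_RETIRED.L2_LINE_MISS count by an assumed average memory latency and relates it to CPU_CLK_UNHALTED.CORE. The function name, the latency of 300 cycles, and the event counts are all hypothetical; the latency in particular has to come from a separate measurement or estimate.

```python
def l2_miss_stall_share(l2_line_misses: int,
                        avg_mem_latency_cycles: float,
                        core_clockticks: int) -> float:
    """Estimated share of core cycles spent waiting on L2 misses.

    l2_line_misses  : count of MEM_LOAD_RETIRED.L2_LINE_MISS
    core_clockticks : count of CPU_CLK_UNHALTED.CORE
    avg_mem_latency_cycles is an assumed value, not something this
    function can measure.
    """
    if core_clockticks <= 0:
        raise ValueError("core_clockticks must be positive")
    estimated_stall_cycles = l2_line_misses * avg_mem_latency_cycles
    return estimated_stall_cycles / core_clockticks


# Hypothetical counts and a hypothetical latency of 300 cycles.
share = l2_miss_stall_share(
    l2_line_misses=1_000_000,
    avg_mem_latency_cycles=300,
    core_clockticks=1_500_000_000,
)
print(f"estimated L2-miss stall share: {share:.1%}")
```

The product treats every miss as if it stalled the core for the full latency. Misses that overlap with each other or with other work cost less than that, so the figure is an approximation and can even exceed 100% of the measured cycles.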
Be aware that you can't disable all four HW prefetchers from the BIOS.
VTune does offer a direct counter for measuring average memory latency during execution of a real application. For the Core 2 microarchitecture it's the MEM_LOAD_RETIRED.L2_LINE_MISS event. But the penalty values evaluated with this event are only approximations.