I have been using VTune for a while and I would appreciate some advice about the metrics I'm trying to measure (I'm using a processor from Harpertown family - Core microarchitecture):
1) Stall time: The processor's documentation states that it can issue/retire up to 4 instructions per cycle. Assuming that the ideal CPI in this case is 0.25, may I compute the relative stall time as (Measured_CPI - 0.25)/Measured_CPI? E.g., assuming that the measured CPI is 1.25, is it correct to say that the total stall time is 80% (i.e., (1.25 - 0.25)/1.25)?
2) L2 miss penalty: How correct/accurate is it to compute the stall time due to L2 misses as L2_misses * avg_mem_latency? Btw, what is the most precise way to measure average memory latency? I've tried to use the counter BUS_REQUEST_OUTSTANDING, as suggested in the Intel 64 and IA-32 Optimization Reference Manual, but the results using this counter do not make sense (in some cases, VTune reports more BUS_REQUEST_OUTSTANDING events than CPU_CLK_UNHALTED.CORE events).
3) L2 cache miss rate: I was wondering whether the built-in "L2 Cache Miss Rate" ratio offered by VTune is inconsistent with what most of us consider the "miss rate" (number of misses in L2 divided by number of accesses to L2). Since "L2 Cache Miss Rate" is computed as L2_LINES_IN.SELF.ANY / INST_RETIRED.ANY, shouldn't it be called "misses per instruction"? Is it correct to compute the L2 miss rate as misses divided by accesses instead?
There are VTune counters for various categories of stall cycles.
CPI is over-rated by several of the documents; for example, the "best" CPI is generated during spin-wait loops.
Attempts have been made to improve measurability of latency in more recent CPU models; it would not surprise me if VTune over-estimated the impact of misses. The primary job is to find out where the misses are significant.
Cache misses retired per instruction might be a useful indicator; a more traditional measure might be misses retired per lines accessed, but that gives less indication of the severity of the workload. I don't know what you'd do with requests; usually there are several requests per line accessed.
It seems there is some scope for opinion here; you probably don't need more of mine.
Yes, these are approximate data; the accuracy depends on the particular user's code sequence.
I guess you might try to measure the latency of main memory on your particular machine by running a simple triad benchmark (a = b + d*c) once with an amount of data fitting into the L2 cache and once with twice the L2 cache size. The difference in clockticks between the two runs (divided by the number of memory accesses) would be an estimate of the latency.
Here is the answer. If you take a look at the VTune help for BUS_REQUEST_OUTSTANDING, it says: "The event counts only full-line cacheable read requests from either the L1 data cache or the L2 prefetchers." So, the big number in your case can be explained by the fact that the latencies caused by the prefetchers were also counted.
As for the statement regarding the BUS_REQUEST_OUTSTANDING event for the Core2 microarchitecture in the Intel 64 and IA-32 Optimization Reference Manual, it is not accurate. Use the MEM_LOAD_RETIRED.L2_LINE_MISS event instead.
Be aware that you can't disable all (four) HW prefetchers from the BIOS.
VTune does offer a direct counter for measuring average memory latency during execution of a real application: for the Core2 microarchitecture it's the MEM_LOAD_RETIRED.L2_LINE_MISS event. But the penalty values evaluated with this event are only approximations.