Hello. I'm working on Intel Xeon CPU E7-4850 v2 @ 2.30GHz.
I want to break down program execution time into computation time, memory stalls, branch mispredictions, and resource stalls.
To do this, I need the branch misprediction penalty, cache miss penalties, and TLB miss penalties. (For example, L1D cache latency is about 4 cycles, and so on.)
How can I know those penalties of the processor I am working on?
TMA (Top-Down Microarchitecture Analysis) methodology is probably what you are looking for.
Latest spreadsheet with formulas: https://download.01.org/perfmon/TMA_Metrics.xlsx
Tools I know of which implement this methodology:
- Perf toplev (https://github.com/andikleen/pmu-tools/wiki/toplev-manual)
Thank you for your kind answer!
I have three more questions.
(I am working on an Intel(R) Xeon(R) Processor E7-4850 v2, whose codename is Ivy Bridge.)
1) In the spreadsheet you gave, if an entry in one column is empty, does it mean I can use the entry from the column to its right?
For example, the “L3_Bound” entry for IVB is empty. Does it mean that the formula for L3_Bound on IVB is the same as the formula for L3_Bound on SNB?
2) I calculated L2 bound using the formula in the spreadsheet (L2 Bound = (CYCLE_ACTIVITY.STALLS_L1D_PENDING - CYCLE_ACTIVITY.STALLS_L2_PENDING) / CLOCKS), but I got a negative value for L2 bound.
Specifically, the following are the counter values that I got.
Why did this happen? Is it a measurement error caused by sampling? Can I just ignore the negative value and use 0%?
(When I use general exploration mode in VTune, it shows L2 bound 0%)
3) I want to calculate the execution time excluding every stall time (including memory stall and resource stall).
Is the equation “CPU_CLK_UNHALTED.THREAD - CYCLE_ACTIVITY.STALLS_TOTAL - RESOURCE_STALLS.ANY” right for measuring the total cycle excluding stall cycles?
1) Yes, if a cell is empty you should use the closest non-empty one to its right.
2) Yes, if the value is < 0 you can treat it as 0. This is what VTune does.
3) Probably yes. But I'm not completely sure that CYCLE_ACTIVITY.STALLS_TOTAL and RESOURCE_STALLS.ANY don't overlap. Note that this gives you non-stalled cycles, not the wall time.
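The two calculations above can be sketched as follows. The event names are the real Ivy Bridge events from the thread, but the counter readings here are hypothetical numbers for illustration only:

```python
def l2_bound(stalls_l1d_pending, stalls_l2_pending, clocks):
    """L2 Bound per the TMA spreadsheet formula, clamped at 0 as VTune does."""
    return max(0.0, (stalls_l1d_pending - stalls_l2_pending) / clocks)

def non_stall_cycles(clk_unhalted, stalls_total, resource_stalls_any):
    """Cycles excluding stalls; may undercount if the two stall events overlap."""
    return clk_unhalted - stalls_total - resource_stalls_any

# Hypothetical counter readings (STALLS_L2_PENDING > STALLS_L1D_PENDING,
# so the raw L2 Bound value is negative and gets clamped to 0.0):
print(l2_bound(stalls_l1d_pending=1_000_000,
               stalls_l2_pending=1_050_000,
               clocks=10_000_000))
print(non_stall_cycles(clk_unhalted=10_000_000,
                       stalls_total=3_000_000,
                       resource_stalls_any=500_000))
```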
Can I ask two more questions?
1) When I used general exploration analysis of VTune, I got the result as follows.
- Memory Bound: 4.2%
- L1 Bound: 2.7%
- L2 Bound: 1.4%
- L3 Bound: 2.6%
- DRAM Bound: 2.1%
- Store Bound: 8.0%
The sum of the values of lower-level items is not the same as the value of the higher-level item.
Is it just an error caused by sampling? Or, is it because lower-level items overlap each other?
2) Does "L1 Bound" in general exploration analysis of VTune include stall times other than pure cache/memory latencies? (e.g., a stall time for data dependency)
((#STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_PENDING) / CLKS)
I want to calculate pure cache latencies, but the optimization manual says, "Yet in certain cases, like loads blocked on older stores, a load might suffer high latency while eventually being satisfied by the L1."
If L1 Bound includes other kinds of stall times, how can I measure pure L1 cache costs?
1) These metrics are in different units. Memory Bound is measured in Pipeline Slots, while its children are measured in Cycles. The metrics measured in Pipeline Slots (the top two levels) sum up exactly to their parent, while the metrics in Cycles often don't. The main reasons are that stalls overlap (as you pointed out) and that the CPU lacks accurate monitoring capabilities, which means the metrics are often estimates (sometimes quite rough).
2) Yes, L1 Bound can include e.g. stalls due to a TLB miss, etc. Each cache level has a minimal latency defined by the microarchitecture, and there can be many reasons for the latency to be bigger than that minimum. E.g., the nodes under L1 Bound should represent most of the reasons why loads from L1 can take more than the minimal 4 cycles. If you provide more details on the problem you are trying to solve, I may be able to give more advice.
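A small sketch of why the children need not sum to the parent (all numbers here are hypothetical, loosely shaped after the VTune percentages above): each child is stall_cycles / CLKS, and overlapping stalls are counted in more than one child, while the parent is a fraction of pipeline slots, a different unit entirely:

```python
clks = 10_000_000                      # hypothetical CPU_CLK_UNHALTED.THREAD
children_stall_cycles = {              # hypothetical; stalls may overlap in time
    "L1 Bound":    270_000,
    "L2 Bound":    140_000,
    "L3 Bound":    260_000,
    "DRAM Bound":  210_000,
    "Store Bound": 800_000,
}
# Sum of children, each expressed as a fraction of cycles:
children_sum = sum(c / clks for c in children_stall_cycles.values())

memory_bound_slots = 0.042             # parent, as a fraction of pipeline slots

# The two numbers are not expected to match: different units, overlapping stalls.
print(f"sum of children: {children_sum:.3f}, parent: {memory_bound_slots:.3f}")
```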
I am trying to find the main reason that causes performance difference between two different implementations.
To do this, I am trying to break down execution time into computation time (the minimal estimated time based on the number of micro-operations), stall time related to the memory hierarchy (excluding any other stall times), branch misprediction penalty, and resource stall time (including stall time due to functional unit unavailability, dependencies among instructions, and platform-specific characteristics).
Since I am interested in memory stall time, I want to further break down memory stall time into L1-D cache hit latency, L1-I cache hit latency, L2 cache hit latency, L3 cache hit latency, DRAM latency, DTLB latency and ITLB latency.
1) Can I get the pure memory stall time, excluding any other stall time, either by subtracting some metrics or by calculating (# of misses) * (minimal latency)?
- If the former is possible, could you give me the set of equations for each stall time?
- If the latter is possible, could you give me the equations to get the number of misses and the minimal latency of each cache level (codename Ivytown)? (Since there are so many hardware counters related to cache misses, I don't know which events I have to use to get the number of cache misses.)
2) How can I calculate computation time, branch misprediction penalty, and resource stall time in the same sense?
(This breakdown is based on: Ailamaki, A., DeWitt, D. J., Hill, M. D., and Wood, D. A., "DBMSs On A Modern Processor: Where Does Time Go?," In Proceedings of the 25th International Conference on Very Large Data Bases, pp. 266-277, Sept. 1999.)
I'm afraid this is unlikely to be possible accurately. TMA actually tries to do this - e.g. it breaks L1 Bound down into DTLB, store forwarding, split loads, etc. But most of these metrics are (rough) estimates.
To summarize - I would suggest using TMA. It is the best breakdown you can get at this point.
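For completeness, the naive (# of misses) * (minimal latency) model from question 1) is easy to sketch. The event names below follow Ivy Bridge's MEM_LOAD_UOPS_RETIRED.* precise load events, but the latency numbers are rough assumptions (not measured values for this specific part), and the counter readings are hypothetical:

```python
# Approximate minimal latencies in cycles - ASSUMED values, tune for your part.
APPROX_LATENCY = {
    "L2_HIT":   12,     # load satisfied by L2
    "LLC_HIT":  30,     # load satisfied by L3 (LLC)
    "LLC_MISS": 200,    # load went to DRAM
}

def est_memory_stall_cycles(counts):
    """counts maps an event suffix (e.g. 'L2_HIT') to its retired-load count."""
    return sum(counts.get(ev, 0) * lat for ev, lat in APPROX_LATENCY.items())

# Hypothetical counter readings:
print(est_memory_stall_cycles({"L2_HIT": 50_000,
                               "LLC_HIT": 10_000,
                               "LLC_MISS": 2_000}))
```

Note that this model ignores memory-level parallelism and overlap with computation, so it typically overestimates the actual stall time; that is why the TMA estimates, rough as they are, are still the better choice.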