I am trying to optimize my applications running on Intel platforms. I use a tool named toplev, which implements TMAM to analyze application performance (much like VTune on Windows). I read the source file ivb_client_ratios.py, which defines the metrics for the Ivy Bridge platform. Under L1_Bound there is a metric named DTLB_Load. As far as I know, L1_Bound estimates how often the CPU was stalled without loads missing the L1 data cache. So the key point here is 'stalled': in my understanding, it counts cycles only when RS dispatch is pending, and the reason it is pending is an L1D hit. But the DTLB_Load metric gets its value from the formula (Mem_STLB_Hit_Cost * EV("DTLB_LOAD_MISSES.STLB_HIT", 4) + EV("DTLB_LOAD_MISSES.WALK_DURATION", 4)) / CLKS(self, EV, 4). I can't see any stall information in this formula. Does that mean this metric includes all DTLB cycles, not only the cycles seen while dispatch is pending? I would have thought that every Memory Bound metric must be counted only while dispatch is pending, right?
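To make the formula concrete, here is a minimal Python sketch that evaluates it from raw counter values. The function name and its parameters are hypothetical and are not part of toplev; `stlb_hit_cost` stands in for Mem_STLB_Hit_Cost, and the value 7 used in the example is only a placeholder, so check the constant in your copy of ivb_client_ratios.py.

```python
def dtlb_load_fraction(stlb_hits, walk_duration_cycles, clks, stlb_hit_cost):
    """Evaluate the DTLB_Load formula quoted above from raw counts.

    stlb_hits            -- count of DTLB_LOAD_MISSES.STLB_HIT
    walk_duration_cycles -- count of DTLB_LOAD_MISSES.WALK_DURATION
    clks                 -- core clock cycles for the same interval
    stlb_hit_cost        -- assumed cycles charged per STLB hit
                            (stand-in for Mem_STLB_Hit_Cost)
    """
    if clks <= 0:
        raise ValueError("clks must be positive")
    return (stlb_hit_cost * stlb_hits + walk_duration_cycles) / clks


# Made-up counts for illustration only; 7 is a placeholder cost.
fraction = dtlb_load_fraction(
    stlb_hits=1_000,
    walk_duration_cycles=50_000,
    clks=1_000_000,
    stlb_hit_cost=7,
)
print(f"DTLB_Load = {fraction:.3f}")
```

Note that nothing in this calculation refers to a stall counter: every input is either an event count or a duration, which is exactly what the question is about.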
Unambiguous attribution of stall cycles to specific events is seldom possible in modern OOO processors. There are too many special cases (many of which involve undocumented features of the implementation), too many cases of overlapping causes, and too many bugs in the various counters for this to be even close to precise.
I am sure you are aware of the Intel performance counter event CYCLE_ACTIVITY (0xA3), which has the ability to count cycles for which there is both a "dispatch stall" (i.e., no micro-ops issued to any port in that cycle) and there is a demand load miss outstanding at some level of the memory hierarchy. For the case of an L1 miss/L2 hit, there is a fair chance that the processor can do enough processing out of order to avoid stalls -- so stalls that do occur are not necessarily caused by the L1 load miss. For the case of an L1 miss/L3 hit, there are a lot more cycles of waiting, and a much higher probability that any stalls observed were due to the load miss. For the case of an L1 miss/L3 miss, there are typically a few hundred cycles of waiting, and it is reasonable to assume that most of the stalls incurred were due to the load miss. But these are probabilities, and not precise measurements. I have seen only one case where the results looked "exact", but that was a pointer-chasing code that only had one instruction that could be executed at a time (due to dependencies). It took 163 cycles per load, and the stall counter reported 162 cycles of "stalls" per 163 cycles. Unfortunately it does not take a lot of complexity in the code to make the results very hard to interpret quantitatively.
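As a small illustration of the pointer-chasing case, the stalled share of cycles is just the ratio of the two counts. This is a minimal sketch; the function name is hypothetical, and the inputs are whatever stall-cycle and total-cycle counts you collected over the same interval.

```python
def stall_fraction(stall_cycles, total_cycles):
    """Fraction of cycles that a stall counter reported as stalled."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return stall_cycles / total_cycles


# The pointer-chasing example: 162 stall cycles per 163-cycle load.
print(f"{stall_fraction(162, 163):.3f}")  # prints 0.994
```

A ratio this close to 1 is only easy to interpret because the code had a single dependent load in flight; with more independent work available, the same ratio says much less about which miss caused which stall.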
Historically speaking, it has been hard for OOO processors to overlap a lot of work with DTLB misses (especially when they also miss in the STLB). Because of this, I would be willing to use DTLB_LOAD_MISSES.WALK_DURATION as a proxy for the "cost" (in stall cycles) associated with DTLB misses. (It will be biased high because it ignores overlaps, but you might try to partially compensate for this by subtracting a fixed number of cycles per DTLB_LOAD_MISSES event to account for the portion of each Page Table Walk that might be overlapped with OOO work.)
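The compensation suggested above could be sketched as follows. This is a rough illustration, not a calibrated model: the function name is hypothetical, `walks` is whichever DTLB_LOAD_MISSES sub-event you choose to count page table walks with, and `overlap_cycles_per_walk` is a guess you would have to tune for your own workload.

```python
def dtlb_stall_estimate(walk_duration_cycles, walks, overlap_cycles_per_walk):
    """Estimate stall cycles caused by DTLB load misses.

    Starts from DTLB_LOAD_MISSES.WALK_DURATION (biased high, since it
    ignores overlap) and subtracts a fixed, assumed number of cycles
    per walk that may have been hidden by out-of-order execution.
    The result is clamped at zero so that an overly generous overlap
    guess cannot produce a negative cost.
    """
    overlapped = overlap_cycles_per_walk * walks
    return max(0, walk_duration_cycles - overlapped)


# Made-up counts: 1,000 walks taking 50,000 cycles in total,
# assuming 10 cycles of each walk were overlapped with other work.
print(dtlb_stall_estimate(50_000, 1_000, 10))  # prints 40000
```

Dividing the result by the cycle count for the same interval gives a fraction comparable to the DTLB_Load metric, with the same caveat that it is an estimate and not a measurement.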
For IVB and newer processors, it is important to be aware that the "Next Page Prefetcher" will often cause a Page Table Walk in advance of the stream of loads, so that you don't get any DTLB_LOAD_MISSES events for contiguous access streams. This is good news, but can be confusing if you don't realize that it is happening. You don't want to count the page table walk time for these "prefetch" table walks because they are unlikely to cause stalls. Fortunately the DTLB_LOAD_MISSES.WALK_DURATION only counts duration for table walks due to demand loads and not the duration for table walks due to the next page prefetcher, so it remains the best available counter.