The documentation for
dtlb_load_misses.demand_ld_walk_duration on Haswell says
[Demand load cycles page miss handler (PMH) is busy with this walk]
Whereas the documentation for dtlb_store_misses.walk_duration says
[Cycles when PMH is busy with page walks]
I am puzzled by the terminology "busy with this walk" vs. "busy with page walks".
Should they both say "busy with page walks"?
So if I run
perf stat -e cycles,instructions,dtlb_load_misses.walk_duration
on a given command and get
Performance counter stats for 'system wide':

   291,350,355,880      cycles
    36,361,479,212      instructions              #    0.12  insn per cycle
    30,179,920,415      dtlb_load_misses.walk_duration
        43,668,809      dtlb_store_misses.walk_duration

      96.873898071 seconds time elapsed
does this mean that I'm spending 30,179,920,415 + 43,668,809 cycles out of 291,350,355,880 cycles on page table walking for dTLB misses? If so, am I spending about 10% of the total time on page table walking? Is this correct?
The terminology of these events can be frustrating -- it is always hard to tell if different words mean something different, or if they were just changed to add variety to the documentation....
I don't see an event named "dtlb_load_misses.demand_ld_walk_duration" in any Intel documentation -- where did you find that name?
Section 19.7 of Volume 3 of the Intel SW Developer's Manual says that on Haswell, the event DTLB_LOAD_MISSES.WALK_DURATION (Event 0x08, Umask 0x10) measures "Cycle PMH is busy with a walk", while the event DTLB_STORE_MISSES.WALK_DURATION (Event 0x49, Umask 0x10) measures "Cycles PMH is busy with this walk". This may mean exactly the same thing, or it may be a way to avoid saying that DTLB_LOAD_MISSES.WALK_DURATION might be contaminated by cycles in which the PMH is executing walks on behalf of the Next-Page-Prefetcher (which was introduced in Ivy Bridge, and is the subject of almost no official documentation).

On Haswell, my testing indicates that the event PAGE_WALKER_LOADS increments both for walks due to demand loads/stores and for walks due to the next-page-prefetcher. Differences between the sum of the ITLB_MISSES, DTLB_LOAD_MISSES, and DTLB_STORE_MISSES events and the counts from PAGE_WALKER_LOADS can therefore be used to infer the presence of next-page-prefetcher activity. I don't know if anyone has done systematic testing, but I found that if I load data from every other 4KiB page, the number of DTLB_LOAD_MISSES is cut in half, but the total number of PAGE_WALKER_LOADS is the same (since the next-page-prefetcher loads the page translations that I skip over).
The "dtlb_load_misses.demand_ld_walk_duration" is one of the Ivy Bridge tlb events you get if you do
4x10x2 > perf list | grep tlb
  mem_uops_retired.stlb_miss_loads
  mem_uops_retired.stlb_miss_stores
  dtlb_load_misses.demand_ld_walk_completed
  dtlb_load_misses.demand_ld_walk_duration   << ====================
  dtlb_load_misses.large_page_walk_completed
  dtlb_load_misses.miss_causes_a_walk
  dtlb_load_misses.stlb_hit
  dtlb_load_misses.walk_completed
  dtlb_load_misses.walk_duration
  dtlb_store_misses.miss_causes_a_walk
  dtlb_store_misses.stlb_hit
  dtlb_store_misses.walk_completed
  dtlb_store_misses.walk_duration
  itlb.itlb_flush
  itlb_misses.large_page_walk_completed
  itlb_misses.miss_causes_a_walk
  itlb_misses.stlb_hit
  itlb_misses.walk_completed
  itlb_misses.walk_duration
  tlb_flush.dtlb_thread
  tlb_flush.stlb_any
To get the corresponding Intel event probably requires looking at the perf code.
"DTLB_LOAD_MISSES.DEMAND_LD_WALK_DURATION" is a name used by OProfile for Ivy Bridge, where it is listed as using Umask=0x84 (https://oprofile.sourceforge.io/docs/intel-ivybridge-events.php). This name and Umask are also used by the Intel documentation at https://download.01.org/perfmon/IVT/ivytown_core_v20.json, but only for IvyTown -- not for any other processor model.
Table 19-15 of Volume 3 of the SWDM says that Event 0x08, Umask 0x84 counts "cycles PMH is busy with a walk due to demand loads". But comparing the DTLB_LOAD_MISSES (Event 0x08) encodings from Ivy Bridge (Table 19-15) and Haswell (Table 19-11) strongly suggests that the encodings for these masks have changed. Curiously, no sub-event uses exactly the same Umask across these two tables, while sub-events described in very similar words have very different Umask encodings. A change in encoding is often an indication that something important has changed in the definitions of the events -- so every variation of the event has to be re-tested against a carefully constructed set of microbenchmarks....
The answer to the original query ("am I spending 10% of my time in table walking?") is probably, but not definitely, "yes".
The change in wording (dropping the term "demand loads" in the "duration" sub-event) remains a concern. It should be possible to create a fairly simple set of tests that will disambiguate these issues. I would recommend measuring all documented sub-events of DTLB_LOAD_MISSES, DTLB_STORE_MISSES, and PAGE_WALKER_LOADS against a few test patterns: "small" arrays whose pages all fit in the DTLB, "medium" arrays whose pages overflow the DTLB but fit in the STLB, and "large" arrays whose pages overflow the STLB -- each accessed from every 4KiB page, from every other 4KiB page, and touching only one cache line per page.
Although nothing ever works out quite as expected, one would hope that (compared to the number of pages accessed) the "small" cases would show almost all DTLB hits, the "medium" cases would show most accesses missing in the DTLB and hitting in the STLB, and the "large" cases would show most accesses missing in both the DTLB and the STLB and causing walks. The tests using every other 4KiB page should show whether the TLB lookups created by the Next-Page-Prefetcher are included in the counts. (I expect them in PAGE_WALKER_LOADS and not in the DTLB_LOAD_MISSES events.) Reading only one cache line from each 4KiB page should minimize the probability that the next-page-prefetcher is activated, and reading only one cache line from every other 4KiB page should (fingers crossed) never cause the next-page-prefetcher to activate.
There is not much point in using the performance counter event names provided by perf -- the translations may change between kernel revisions, and may mean different things on different processors. It only takes looking up a few of these events to find errors in the translations used. The location of these events in the kernel source tree also seems to move about randomly from one kernel version to the next.