topic Hi John, in Software Tuning, Performance Optimization & Platform Monitoring

interpretation of dtlb_load_misses.demand_ld_walk_duration and dtlb_store_misses.walk_duration

gostanian__richard — Wed, 19 Feb 2020 01:16:26 GMT

Hi Everyone,

The documentation for

dtlb_load_misses.demand_ld_walk_duration on Haswell says

[Demand load cycles page miss handler (PMH) is busy with this walk]

Whereas the documentation for dtlb_store_misses.walk_duration says

[Cycles when PMH is busy with page walks]

I am puzzled by the terminology "busy with this walk" vs "busy with page walks".

Should they both say "busy with page walks"?

So if I run

perf stat -e cycles,instructions,dtlb_load_misses.walk_duration,dtlb_store_misses.walk_duration

on a given command and get

 Performance counter stats for 'system wide':

   291,350,355,880      cycles                                                      
    36,361,479,212      instructions              #    0.12  insn per cycle                                            
    30,179,920,415      dtlb_load_misses.walk_duration                                   
        43,668,809      dtlb_store_misses.walk_duration                                   

      96.873898071 seconds time elapsed

does this mean that I'm spending 30,179,920,415 + 43,668,809 cycles out of 291,350,355,880 cycles on page table walking for DTLB misses? If so, then I am spending about 10% of the total time page table walking. Is this correct?
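The arithmetic behind that estimate can be checked with a few lines of Python. This is only a sketch: it assumes the two walk_duration counts can simply be added and compared directly against the cycles count, which is exactly the interpretation being asked about.

```python
# Counter values copied from the perf stat output above.
cycles = 291_350_355_880
load_walk_cycles = 30_179_920_415   # dtlb_load_misses.walk_duration
store_walk_cycles = 43_668_809      # dtlb_store_misses.walk_duration

# Assumption: the two walk-duration counts are additive and directly
# comparable to the cycles count (the interpretation in question).
walk_cycles = load_walk_cycles + store_walk_cycles
walk_fraction = walk_cycles / cycles

print(f"page-walk cycles: {walk_cycles:,}")
print(f"fraction of cycles: {walk_fraction:.1%}")
```

Under that assumption the store-side walks contribute almost nothing; the load-side walk duration accounts for nearly all of the total.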

McCalpinJohn — Wed, 19 Feb 2020 22:30:44 GMT

The terminology of these events can be frustrating -- it is always hard to tell if different words mean something different, or if they were just changed to add variety to the documentation....

I don't see an event named "dtlb_load_misses.demand_ld_walk_duration" in any Intel documentation -- where did you find that name?

Section 19.7 of Volume 3 of the Intel SW Developer's Manual says that on Haswell, the event DTLB_LOAD_MISSES.WALK_DURATION (Event 0x08, Umask 0x10) measures "Cycle PMH is busy with a walk", while the event DTLB_STORE_MISSES.WALK_DURATION (Event 0x49, Umask 0x10) measures "Cycles PMH is busy with this walk". This may mean exactly the same thing, or it may be a way to avoid saying that the DTLB_LOAD_MISSES.WALK_DURATION might be contaminated by cycles that the PMH is executing walks on behalf of the Next-Page-Prefetcher (which was introduced in Ivy Bridge, and is the subject of almost no official documentation). On Haswell, my testing indicates that the event PAGE_WALKER_LOADS increments for both walks due to demand loads/stores and walks due to the next-page-prefetcher. Differences between the sum of ITLB_MISSES, DTLB_LOAD_MISSES, and DTLB_STORE_MISSES events and the counts from PAGE_WALKER_LOADS can be used to infer the presence of next-page-prefetcher activity. I don't know if anyone has done systematic testing, but I found that if I load data from every other 4KiB page, the number of DTLB_LOAD_MISSES is cut in half, but the total number of PAGE_WALKER_LOADS is the same (since the next-page-prefetcher loads the page translations that I skip over).

gostanian__richard — Thu, 20 Feb 2020 18:12:04 GMT

Hi John,

The "dtlb_load_misses.demand_ld_walk_duration" is one of the Ivy Bridge tlb events you get if you do

4x10x2 > perf list |grep tlb
  mem_uops_retired.stlb_miss_loads                  
  mem_uops_retired.stlb_miss_stores                 
  dtlb_load_misses.demand_ld_walk_completed         
  dtlb_load_misses.demand_ld_walk_duration      << ====================      
  dtlb_load_misses.large_page_walk_completed        
  dtlb_load_misses.miss_causes_a_walk               
  dtlb_load_misses.stlb_hit                         
  dtlb_load_misses.walk_completed                   
  dtlb_load_misses.walk_duration                    
  dtlb_store_misses.miss_causes_a_walk              
  dtlb_store_misses.stlb_hit                        
  dtlb_store_misses.walk_completed                  
  dtlb_store_misses.walk_duration                   
  itlb.itlb_flush                                   
  itlb_misses.large_page_walk_completed             
  itlb_misses.miss_causes_a_walk                    
  itlb_misses.stlb_hit                              
  itlb_misses.walk_completed                        
  itlb_misses.walk_duration                         
  tlb_flush.dtlb_thread                             
  tlb_flush.stlb_any

To get the corresponding Intel event probably requires looking at the perf code.

McCalpinJohn — Thu, 20 Feb 2020 22:41:04 GMT

"DTLB_LOAD_MISSES.DEMAND_LD_WALK_DURATION" is a name used by OProfile for Ivy Bridge, where it is listed as using Umask=0x84 (https://oprofile.sourceforge.io/docs/intel-ivybridge-events.php). This name and Umask are also used by the Intel documentation at https://download.01.org/perfmon/IVT/ivytown_core_v20.json, but only for IvyTown -- not for any other processor model.

Table 19-15 of Volume 3 of the SWDM says that Event 0x08, Umask 0x84 counts "cycles PMH is busy with a walk due to demand loads". BUT, comparing the DTLB_LOAD_MISSES (Event 0x08) encodings from Ivy Bridge (Table 19-15) and Haswell (Table 19-11) strongly suggests that the encodings for these masks have changed. Curiously, there are no sub-events that use exactly the same Umask across these two tables, but sub-events that use very similar words have very different Umask encodings. A change in encoding is often an indication that something important has changed in the definitions of the events -- so every variation of the event has to be re-tested against a carefully constructed set of microbenchmarks....

The answer to the original query ("am I spending 10% of my time in table walking?") is probably, but not definitely, "yes".

The change in wording (dropping the term "demand loads" in the "duration" sub-event) remains a concern. It should be possible to create a fairly simple set of tests that will disambiguate these issues. I would recommend measuring all documented sub-events of DTLB_LOAD_MISSES, DTLB_STORE_MISSES, and PAGE_WALKER_LOADS against a few test patterns:

Contiguous loads of an array mapped to 4KiB pages:
  - Small: fits in the 64 entries of the DTLB for 4KiB pages -- e.g., 200-240KiB
  - Medium: fits in the 1024 entries of the STLB for 4KiB pages -- e.g., 500-600KiB
  - Large: much larger than the 1024 entries of the STLB for 4KiB pages -- e.g., 40MiB (10x)
Contiguous loads of an array mapped to 2MiB pages:
  - Small: fits in the 32 entries of the DTLB for 2MiB pages -- e.g., 32MiB
  - Medium: fits in the 1024 entries of the STLB for 2MiB pages -- e.g., 512MiB (8x larger than the DTLB range)
  - Large: much larger than the 1024 entries of the STLB for 2MiB pages -- e.g., 20GiB (10x)
Repeat the above tests, but read only every other 4KiB (aligned) region:
  - Use both the original size (same number of pages in the array) and twice the original size (same number of pages actually accessed)
Repeat the above tests, but read only the first cache line from each 4KiB (aligned) region.
Repeat the above tests, but read only the first cache line from every other 4KiB (aligned) region:
  - Use both the original size (same number of pages in the array) and twice the original size (same number of pages actually accessed)
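As a rough illustration of the four access patterns above, here is a small Python sketch that generates the byte offsets to be loaded. The function names and the 64-byte cache line size are assumptions made for illustration. Python is used only to describe the patterns; the actual measurement loop would need to be written in C or assembly and run under the performance counters.

```python
PAGE_4K = 4096
CACHE_LINE = 64  # assumed cache line size in bytes

def access_offsets(array_bytes, page_stride=1, first_line_only=False):
    """Return the byte offsets to load, one offset per cache line touched.

    page_stride=1 visits every 4KiB region, page_stride=2 every other one.
    first_line_only=True loads just the first cache line of each region.
    """
    offsets = []
    for page_start in range(0, array_bytes, PAGE_4K * page_stride):
        if first_line_only:
            offsets.append(page_start)
        else:
            page_end = min(page_start + PAGE_4K, array_bytes)
            offsets.extend(range(page_start, page_end, CACHE_LINE))
    return offsets

def pages_touched(offsets):
    """Count the distinct 4KiB regions covered by a list of offsets."""
    return len({off // PAGE_4K for off in offsets})

size = 240 * 1024  # the "small" 4KiB-page case from the list above
patterns = {
    "contiguous": access_offsets(size),
    "every other page, same array": access_offsets(size, page_stride=2),
    "every other page, 2x array": access_offsets(2 * size, page_stride=2),
    "first line of each page": access_offsets(size, first_line_only=True),
}
for name, offs in patterns.items():
    print(f"{name}: {len(offs)} loads over {pages_touched(offs)} pages")
```

Comparing the "same array" and "2x array" variants separates the effect of the array footprint from the effect of the number of pages actually accessed.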

Although nothing ever works out quite as expected, one would hope that (compared to the number of pages accessed) the "small" cases would show a nearly matching count of DTLB hits, the "medium" cases would show most accesses missing in the DTLB and hitting in the STLB, and the "large" cases would show most accesses missing in both the DTLB and the STLB and causing walks. The tests using every other 4KiB page should show whether the TLB lookups created by the Next-Page-Prefetcher are included in the counts. (I expect them in PAGE_WALKER_LOADS and not in the DTLB_LOAD_MISSES event.) Reading only one cache line from each 4KiB page should minimize the probability that the next-page-prefetcher is activated, and reading only one cache line from every other 4KiB page should (fingers crossed) never cause the next-page-prefetcher to activate.
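The expected outcome for each test size can be written down as a simple lookup. The sketch below uses only the TLB entry counts quoted in the list above; the function name is hypothetical, and it assumes idealized behavior (repeated passes over the array, no associativity conflicts, no next-page-prefetcher effects), so real counts will deviate from it.

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30

# TLB entry counts as quoted in the test list above.
TLB_ENTRIES = {
    4 * KIB: {"dtlb": 64, "stlb": 1024},
    2 * MIB: {"dtlb": 32, "stlb": 1024},
}

def expected_regime(footprint_bytes, page_size):
    """Idealized guess at where most translations for this footprint are found."""
    entries = TLB_ENTRIES[page_size]
    pages = -(-footprint_bytes // page_size)  # ceiling division
    if pages <= entries["dtlb"]:
        return "DTLB hits"
    if pages <= entries["stlb"]:
        return "DTLB misses, STLB hits"
    return "page walks"

for footprint, page_size in [
    (240 * KIB, 4 * KIB), (600 * KIB, 4 * KIB), (40 * MIB, 4 * KIB),
    (32 * MIB, 2 * MIB), (512 * MIB, 2 * MIB), (20 * GIB, 2 * MIB),
]:
    print(footprint, page_size, expected_regime(footprint, page_size))
```

Any measured case that lands in a different regime than this table predicts is a candidate for closer inspection of the event definitions.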

There is not much use in using performance counter event names provided by perf -- the translations may change between kernel revisions, and may mean different things on different processors. It only takes looking up a few of these events to find errors in the events used. The location of these events in the kernel source tree also seems to move about randomly from one kernel version to the next.
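One way to sidestep the perf names is to pass the Event/Umask values from the manual directly as raw events. A minimal sketch, assuming perf's raw format of "r" followed by the hex value of (umask << 8) | event; the helper name is hypothetical, and other fields such as cmask, inv, and edge are ignored here. The encodings are the Haswell ones quoted earlier in this thread and would need to be checked against the tables for any other processor model.

```python
def raw_perf_event(event, umask):
    """Format an Event/Umask pair as a perf raw event string (rUUEE)."""
    if not (0 <= event <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event and umask must each fit in one byte")
    return f"r{(umask << 8) | event:04x}"

# Haswell encodings quoted earlier in this thread.
HASWELL_EVENTS = {
    "DTLB_LOAD_MISSES.WALK_DURATION": (0x08, 0x10),
    "DTLB_STORE_MISSES.WALK_DURATION": (0x49, 0x10),
}

raw = ",".join(raw_perf_event(e, u) for e, u in HASWELL_EVENTS.values())
# <command> is a placeholder for the program being measured.
print(f"perf stat -e cycles,instructions,{raw} -- <command>")
```

The same pair can also be written in perf's named-field form, e.g. cpu/event=0x08,umask=0x10/, which is easier to read when several fields are set.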