I'm measuring the number of TLB misses for a simple microbenchmark program, and I noticed that for large memory segments backed by 4 KiB pages, the counts for DTLB_LOAD_MISSES.WALK_COMPLETED are larger than I would expect. In fact, the count is sometimes more than double what should occur...but this only happens when I increase the segment size beyond 16 GiB.
I noticed that I could also use MEM_UOPS_RETIRED.STLB_MISS_LOADS to measure something similar. What I noticed is that the two counters are identical up until the segment size reaches 16 GiB. From that point the two diverge; here's the data:
size | walk_completed / stlb_miss
Essentially, it's growing linearly as I increase the size of the segment. I've tried disabling HT and HW prefetching in the BIOS with no luck.
My main questions are: (1) What is the difference between those two counters? And (2) what (besides the explicit instructions or mem-uops from my code) could cause an extra TLB miss/walk?
I'm using a Haswell E5-2699 and reading counters with perf using the :u modifier. Thanks for any help!
Have you tried looking at the absolute numbers of TLB misses for simple (e.g., contiguous) access patterns as a function of array size? Given the very large array sizes, I would expect contiguous accesses to flush the DTLB and STLB, leading to one DTLB miss per 4 KiB, with close to 100% STLB misses.
The Umask values for the Haswell are a little weird -- for Event 0x08, Umask 0x02 is WALK_COMPLETED_4K, Umask 0x04 is WALK_COMPLETED_2M_4M, so I would have guessed that the Umask for all WALK_COMPLETED events would be 0x06. Instead it is 0x0E, which implicitly includes Umask 0x08, which is not defined. There does not appear to be much help in looking at the definitions in similar processors -- the Umasks for this event seem to be different for each processor model.
I am having some trouble getting my test code to compile with these large array sizes, so I can't test this tonight. Maybe tomorrow...
A few smaller tests provide some interesting results....
First, I should note that TLB behavior on Haswell may be entirely different from TLB behavior on Sandy Bridge. My Sandy Bridge systems support "Process Context Identifiers" (PCIDs), but the version of Linux that we are running (CentOS 6.5, kernel 2.6.32-431) does not use them. On the other hand, my Haswell system supports both PCIDs and the INVPCID instruction, and the version of Linux that we are running (CentOS 6.6, kernel 2.6.32-504) does enable them. As discussed in Section 4.10 of Volume 3 of the Intel Software Developer's Manual, the use of PCIDs completely changes the way Page Table Entries (and other page translation structures) are cached. It is easy to imagine that these changes could result in broken performance counter events, or correct events that don't correspond to what the hardware is doing in the way that we expect.
So with that caveat, on to some results...
I have a simple code that repeatedly sums up the elements of an array of doubles, with TSC and performance counter reads around each iteration.
On a Xeon E5-2660 v3 system, I set up the performance counters to measure 8 events:
Counter 0: 0x00430108 DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK -- any page size
Counter 1: 0x00430208 DTLB_LOAD_MISSES.WALK_COMPLETED_4K -- completed walks only
Counter 2: 0x00430408 DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M -- completed walks only
Counter 3: 0x00430E08 DTLB_LOAD_MISSES.WALK_COMPLETED -- completed walks only
Counter 4: 0x00432008 DTLB_LOAD_MISSES.STLB_HIT_4K -- no page walk
Counter 5: 0x00434008 DTLB_LOAD_MISSES.STLB_HIT_2M -- no page walk
Counter 6: 0x00436008 DTLB_LOAD_MISSES.STLB_HIT -- no page walk
Counter 7: 0x00438008 DTLB_LOAD_MISSES.PDE_CACHE_MISS -- should be interesting!
LARGE PAGE TESTS:
For the first set of tests, the main array is mapped on 2MiB large pages (using mmap() with the MAP_HUGETLB option). According to the Optimization Reference Manual, the Haswell DTLB has 32 entries for large pages, so it should be able to map 64 MiB.
For an array size of 8388608 elements (64 MiB), I get zero counts for all 8 events above in most iterations. This is good, because the DTLB should be able to map the entire array.
For array sizes of 128 MiB, 256 MiB, and 512 MiB, I get zero counts for most of the events, but events 5 & 6 average 10-11 increments per iteration -- independent of array size.
This is a bit odd.
Counter 5 measures DTLB miss with STLB hit for 2 MiB pages. The STLB is reported to hold 1024 entries and is "shared by 4KB and 2/4MB pages". So unlike previous systems, Haswell can keep 2MiB page translations in the STLB. It is not clear whether it can hold 1024 2MiB page entries, but my largest test only needs 256 entries, so I expect hits in the STLB.
The problem is with the number of STLB hits. The 10-11 increments should only correspond to loading 20-22 MiB, while my code loads up to 512 MiB in each iteration. I would have expected to see 64, 128, and 256 STLB hits for the three larger tests.
The observation that the number of DTLB misses is not growing suggests that either the hardware counter event is significantly undercounting, or the hardware has a TLB prefetching mechanism that eliminates almost all of the demand misses:
SMALL PAGE TESTS: (((( RESULTS BOGUS --- CORRECTED IN A LATER NOTE ))))
Re-running the tests on small pages makes the results even stranger.
Most of the counts are still zero, which is very strange, since the STLB should only be able to map 4MiB with 4KiB pages, and I am accessing up to 128 times that much memory in each iteration.
Every case has 10-13 increments in Counters 4 and 6 (DTLB_LOAD_MISSES.STLB_HIT_4K and DTLB_LOAD_MISSES.STLB_HIT). This includes the 64 MiB case (which had no DTLB misses in that large page case).
The expected number of DTLB misses for these four cases are 16k, 32k, 64k, and 128k -- not 10-13.
Again this suggests that either the hardware counter event has a significant undercounting problem, or the hardware has an extraordinarily efficient TLB prefetching mechanism.
Firstly, thanks for looking into this. :)
Those results are quite different than what we're currently seeing...we have no trouble with seeing a large number of TLB misses once we begin to overwhelm the STLB.
I just took your advice and rewrote our test program to sequentially access the array, and the DTLB_LOAD_MISSES.WALK_COMPLETED counts are far closer to what I would expect -- just about one for every "op" (a load/store of the array). Though they still diverge a bit from the MEM_UOPS_RETIRED.STLB_MISS_LOADS counter, by +14% in the worst case (256 GB array size).
The problem is I need a random access pattern for my experiment. :) Previously I had allocated a single 2 MiB page for the random array and filled it with rand(). The array index I will load/store is computed using two of these random numbers...this helps ensure I hit all elements in the array. Long story short, I do two reads of the rand array every time I touch the main memory segment. BUT...it shouldn't matter w.r.t. the TLB, because the rand array is backed by a 2 MiB page and the main array is on 4 KiB pages...so the rand array's TLB entry should always be sitting in the dedicated 2 MiB TLB. I confirmed this by looking at the DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M counter and there are almost no misses there. So the rand array is NOT the culprit. :)
Any other ideas?
Oops -- the "small page" results above are bogus -- I forgot to disable transparent hugepages on this system.
After fixing this, I re-ran the tests with the data allocated without the MAP_HUGETLB option. I still get mostly zero counts from the DTLB_LOAD_MISSES.* events and very small counts from the MEM_UOPS_RETIRED.STLB_MISS_LOADS counter (about 1/600th of the expected value), but I do get excellent results from the PAGE_WALKER_LOADS.DTLB_* events.
The events that give good numbers in this test are:
# Counter 4: 0x004311bc PAGE_WALKER_LOADS.DTLB_L1 -- number of DTLB page walker loads that hit in the L1+FB
# Counter 5: 0x004312bc PAGE_WALKER_LOADS.DTLB_L2 -- number of DTLB page walker loads that hit in the L2
# Counter 6: 0x004314bc PAGE_WALKER_LOADS.DTLB_L3 -- number of DTLB page walker loads that hit in the L3
# Counter 7: 0x004318bc PAGE_WALKER_LOADS.DTLB_MEMORY -- number of DTLB page walker loads that hit in Memory
The sum of the four values is between 0.1% and 0.3% higher than the expected value of one DTLB walk per 4KiB page loaded.
The first event dominates, with a very steady 87.3% hit rate --- almost exactly matching the 87.5% hit rate expected for contiguous Page Table Entries that are packed 8 to a cache line (and not displaced before they are used).
The remaining table walks are spread across the L2 (4.1%), L3 (7.3%), and Memory (1.3%).
These ratios apply to all four data set sizes I picked -- 64 MiB, 128 MiB, 192 MiB, and 256 MiB. (In my previous note I listed the wrong values for the 3 larger cases).
So I am very happy with the PAGE_WALKER_LOADS events (Event 0xBC), at least for small pages. Over the weekend I will probably track down the bug in my code that is preventing me from running the large page tests with sizes of 2GiB or larger, and then I will be able to test these events with large pages.
Can you check on your system and see if /proc/cpuinfo contains the "pcid" attribute in the "flags" line?
Just to complete my review of the Event 0x08 DTLB_LOAD_MISSES events, I re-ran my small page tests and captured the DTLB_LOAD_MISSES.WALK_DURATION counts. Like the other DTLB_LOAD_MISSES counts, the results here are far too low.
As an example, the 64 MiB case showed very close to the expected 16384 TLB misses with the sum of the four PAGE_WALKER_LOADS.* events, but the corresponding DTLB_LOAD_MISSES.WALK_DURATION counts were only about 1000. Obviously you can't service 16,000 TLB misses in 1000 cycles. Adding the approximate latencies for each cache/memory level from the PAGE_WALKER_LOADS.* events gives a lower bound estimate of about 160,000 cycles for the total page walk duration. This value does not include any page walker overhead, so the true value is probably quite a bit higher than 160,000 cycles for this test.
These tests are all long enough to include multiple Linux kernel timer interrupts, so I measured the DTLB_LOAD_MISSES events MISS_CAUSES_A_WALK, WALK_COMPLETED, STLB_HIT, and PDE_CACHE_MISS separately for user and kernel space (setting bits 16 and 17 of the PMC control registers independently). For the N=32Mi (256 MiB) case, only 101 of the 2420 measurements (4.2%) showed any kernel-space activity (even though all of the cases ran for at least 24 milliseconds), so the tiny numbers of events that are reported are all user-space activity. Average counts for the 256 MiB case were:
The WALK_COMPLETED counts are about 1000x smaller than expected (and 1000x smaller than the sum of the PAGE_WALKER_LOADS.* counts).
It might be interesting to boot my system with the Process Context Identifiers disabled and see if that makes any difference to the TLB counts. Not sure that I will have time to do that this week....
If I had to guess, what you are seeing is the result of the NPP (next-page prefetcher). The details on this prefetcher are very murky -- I've asked about it before without much luck, and searches don't turn up much. It would explain what you are seeing, though, at least for a particular NPP implementation.
For example, does the NPP just prefetch the data into L1/L2/L3 if the next page is already mapped, or can it also trigger a page walk? My guess is the latter; otherwise the NPP would be quite useless -- streaming workflows would simply bottleneck behind TLB misses rather than conventional D$ misses.
So if the NPP triggers page walks, your results make sense. Your workflow is streaming, so the next page is predictable. The NPP triggers the page walk ahead of the actual first load of the page (as good prefetchers do) -- so you don't get many DTLB_LOAD miss-type events. You get a few STLB hits when you exceed the TLB capacity, because prefetching isn't perfect and/or you have some startup behavior before the NPP gets rolling (the latter being more appealing, since your STLB hits stayed constant even with larger working sets).
So that kind of explains the counters that track demand TLB misses. The PAGE_WALKER_LOADS counters, however, aren't tied to demand loads, so it makes sense that they track all page walker activity regardless of whether it's triggered on demand or by the NPP. So you get the full expected counts there.
You could perhaps validate this by defeating the prefetches (perhaps turning off one or more of the known D$ prefetchers does this) or with another test that doesn't have a predictable read pattern.
I had already seen that disabling the four documented hardware prefetch engines does not change the behavior of the DTLB_LOAD_MISSES.* (Event 0x08) counts, but I forgot about the Next-Page-Prefetcher in this context.
I had originally assumed that the NPP would not cause table walks, but if it is capable of table walks then it certainly could account for the observed behavior. Since there is no documentation on disabling this function, I will have to try permuting the addresses to see if I can achieve consistency between the DTLB_LOAD_MISSES counts and the PAGE_WALKER_LOADS events.
I modified my code so that it would load all of the even-numbered pages on the first pass and all of the odd-numbered pages on the second pass. E.g., for the 64 MiB case, it loads pages 0, 2, 4, ..., 16382, then 1, 3, 5, ..., 16383. So if the next page prefetcher brings in the TLB entry for the page after the current page, it won't get used -- I will access 8192 different pages before I get around to that adjacent page.
This change to the code brought the DTLB_LOAD_MISSES.WALK_COMPLETED and MEM_UOPS_RETIRED.STLB_MISS_LOADS up to the expected level of 1 event per 4KiB loaded. This strongly suggests that it was a hardware prefetch mechanism that was causing the TLB entries to be pre-loaded before the code actually tried to load the page for the first time.
Even more interesting: the sum of the PAGE_WALKER_LOADS.* events is doubled in runs with the 2-pass code. This suggests that it is indeed the Next Page Prefetcher that is causing the TLB prefetches. In the original code each prefetched TLB entry was used very quickly, while in this modified code the prefetched TLB entry is not used for a long time, so it is flushed from the TLB and caches by the time the page is actually accessed (in the next even/odd pass) and must be re-loaded.
The DTLB_LOAD_MISSES.WALK_DURATION counter shows good agreement with the expected walk duration if I make 2 assumptions:
Assuming latencies of 4/12/40/230 cycles for L1/L2/L3/Memory, I get good agreement between the reported page walk duration and my model if I assume a fixed overhead of 14 cycles. This seems consistent with the page table walker overheads that I have looked at in the past.
So my measurements all make sense now.
Unfortunately I never saw more than 1% difference between DTLB_LOAD_MISSES.WALK_COMPLETED and MEM_UOPS_RETIRED.STLB_MISS_LOADS, so I have not gained any insight into the source of the discrepancy that started this thread....
Well, even though you didn't find the source of the discrepancy, you can console yourself with the fact that you've already uncovered more information about the NPP than I've seen in one place before. As I recall, the NPP pretty much gets a one-line, information-free mention in the latest optimization manuals, so this is already very useful.
At a minimum, based on your tests, it seems like we can speculate that:
I'm not surprised the NPP triggers a page walk. If it did not, the idea of the NPP would be very limited -- it would work only in the case that the page data itself is not in the D-cache (whatever level the NPP fetches to), but the TLB entries exist (at least in the STLB). Given the fairly limited coverage the TLB provides for 4k pages, that is a fairly restricted scenario, and in particular any big (10+ MB) streaming workflow would fail completely -- and that's where the NPP is most useful.
I think the NPP serves, possibly in concert with the "fast page start" prefetch stuff, to largely mitigate the penalty of 4k pages w.r.t. continually crossing into the next 4k page while streaming. It's not clear to me if fast page start is distinct from the NPP or an alternate implementation, etc.
The DTLB_LOAD_MISSES.WALK_COMPLETED event probably does only count the times that the page table walker is able to complete a page walk without raising an exception.
It is less clear how the DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK event counts in cases where an exception is raised and the offending instruction is restarted after the OS fixes whatever restartable exception caused the fault (e.g., instantiating a new page or loading a swapped page from disk).
It should be easy enough to measure the various TLB and page table walker events in user and system mode for a tight loop that only touches new pages. This should help clarify at least one of the exceptional cases.
John, the main difference between COMPLETED and CAUSES_A_WALK is that CAUSES_A_WALK counts all walks, even those from instructions that don't retire. If NPP walks are not counted in COMPLETED, I wouldn't be surprised to see them show up in the CAUSES_A_WALK counter.
Good point -- if a speculative load starts a table walk, then the table walk should be cancelled if the load is cancelled.
The event DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK does not appear to count walks that are due to the Next Page Prefetcher. This is consistent with the event counting only DTLB misses that are triggered by load instructions. The PAGE_WALKER_LOADS counters appear to count table walks due to both demand activity and the next-page prefetcher.
Just to be sure I've got it: a page table walk started by a speculative load (i.e., speculative execution of a load operation due, for instance, to the branch prediction implemented by the processor front-end) does not increment the DTLB_LOAD_MISSES.WALK_COMPLETED counter if the load does not retire (due, for instance, to a branch misprediction)?
Yes, that is how I interpreted Tim M's comment. I have not tested the hypothesis, but it seems like a reasonable behavior for the chip and for the performance counter event.