
CPark16 · 05-29-2016

Hi, I was conducting some experiments on my 4770 Haswell processor, using the Linux perf stat tool to count TLB misses and page-walker cache misses.

I got some confusing results and decided to ask this question.

(If the question belongs someplace else, please notify me, I'll move it to wherever it is appropriate)

Before I get into the details of my problem: Section 4.10.3 of SDM Vol. 3 ("Paging-Structure Caches") describes the caching structure as arranged into three levels, the PML4 cache, the PDPTE cache and the PDE cache (for the top 3 levels of the 4-level page walk).
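For readers less familiar with the 4-level walk, here is a minimal sketch of how a 48-bit linear address splits into the four table indices, and which address bits select an entry in each of the three paging-structure caches. The bit ranges follow the standard 4-level paging layout with 4 KiB pages; the function names are made up for illustration and are not from the SDM.

```python
# Illustrative sketch only: 4-level paging, 4 KiB pages, 48-bit linear addresses.
PAGE_SHIFT = 12   # 12-bit page offset
INDEX_BITS = 9    # 512 entries per paging structure
LEVELS = ("PML4", "PDPT", "PD", "PT")


def split_linear_address(addr):
    """Return the 9-bit index used at each walk level plus the page offset."""
    fields = {}
    for depth, name in enumerate(LEVELS):
        shift = PAGE_SHIFT + INDEX_BITS * (len(LEVELS) - 1 - depth)
        fields[name] = (addr >> shift) & 0x1FF
    fields["offset"] = addr & 0xFFF
    return fields


def walker_cache_tags(addr):
    """Linear-address bits that select an entry in each paging-structure cache."""
    return {
        "PML4 cache": (addr >> 39) & 0x1FF,       # bits 47:39
        "PDPTE cache": (addr >> 30) & 0x3FFFF,    # bits 47:30
        "PDE cache": (addr >> 21) & 0x7FFFFFF,    # bits 47:21
    }
```

A hit in the PDE cache lets the walker skip straight to the page-table (PT) level, so only one memory access remains; a hit only in the PML4 cache leaves three.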

When I counted the following two performance counter events, I got an interesting result:

DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK
DTLB_LOAD_MISSES.PDE_CACHE_MISS

The first event counts

Misses in all TLB levels that cause a page walk of any page size.

Thus it counts all misses in the L2 TLB that cause a HW page walk, which will check the paging-structure caches (the PML4, PDPTE and PDE caches).

The second event counts

DTLB demand load misses with low part of linear-to-physical address translation missed

However, to my surprise, for some workloads the PDE_CACHE_MISS count is up to 1.7x the MISS_CAUSES_A_WALK count.

Now my question: Does PDE_CACHE_MISS count misses in the PDPTE & PML4 caches as well?

If the answer is yes, that would explain why I'm seeing more PDE cache misses than the number of times the cache was supposedly accessed, as counted by MISS_CAUSES_A_WALK.
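For anyone who wants to reproduce the comparison, here is a rough sketch of how the ratio can be computed from `perf stat -x,` output (CSV mode, where the first field is the count and the third is the event name). It assumes perf on your machine exposes the two events under the lowercase symbolic names used below; the helper names are made up, and the numbers in the usage note are invented purely to mirror the 1.7x figure, not measured.

```python
# Sketch: compute PDE_CACHE_MISS / MISS_CAUSES_A_WALK from `perf stat -x,` text.
# The event names are an assumption about how perf names them on this CPU.
WALK_EVENT = "dtlb_load_misses.miss_causes_a_walk"
PDE_EVENT = "dtlb_load_misses.pde_cache_miss"


def parse_perf_csv(text):
    """Map event name -> count, skipping rows without a numeric count."""
    counts = {}
    for line in text.splitlines():
        parts = line.strip().split(",")
        if len(parts) < 3 or not parts[0].isdigit():
            continue  # blank lines, comments, "<not counted>" rows
        counts[parts[2].lower()] = int(parts[0])
    return counts


def pde_miss_ratio(counts):
    """Return PDE cache misses per page walk (NaN if no walks were counted)."""
    walks = counts[WALK_EVENT]
    return counts[PDE_EVENT] / walks if walks else float("nan")
```

With made-up input such as `1000,,dtlb_load_misses.miss_causes_a_walk,...` and `1700,,dtlb_load_misses.pde_cache_miss,...`, `pde_miss_ratio(parse_perf_csv(text))` gives 1.7, i.e. more PDE cache misses than walks.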

Thanks,
CHP

McCalpinJohn · 06-01-2016

This is a great question, but I don't know of anyone who has developed a reliable methodology to test this particular question. We are not aware of any major workloads on our (TACC) systems that are significantly slowed down by TLB misses, so it has not been a priority for me....

I have had a lot of trouble working through all the possible system configuration options that are related to this issue -- the structure of the documentation tends to combine descriptions of all of the available options in the same section (e.g., Process Context Identifiers either enabled or disabled in Section 4.5 of Volume 3 of the SW Developer's Guide). I find it very hard to stay focused on the parts describing the particular configuration that I have (and sometimes it is not easy to determine if a feature like PCIDs is enabled -- the hardware may support it, but that does not mean that the OS is configured to enable that feature).

All of this makes it very hard to determine from the documentation whether PML4 entries, PDPT entries, and PDE entries can be cached in the unified system memory cache structures. At one point I knew how AMD systems worked, but I don't know whether that was from public or proprietary documentation....

From a high-level perspective, I would assume that there is little reason to put PML4 entries in the caches -- a code probably only uses 1 or 2 of these, so the spatial locality of getting 8 in a single cache line is not likely to be helpful. At the other end, PDE entries should probably go into the regular caches because a lot of these may be used and the spatial locality may help. In the middle, it is not obvious whether putting PDPT entries in the regular caches makes sense. Although a code may use more than a few of these, you will typically traverse a whole bunch of other addresses before needing the next PDPT entry, so the cache line containing that entry will likely have been flushed from all of the caches. (On the other hand, you aren't likely to cache very many lines containing PDPT entries, so they can't displace very much of your other data.)
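To put numbers on the "8 in a single cache line" argument, here is a small back-of-the-envelope sketch of how much linear address space one 64-byte line of 8-byte entries covers at each level. It assumes 4-level paging with 4 KiB pages; the names are illustrative only.

```python
# Sketch: address range covered by one cache line of paging-structure entries.
CACHE_LINE_BYTES = 64
ENTRY_BYTES = 8
ENTRIES_PER_LINE = CACHE_LINE_BYTES // ENTRY_BYTES  # 8 entries per line

# Linear address range mapped by a single entry at each level.
SPAN_PER_ENTRY = {
    "PTE": 1 << 12,     # 4 KiB
    "PDE": 1 << 21,     # 2 MiB
    "PDPTE": 1 << 30,   # 1 GiB
    "PML4E": 1 << 39,   # 512 GiB
}


def line_coverage():
    """Bytes of linear address space covered by one cache line, per level."""
    return {name: span * ENTRIES_PER_LINE for name, span in SPAN_PER_ENTRY.items()}
```

So one line of PDEs covers 16 MiB, one line of PDPT entries covers 8 GiB, and one line of PML4 entries covers 4 TiB, which is why only the lower levels are likely to see the neighbouring entries in a line reused before it is evicted.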

One good thing about this event (DTLB_LOAD_MISSES.*) is that it appears to only count TLB misses caused by demand loads, and not TLB misses caused by the Next-Page-Prefetcher. This is good because I don't know of any way to disable the Next-Page-Prefetcher, and it is not obvious what causes it to be triggered.... The long sequence of posts at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/593830 discusses which counters are incremented only by demand TLB misses and which are incremented by both demand and Next-Page-Prefetch TLB misses.

It is probably possible to build a microbenchmark to test the address translation caches -- an excellent example of a methodology to experimentally determine these properties is http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.158.5585&rep=rep1&type=pdf
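As a rough sketch of how such a microbenchmark might choose its access pattern (this is not the method from the linked paper, just an illustration), the helper below counts how many distinct entries a strided sequence of addresses touches at each paging level. It only plans the pattern; the actual timed accesses would need to be written in a lower-level language. It assumes 4-level paging with 4 KiB pages, and the function name is hypothetical.

```python
# Sketch: count distinct paging-structure entries touched by a strided pattern.
LEVEL_SHIFT = {"PML4E": 39, "PDPTE": 30, "PDE": 21, "PTE": 12}


def distinct_entries(base, stride, count):
    """For addresses base + i*stride, return the number of distinct entries per level."""
    touched = {name: set() for name in LEVEL_SHIFT}
    for i in range(count):
        addr = base + i * stride
        for name, shift in LEVEL_SHIFT.items():
            # Two addresses share an entry at this level iff these upper bits match.
            touched[name].add(addr >> shift)
    return {name: len(entries) for name, entries in touched.items()}
```

For example, `distinct_entries(0, 1 << 21, 1024)` touches 1024 different PDEs but only 2 PDPT entries and 1 PML4 entry, so varying `count` at a 2 MiB stride would mainly stress whatever structure caches PDEs, while a 1 GiB stride would shift the pressure up one level.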

CPark16 · 06-01-2016

John, Thank you for the reply (and the interest ;)

The PDE caching structures (and the "Paging_structure Caches") I was referring to are the page table walker caches: the caches expected to sit inside the MMU (not the general caches), with separate in-MMU structures for PDE, PDPTE and PML4 entries.
I guess my question was ambiguous. (My bad!)

Anyway, my question is: does the performance counter "DTLB_LOAD_MISSES.PDE_CACHE_MISS" count only misses in the lowest-level (PDE) cache of the MMU page walker caches, or does it also count misses in the higher-level PML4 & PDPTE caches (also in the MMU)?

BTW, I found that Broadwell provides performance counters that count the number of hits of a page table walker in each level of the cache hierarchy. (So we have a counter that counts the number of page walk memory accesses that hit in the L1, L2, or L3 caches).
I don't have a Broadwell machine to check them out, but they seem relevant to what you were pointing out in the second paragraph of your answer :)

Also thank you for the pointer to the microbenchmark! I'll make sure I check it out!

McCalpinJohn · 06-01-2016

The performance counter event that counts where TLB walks find the data is the PAGE_WALKER_LOADS.* event. It increments for both TLB walks due to demand load misses and for TLB walks due to the Next-Page-Prefetcher. Since we don't know what circumstances will cause the Next-Page-Prefetcher to be activated, this makes it a little trickier to use....

Does the PDE_CACHE_MISS performance counter count misses in the PDPTE & PML4 cache as well?