This is a great question, but I don't know of anyone who has developed a reliable methodology to test it. We are not aware of any major workloads on our (TACC) systems that are significantly slowed down by TLB misses, so it has not been a priority for me....
I have had a lot of trouble working through all the possible system configuration options that are related to this issue -- the documentation tends to combine descriptions of all of the available options in the same section (e.g., Process Context Identifiers either enabled or disabled in Section 4.5 of Volume 3 of the SW Developer's Guide). I find it very hard to stay focused on the parts describing the particular configuration that I have (and sometimes it is not easy to determine whether a feature like PCIDs is enabled -- the hardware may support it, but that does not mean that the OS is configured to enable it). All of this makes it very hard to determine from the documentation whether PML4 entries, PDPT entries, and PDE entries can be cached in the unified system memory cache structures. At one point I knew how AMD systems worked, but I don't know whether that was from public or proprietary documentation....
From a high-level perspective, I would assume that there is little reason to put PML4 entries in the caches -- a code probably only uses 1 or 2 of these, so the spatial locality of getting 8 in a single cache line is not likely to be helpful. At the other end, PDE entries should probably go into the regular caches because a lot of these may be used and the spatial locality may help. In the middle, it is not obvious whether putting PDPT entries in the regular caches makes sense. Although a code may use more than a few of these, it will typically traverse a large number of other addresses before needing the next PDPT entry, so the cache line containing that entry will likely have been evicted from all of the caches by then. (On the other hand, you aren't likely to cache very many lines containing PDPT entries, so they can't displace very much of your other data.)
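A minimal sketch of the arithmetic behind this spatial-locality argument, assuming standard 4-level x86-64 paging with 4 KiB pages, 8-byte paging-structure entries, and 64-byte cache lines. The function names (`split_va`, `line_coverage_bytes`) are made up for the illustration; nothing here says anything about what the hardware actually caches.

```python
# Assumptions: 4-level x86-64 paging, 4 KiB pages, 8-byte entries,
# 64-byte cache lines. Names are illustrative only.
LEVELS = (("PML4", 39), ("PDPT", 30), ("PD", 21), ("PT", 12))
ENTRY_BYTES = 8
LINE_BYTES = 64


def split_va(va):
    """Split a virtual address into its four 9-bit table indices
    plus the 12-bit offset within the 4 KiB page."""
    fields = {name: (va >> shift) & 0x1FF for name, shift in LEVELS}
    fields["offset"] = va & 0xFFF
    return fields


def line_coverage_bytes(shift):
    """Bytes of virtual address space mapped by the entries that
    share one cache line at the level whose index starts at `shift`."""
    entries_per_line = LINE_BYTES // ENTRY_BYTES  # 8 entries per line
    return entries_per_line << shift


if __name__ == "__main__":
    for name, shift in LEVELS:
        print(f"{name}: one cache line of entries maps "
              f"{line_coverage_bytes(shift)} bytes")
```

Under these assumptions, one cache line of page-table entries maps 32 KiB, one line of PDEs maps 16 MiB, one line of PDPT entries maps 8 GiB, and one line of PML4 entries maps 4 TiB. That is why neighbouring entries in a line are likely to be reused at the lower levels and unlikely to be reused at the PML4 level.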
One good thing about this event (DTLB_LOAD_MISSES.*) is that it appears to count only TLB misses caused by demand loads, and not TLB misses caused by the Next-Page-Prefetcher. This is good because I don't know of any way to disable the Next-Page-Prefetcher, and it is not obvious what causes it to be triggered.... The long sequence of posts at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring... discusses which counters are incremented only by demand TLB misses and which are incremented by both demand and Next-Page-Prefetch TLB misses.
It is probably possible to build a microbenchmark to test the address translation caches -- an excellent example of a methodology to experimentally determine these properties is http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.158.5585&rep=rep1&type=pdf
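As a rough illustration of how such a microbenchmark might lay out its accesses (this is not the method from the linked paper, just one common pointer-chasing approach), the sketch below builds the access pattern only. It assumes 4 KiB pages and a buffer aligned to the region size; the helper names are hypothetical. With `region_bytes = 1 << 21`, every access needs a different PDE; with `1 << 30`, a different PDPT entry. The timed chase loop itself would have to be written in C or assembly and run with the counters enabled.

```python
import random


def region_offsets(n_regions, region_bytes, line_bytes=64, page_bytes=4096):
    """One byte offset per region. The in-page offset is staggered by
    one cache line per region so the touched lines do not all map to
    the same cache set."""
    return [i * region_bytes + (i * line_bytes) % page_bytes
            for i in range(n_regions)]


def single_cycle(n, seed=0):
    """Sattolo's algorithm: a random permutation of range(n) that forms
    one cycle, so chasing next[i] visits every region before repeating
    and the order is hard for a hardware prefetcher to follow."""
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def cycle_length(nxt, start=0):
    """Number of steps needed to return to `start` when following nxt."""
    i, steps = nxt[start], 1
    while i != start:
        i = nxt[i]
        steps += 1
    return steps


if __name__ == "__main__":
    offsets = region_offsets(64, 1 << 21)  # 64 distinct 2 MiB regions
    order = single_cycle(len(offsets), seed=1)
    print(cycle_length(order), "regions visited per pass")
```

The idea would be to sweep the number of regions and watch where the latency per access (or a counter such as DTLB_LOAD_MISSES.PDE_CACHE_MISS) steps up, which would suggest the capacity of the structure being exercised. Whether the steps are clean enough to interpret is exactly the open question here.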
John, Thank you for the reply (and the interest ;)
The PDE caching structures (and the "Paging-Structure Caches") I was referring to are the page-table-walker caches.
That is, the caches that are expected to be inside the MMU (not the general-purpose caches): separate in-MMU structures that cache PDE, PDPTE, and PML4 entries.
I guess my question was ambiguous. (My bad!)
Anyway, my question is: does the performance counter event "DTLB_LOAD_MISSES.PDE_CACHE_MISS" count only misses in the lowest-level PDE cache of the (MMU) page-walker caches, or does it also count misses in the higher-level PML4 and PDPTE caches (which are in the MMU as well)?
BTW, I found that Broadwell provides performance counters that count the number of hits of a page table walker in each level of the cache hierarchy. (So we have a counter that counts the number of page walk memory accesses that hit in the L1, L2, or L3 caches).
I don't have a Broadwell machine to check them out, but they seem relevant to what you were pointing out in the second paragraph of your answer :)
Also thank you for the pointer to the microbenchmark! I'll make sure I check it out!
The performance counter event that counts where TLB walks find the data is the PAGE_WALKER_LOADS.* event. It increments for both TLB walks due to demand load misses and for TLB walks due to the Next-Page-Prefetcher. Since we don't know what circumstances will cause the Next-Page-Prefetcher to be activated, this makes it a little trickier to use....