I have a couple of questions regarding the interaction of TLB, large pages and software prefetching:
1) As far as I understood from the documentation, for Nehalem and Sandy Bridge when using 2MB pages, there are only 32 entries available in DTLB1 and second level DTLB2 is not used. Can somebody confirm this?
2) When using 2MB pages, the cost of handling a TLB miss (the page walk) is lower on 64-bit Linux because the walk traverses 3 levels of page tables instead of 4. Essentially, a page walk requires 3 memory accesses instead of 4 to read the relevant page directory and page table entries. Does this reasoning make sense? Any ideas about the caching of page directory and page table entries?
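To make the 3-versus-4 reasoning concrete, here is a minimal sketch of how a virtual address is split into table indices. It assumes standard x86-64 4-level paging with 48-bit virtual addresses; the function name `walk_indices` and the sample address are made up for illustration.

```python
# Split an x86-64 virtual address into the indices used at each paging level.
# Assumes 4-level paging with 48-bit virtual addresses; names are illustrative.

def walk_indices(vaddr, page_size="4K"):
    """Return (levels, offset): the table indices a walk uses, top level first."""
    levels = [
        ("PML4", (vaddr >> 39) & 0x1FF),  # bits 47..39
        ("PDPT", (vaddr >> 30) & 0x1FF),  # bits 38..30
        ("PD", (vaddr >> 21) & 0x1FF),    # bits 29..21
    ]
    if page_size == "4K":
        levels.append(("PT", (vaddr >> 12) & 0x1FF))  # bits 20..12
        offset = vaddr & 0xFFF                        # 12-bit page offset
    elif page_size == "2M":
        offset = vaddr & 0x1FFFFF                     # 21-bit page offset, no PT level
    else:
        raise ValueError("page_size must be '4K' or '2M'")
    return levels, offset


vaddr = 0x00007F1234567ABC  # arbitrary example address
for size in ("4K", "2M"):
    levels, offset = walk_indices(vaddr, size)
    print(size, "page:", len(levels), "table reads, offset", hex(offset))
```

With a 2MB page the PD entry maps the page directly, so the PT level drops out and the walk touches one table fewer.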
3) Software prefetching becomes much more effective with 2MB pages in comparison to 4KB regular pages. What can be the reason for this observation? I always thought software prefetch instructions do not cross page boundaries, but apparently the "Optimization Reference Manual" page 213 says that "In Intel Core microarchitecture, software PREFETCH instructions can prefetch beyond page boundaries and can perform one-to-four page walks". But there is no information regarding the situation in Nehalem and Sandy Bridge. Any ideas about this?
I believe that the conditions under which the page table walker uses the hierarchical caching structures are unclear to everyone....
Based on a variety of performance measurements for contiguous (or nearly contiguous) accesses, it is apparent that TLB misses are sufficiently inexpensive that one must conclude that almost all levels of the hierarchical page translation are cached with very high cache hit rates.
In one specific set of tests, the time spent performing table walks was very close to 14 cycles plus the cache hit latency for the level of the cache where the Page Miss Handler found the Page Table Entry. This used the DTLB_LOAD_MISSES.WALK_DURATION counter to determine the total time required for the table walks and the PAGE_WALKER_LOADS.* performance counter events to show where the PTEs were found. In this particular set of tests most of the PTEs were found in the L1, so the test is a stronger bound on the (assumed constant) overhead than on the latency for loading the PTEs from each level of the memory hierarchy.
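The "14 cycles plus cache hit latency" observation can be written down as a small model. This is only a sketch: the per-level latencies below are placeholder values I picked for illustration, not measurements, and the model assumes one PTE load per walk. The `walks_completed` argument stands for a count of completed walks from whatever counter is available on your part.

```python
# Model: walk cost ~= fixed overhead + latency of the level where the PTE was found.
FIXED_OVERHEAD_CYCLES = 14  # from the measurement described above

# Hypothetical load-to-use latencies in cycles; placeholders, NOT measured values.
ASSUMED_LATENCY = {"L1": 4, "L2": 12, "L3": 40, "MEMORY": 200}


def expected_walk_cycles(loads_by_level, latency=ASSUMED_LATENCY):
    """Average modeled cycles per walk, weighted by where the PTEs were found.

    loads_by_level maps a level name to a PAGE_WALKER_LOADS.* style count.
    """
    total = sum(loads_by_level.values())
    if total == 0:
        raise ValueError("no page walker loads recorded")
    weighted = sum(count * latency[level] for level, count in loads_by_level.items())
    return FIXED_OVERHEAD_CYCLES + weighted / total


def measured_walk_cycles(walk_duration, walks_completed):
    """Average measured cycles per walk from a WALK_DURATION style counter."""
    if walks_completed <= 0:
        raise ValueError("walks_completed must be positive")
    return walk_duration / walks_completed
```

Comparing `measured_walk_cycles` against `expected_walk_cycles` for the same run is one way to check whether the constant-overhead assumption holds when more PTEs come from the outer cache levels.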
It would require a fair bit of analysis to develop a testing methodology to try to extract the parameters of the caching mechanisms for the upper layers of the hierarchical address translation hardware. One would probably need to start by carefully directed testing to confirm the published parameters (size & associativity) of the DTLB and STLB, then more testing to understand how the STLB is "shared" between 4KiB and 2MiB translations in the Haswell core. Once that is all in place, it might be possible to design tests to attempt to overflow the caching mechanisms for the higher levels of the address translation mechanism, monitoring the results by both timing and by DRAM accesses.
I also read the text as saying that the address translation caching structures only cache the top 3 levels of the translation. Most of the information that is in the PTEs will be serviced from TLB hits, and the remainder should have good spatial locality in the "normal" data cache hierarchy. At higher levels of the translation hierarchy, entries can still be in the "normal" caches (depending on configuration settings that are really confusing to me -- or perhaps it is just the documentation of the configuration settings that is confusing!) but the interval between accesses to adjacent entries (in the same cache line) is large enough that they are unlikely to still be in the cache(s). For example, there are 8 PTEs in a cache line, so an STLB miss will bring in a cache line containing 8 entries and put it in the L1 cache. This is precisely enough to map the 32 KiB of the L1 cache, so you should be able to access the entire L1 Data Cache before you evict the cache line containing the PTEs. This is consistent with my measurement of an L1 hit rate of 87.5% (7/8) for the PAGE_WALKER_LOADS.DTLB_L1 event.
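The arithmetic behind the 32 KiB and 87.5% figures, as a short sketch. It assumes 64-byte cache lines, 8-byte page table entries, and a contiguous sweep over 4 KiB pages, so that only the first walk per PTE cache line misses the L1.

```python
CACHE_LINE_BYTES = 64
ENTRY_BYTES = 8          # size of one x86-64 page table entry
PAGE_BYTES = 4 * 1024    # 4 KiB pages

entries_per_line = CACHE_LINE_BYTES // ENTRY_BYTES   # PTEs sharing one cache line
coverage_bytes = entries_per_line * PAGE_BYTES       # memory mapped by that line

# Contiguous sweep: the first walk per line misses the L1, the next 7 hit.
expected_l1_hit_rate = (entries_per_line - 1) / entries_per_line

print(entries_per_line, coverage_bytes // 1024, expected_l1_hit_rate)
# prints: 8 32 0.875
```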
My tests only showed a very small number of DTLB_LOAD_MISSES.PDE_CACHE_MISS events. This is not surprising given the relatively small address range I was using (so no strain on capacity) and the contiguous access patterns (which should not trigger any type of conflict misses).
We are going to start to monitor DTLB_LOAD_MISSES.WALK_DURATION on some of our production systems at TACC and if we find any significant applications that spend more than ~10% of their time in TLB walks, then I will dig into this further. My guess is that we won't find any codes that spend more than a few percent of their time in TLB walks, but I have been wrong before....
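The screening rule I have in mind is simple enough to sketch. The function names are made up for illustration, and the cycle count is assumed to come from an unhalted core cycles counter collected over the same interval as WALK_DURATION.

```python
def walk_time_fraction(walk_duration_cycles, unhalted_core_cycles):
    """Fraction of core cycles spent in page walks over one sampling interval."""
    if unhalted_core_cycles <= 0:
        raise ValueError("unhalted_core_cycles must be positive")
    return walk_duration_cycles / unhalted_core_cycles


def needs_investigation(walk_duration_cycles, unhalted_core_cycles, threshold=0.10):
    """True when walks take more than `threshold` (default ~10%) of the cycles."""
    return walk_time_fraction(walk_duration_cycles, unhalted_core_cycles) > threshold
```

Note that this only covers load walks; store-side walks would need to be added separately if they matter for a given code.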
The discussion of the meaning of the PCD and PWT bits in the various upper levels of the translation entries is confusing to me, since the text is all full of caveats about the bits having different meanings in different modes of operation.
For example, in Section 2.5 of Volume 3 of the SWDM, the text says that for CR3, the PCD and PWT bits control whether the memory reference that accesses the top level of the page table hierarchy (the PML4) is cached. The description is in Section 4.9.2, where it appears to say that:
- If CR4.PCIDE=1 (process context identifiers are enabled), then the type is taken from element 0 of the PAT Table.
- If the paging mode is IA32e (64-bit), but CR4.PCIDE is 0 (process context identifiers are not enabled), then the type is taken from element 2*PCD+PWT of the PAT Table, so it could be any of 0,1,2,3
The PAT Table is held in the IA32_PAT MSR (0x277), with the types defined in Tables 11-10, 11-11, and 11-12.
Table 11-12 says that the power-on/reset values of PAT Table entries 0,1,2,3 correspond to WB, WT, UC-, and UC types, respectively. But I noticed that Linux changes these -- on my systems (Haswell and Sandy Bridge) I see:
rdmsr -p 0 -x -0 0x277
Reading from right to left, the four entries are:
- PAT0 = 0x06 = WB (WriteBack)
- PAT1 = 0x01 = WC (Write Combining)
- PAT2 = 0x07 = UC-
- PAT3 = 0x00 = UC
So this says that PML4 entries are always loaded as cached accesses if Process Context Identifiers are enabled (as they are on my Haswell systems). However, if Process Context Identifiers are disabled (as they are on my Sandy Bridge systems), then the memory type used to access the PML4 entry can be PAT0, PAT1, PAT2, or PAT3 (that is, WB, WC, UC-, or UC, respectively), depending on the PCD and PWT bits in CR3.
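For anyone who wants to repeat this decoding on their own rdmsr output, here is a minimal sketch. The type encodings are the standard PAT memory type codes; the example value is just the four low bytes listed above packed together (I am not reproducing the upper four entries here), and the function names are made up.

```python
# PAT memory type encodings (low 3 bits of each byte of IA32_PAT).
PAT_TYPES = {0x00: "UC", 0x01: "WC", 0x04: "WT", 0x05: "WP", 0x06: "WB", 0x07: "UC-"}


def decode_pat(msr_value):
    """Return the memory types for PAT0..PAT7, reading the MSR right to left."""
    entries = []
    for i in range(8):
        code = (msr_value >> (8 * i)) & 0x07
        entries.append(PAT_TYPES.get(code, "reserved"))
    return entries


def pat_index(pcd, pwt):
    """PAT entry selected by the PCD and PWT bits when CR4.PCIDE is 0."""
    return 2 * pcd + pwt


low_half = 0x00070106  # PAT3..PAT0 = 0x00, 0x07, 0x01, 0x06 as listed above
types = decode_pat(low_half)[:4]
for pcd in (0, 1):
    for pwt in (0, 1):
        print(f"PCD={pcd} PWT={pwt} -> PAT{pat_index(pcd, pwt)} = {types[pat_index(pcd, pwt)]}")
```

Only PCD=0, PWT=0 selects WB with these settings, so the other three combinations would make the PML4 access write-combining or uncached.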
Of course these are subject to being overridden by stricter MTRR values, but since the page tables live in the main part of system memory, I will assume that this won't happen.
I don't know any easy way to get the contents of the CR3 register for a currently running process so that I can examine the PCD and PWT bits.
The same process is repeated for PDPT, PDE, and PTE entries (based on the PCD and PWT bits in the "next level up" PML4, PDPT, and PDE entries, respectively). These don't appear to be influenced by enabling Process Context Identifiers.
So I don't know what is really happening without looking at the PCD and PWT bits in CR3 (on the systems where Process Context Identifiers are disabled) and at the PCD and PWT bits in the PML4 and PDPT entries. I know that the PCD and PWT bits in the PDE entries must map to the WB memory type because the PAGE_WALKER_LOADS.* event routinely finds most of the PTEs in the caches.
Putting the higher-level page translation entries in the caches has three effects: (1) it reduces latency if you need to reload the same value before it gets flushed from the cache, (2) it reduces latency if you need to load adjacent values (in the same cache line) before the line gets flushed from the cache, and (3) it displaces a data (or instruction) cache line from the cache. I would guess that (3) is not a problem. I would also guess that (1) and (2) are not likely to happen very often for PML4 and PDPT entries if there is a non-stupid specialized cache for these entries. So we have a combination of not much upside and not much downside in most cases. There is probably some discussion of this in the Linux kernel, but I find it nearly impossible to work through all the levels of macros and configuration options to get to the actual behavior for my system.
For PDEs, I think caching makes sense -- a cache line contains 8 PDEs, each of which points to a cache line that will hold 8 PTEs, so the cache line holding PDEs helps to map 8*8*4 KiB = 256 KiB. This is large enough to be useful and small enough that you should be able to access it all before the cache line holding the PDE gets evicted from the L2 cache.