Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Caching H/W of Address Translation Structures on Intel64 Architectures

drMikeT
New Contributor I
Hello,

I am referring to Xeon processors (Nehalem, Westmere, Sandy Bridge) operating in Intel 64 "IA-32e" mode with paging enabled (full 64-bit support; see http://download.intel.com/products/processor/manual/325384.pdf, Vol. 3A). Data in the memory-management data structures (p. 2-8, Vol. 3A) used by the linear-to-physical address translation mechanism (p. 4-28, Vol. 3A) can be cached by actual hardware. Section 4.10, "CACHING TRANSLATION INFORMATION", states: "A processor may cache information from the paging-structure entries in TLBs and paging-structure caches".

The concept of TLB hardware caching page-table entries, discussed in subsection 4.10.2.2, is well known, and documentation elsewhere clearly describes the TLB structures for each micro-architecture.

For the "Paging-Structure Caches" of Section 4.10.3, the manual mentions that "A processor may support any or all the following paging-structure caches: PML4, PDPTE and PDE....".
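For reference, the three caches named there correspond to the upper three levels of the 4-level IA-32e page walk. The following is a minimal Python sketch, assuming 4 KiB pages and 48-bit linear addresses, of which address bits select the entry at each level. The function name `split_linear_address` and the sample address are made up for illustration.

```python
def split_linear_address(addr):
    """Split a 48-bit linear address into 4-level paging indices (4 KiB pages)."""
    return {
        "pml4": (addr >> 39) & 0x1FF,  # bits 47:39 select the PML4 entry
        "pdpt": (addr >> 30) & 0x1FF,  # bits 38:30 select the PDPT entry
        "pd": (addr >> 21) & 0x1FF,    # bits 29:21 select the page-directory entry
        "pt": (addr >> 12) & 0x1FF,    # bits 20:12 select the page-table entry
        "offset": addr & 0xFFF,        # bits 11:0 are the byte offset in the page
    }

# Illustrative address built from known indices so the split is easy to check.
sample = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
indices = split_linear_address(sample)
```

A PML4-cache entry would be tagged by the `pml4` bits alone, a PDPTE-cache entry by `pml4` and `pdpt` together, and a PDE-cache entry by all three upper indices, following the description in Section 4.10.3.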

Do any of these Xeons (Nehalem, Westmere, Sandy Bridge) implement any paging-structure cache hardware? I have NOT been able to find any reference to such hardware existing on any of these processors.

Should I assume that this is a feature that is permitted by the ISA spec but has NOT been implemented by any of these processors?

Otherwise, can I find any more specific information about this H/W per processor?

I would appreciate any information or pointer to it ....

thanks
Michael

Hussam_Mousa__Intel_
New Contributor II
Hello Michael,

Section 4.10.3, which you referenced above, later includes the following paragraph:

A processor may or may not implement any of the paging-structure caches. Software should rely on neither their presence nor their absence. The processor may invalidate entries in these caches at any time. Because the processor may create the cache entries at the time of translation and not update them following subsequent modifications to the paging structures in memory, software should take care to invalidate the cache entries appropriately when causing such modifications. The invalidation of TLBs and the paging-structure caches is described in Section 4.10.4.


The SDM is intended for software developers, so it is phrased to help them write safe and portable code and to avoid making assumptions about hardware that may not hold for other processor designs, whether in the same family or a different one.

What is the problem you are trying to resolve with this information? Perhaps I can find out or refer you to more useful information or sources.

-Hussam
drMikeT
New Contributor I
Hi Hussam,

As background, I have been investigating the memory-hierarchy performance of recent Xeons (Nehalem and later), and I was wondering whether page-table entries end up in the cache after a page walk on an address-translation miss. Given the potentially enormous size of the page-table hierarchy, in a high-miss-rate scenario data from page-table entries may end up evicting application data from the cache. It is the statement you quoted that prompted me to ask whether other non-TLB hardware is also available to relieve the pressure on the caches.

Thanks for the reply,
Michael
cagribal
Beginner
Hi Michael, did you find an answer to your question? I am also curious about whether page-directory or page-table entries end up in the caches. There is some information about paging-structure caches, but it is not clear how much can be cached there, and there are no specific details for particular processors such as Nehalem or Sandy Bridge. Thanks for any update!
drMikeT
New Contributor I
Hi cagribal, unfortunately I have not found anything relevant to this yet. There are pros and cons to always installing page-table entries in the cache, never doing so, or taking a hybrid approach. I will post updates as soon as I run into something relevant.

Mike
drMikeT
New Contributor I
I have to accept that this must be a heavily guarded secret .... :_)
Vipul_S_
Beginner

Hello, I am also curious about this. Can someone please confirm?

McCalpinJohn
Honored Contributor III

It looks like the only way to figure this out for Intel processors is very carefully designed microbenchmark testing. The discussion in section 4.10.3 of Volume 3 of the SDM provides a reasonably clear explanation of the way that these caches (if they exist) are used in the page translation process.

It will not be easy to build a microbenchmark for testing these structures, but it should be possible. Hardware performance counters might also be useful for determining whether there is a jump in memory references as one increases the number of distinct entries accessed at each level of the hierarchical translation (indicating overflow of the cache at that level).
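One way to set up such a test is to choose addresses whose stride equals the span mapped by one entry at the level under study, so that the number of distinct entries at that level grows while the levels above it stay fixed. The Python sketch below only plans the address pattern and counts distinct entries per level; it does no timing and reads no counters. It assumes 4 KiB pages with 4-level IA-32e paging, and the names `SPAN`, `probe_addresses`, and `distinct_entries` are illustrative.

```python
# Bytes of linear address space mapped by one entry at each level (4 KiB pages).
SPAN = {"pte": 1 << 12, "pde": 1 << 21, "pdpte": 1 << 30, "pml4e": 1 << 39}

def probe_addresses(level, count, base=0):
    """Return `count` addresses, each mapped by a different entry at `level`."""
    return [base + i * SPAN[level] for i in range(count)]

def distinct_entries(addresses):
    """Count the distinct paging-structure entries touched at every level."""
    return {lvl: len({a // span for a in addresses}) for lvl, span in SPAN.items()}

# 64 addresses at a 2 MiB stride touch 64 PDEs (and 64 PTEs),
# but stay within a single PDPTE and a single PML4E.
counts = distinct_entries(probe_addresses("pde", 64))
```

Sweeping `count` in a real benchmark and looking for a step in per-access latency, or in page-walk-related counter events, would then hint at the capacity of the cache at that level, if one exists.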

Following all of the cases through the documentation is difficult, but my current interpretation is that for my Xeon E5-2580 processors (Sandy Bridge EP) running in 64-bit mode ("IA-32e paging"), PCIDs are enabled, which means that the higher-level entries in the hierarchical page table structure are read with the PAT type from index 0 of the PAT MSR (0x277), which is "UC-" on this system. This is probably necessary because the paging-structure caches described in section 4.10.3 are augmented with 12-bit PCID values, and there is no place in the data caches to hold this extra information. If this interpretation is correct, then overflowing any level of the paging-structure cache will generate an uncached load from memory. This should be a lot easier to find via either timing or performance counters than a TLB walk that finds the paging-structure entries in the regular cache hierarchy.
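To check which memory type index 0 selects on a particular machine, the raw IA32_PAT value can be decoded once it has been read from MSR 0x277 (reading the MSR itself, for example through the Linux msr driver, is outside this sketch). The Python example below is a minimal decoder under that assumption; `PAT_TYPES` and `decode_pat` are illustrative names, and the input constant is the power-on default documented in the SDM, not the value from the system described above. An operating system may reprogram the MSR, so the running value can differ.

```python
# Memory-type encodings for IA32_PAT entries (values 2 and 3 are reserved).
PAT_TYPES = {0x00: "UC", 0x01: "WC", 0x04: "WT", 0x05: "WP", 0x06: "WB", 0x07: "UC-"}

def decode_pat(msr_value):
    """Return the memory types of PAT entries 0..7 from a 64-bit IA32_PAT value."""
    types = []
    for i in range(8):
        field = (msr_value >> (8 * i)) & 0x7  # low 3 bits of byte i hold entry i
        types.append(PAT_TYPES.get(field, "reserved"))
    return types

# Architectural power-on default of IA32_PAT; not a value measured on any system.
POWER_ON_DEFAULT = 0x0007040600070406
default_types = decode_pat(POWER_ON_DEFAULT)
```

Element 0 of the returned list is the type that applies when PAT index 0 is selected, so decoding the value actually read on the machine under test is a quick way to confirm or refute the interpretation above.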

drMikeT
New Contributor I

Hi John, thanks for the reply.

A follow-up question is whether, upon repeated L2 TLB misses, the paging structures also "pollute" (enter) the cores' cache hierarchies. I suspect that Intel may have some mechanism to throttle this, or the hardware page walks may simply leave the caches unaffected.

Mike
