Software Tuning, Performance Optimization & Platform Monitoring
Discussion around monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform monitoring

TLB information provided by cpuid doesn't make sense to me

gostanian__richard
New Contributor I

I have a Skylake-SP based server (2 sockets, 24 cores per socket).

 

I ran cpuid to find the sizes of the TLB and the STLB. The following (edited) output appeared:

 

      0x63: data TLB: 1G pages, 4-way, 4 entries

      0x03: data TLB: 4K pages, 4-way, 64 entries

      0x76: instruction TLB: 2M/4M pages, fully, 8 entries

      0xb5: instruction TLB: 4K, 8-way, 64 entries

      0xc3: L2 TLB: 4K/2M pages, 6-way, 1536 entries

 

The first two lines seem to say that the data TLB can hold either 64 entries for 4K pages or 4 entries for 1G pages.

The last line seems to say that the STLB can hold either 1536 entries for 4K pages or 1536 entries for 2M pages.

Two questions occur.

1. How is it possible for the STLB to work with the DTLB when the DTLB has 1G pages and the STLB has either 4K or 2M pages? Shouldn't the two TLBs support the same page sizes?

2. In a system with huge pages, memory is going to have a mix of 4K and 1G pages. How do the TLBs work with a mixture of huge and small pages?

I could pose similar questions about the ITLB, but the answers to 1) and 2) should cover the ITLB situation as well.

 

1 Reply
McCalpinJohn
Black Belt

It can be tricky to understand the TLB info from the CPUID instruction, and I seem to recall times when it was incorrect.

For the Skylake Xeon processor, the TLB is the same as in the Skylake Client processor, which is described in more detail in the Intel Architectures Optimization Reference Manual (document 248966-045, February 2022), Table 2-15.

That table says:

  • ITLB
    • 4KiB pages: 128 entries, 8-way associative, dynamically shared between threads when HT enabled
    • 2MiB pages: 8 entries per thread, associativity not described
    • 1GiB pages: (does not cache 1GiB page entries)
  • DTLB
    • 4KiB pages: 64 entries, 4-way associative, statically split between threads when HT enabled
    • 2MiB pages: 32 entries, 4-way associative, statically split between threads when HT enabled
    • 1GiB pages: 4 entries, 4-way associative, statically split between threads when HT enabled
  • STLB (I+D)
    • 4KiB pages and 2MiB pages: 1536 entries, 12-way associative, statically split between threads when HT enabled
    • 1GiB pages: 16 entries, 4-way associative, statically split between threads when HT enabled

This is mostly consistent with your CPUID output.  
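
To put those entry counts in perspective, here is a rough Python sketch that turns them into "TLB reach" (entries times page size). The numbers are copied from the list above; the names `TLB_ENTRIES`, `tlb_reach`, and `human` are made up for this illustration, and the sketch ignores the split between threads when HT is enabled.

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30

# Entry counts taken from the table quoted above (Skylake, per core).
TLB_ENTRIES = {
    ("DTLB", "4KiB"): 64,
    ("DTLB", "2MiB"): 32,
    ("DTLB", "1GiB"): 4,
    ("STLB", "4KiB"): 1536,  # same 1536 entries are shared with 2MiB pages
    ("STLB", "2MiB"): 1536,
    ("STLB", "1GiB"): 16,
}
PAGE_BYTES = {"4KiB": 4 * KIB, "2MiB": 2 * MIB, "1GiB": GIB}


def tlb_reach(level, page):
    """Bytes of address space covered if every entry holds this page size."""
    return TLB_ENTRIES[(level, page)] * PAGE_BYTES[page]


def human(nbytes):
    """Format a byte count using the largest binary unit that fits."""
    for unit, size in (("GiB", GIB), ("MiB", MIB), ("KiB", KIB)):
        if nbytes >= size:
            return f"{nbytes / size:g} {unit}"
    return f"{nbytes} B"


for (level, page) in TLB_ENTRIES:
    print(f"{level} {page:>4}: {human(tlb_reach(level, page))}")
```

The STLB rows for 4KiB and 2MiB pages are upper bounds that cannot both be reached at once, since those page sizes share the same entries.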

 

Functionally, there is no requirement that any level of the TLB support any particular non-default page sizes. 

Remember that the TLBs are just caches for page table information -- if the information is not available in a particular level of the TLB, the hardware engine just checks the next level of the TLB or activates the Page Table Walker to fetch the information from memory.

Most combinations of mixed-page support have existed in various generations -- going back to the Sandy Bridge core, the DTLB had 32 entries for 2MiB pages and 4 entries for 1GiB pages, but the STLB did not support either.  That simply meant that for 2MiB and 1GiB pages, the TLB was a single-level cache, so missing in the DTLB would activate the hardware Page Table Walker for those accesses.  The Page Table Walker would find the required page table entries in memory (often in the L1D or L2 cache), then cache that data in the DTLB (only).
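
The "TLBs are just caches" point can be sketched as a toy model in Python. This is only an illustration under several assumptions, not a description of the real hardware: the caller supplies the page size, capacity and eviction are not modeled, a hit in the second level does not refill the first, and the names `TlbLevel`, `translate`, and `sandy_bridge_like` are hypothetical. The page-size support per level follows the Sandy Bridge description above.

```python
from dataclasses import dataclass, field


@dataclass
class TlbLevel:
    name: str
    page_sizes: frozenset                        # page sizes this level can cache
    entries: dict = field(default_factory=dict)  # (vpn, page_size) -> pfn


def translate(levels, page_table, vpn, page_size):
    """Look up a translation; return (pfn, where_it_was_found)."""
    key = (vpn, page_size)
    for level in levels:
        # A level that cannot hold this page size is simply skipped.
        if page_size in level.page_sizes and key in level.entries:
            return level.entries[key], level.name
    # Missed everywhere: the page table walker reads the entry from memory
    # and fills every level that is able to hold this page size.
    pfn = page_table[key]
    for level in levels:
        if page_size in level.page_sizes:
            level.entries[key] = pfn
    return pfn, "page walk"


def sandy_bridge_like():
    """Two levels: DTLB holds 4K/2M/1G, STLB holds 4K only (as described above)."""
    dtlb = TlbLevel("DTLB", frozenset({"4K", "2M", "1G"}))
    stlb = TlbLevel("STLB", frozenset({"4K"}))
    return [dtlb, stlb]


levels = sandy_bridge_like()
page_table = {(0x10, "4K"): 0xA0, (0x3, "2M"): 0xB0}  # made-up mappings

print(translate(levels, page_table, 0x3, "2M"))   # first touch: page walk
print(translate(levels, page_table, 0x3, "2M"))   # cached in the DTLB only
print(translate(levels, page_table, 0x10, "4K"))  # first touch: page walk
del levels[0].entries[(0x10, "4K")]               # pretend the DTLB evicted it
print(translate(levels, page_table, 0x10, "4K"))  # still found in the STLB
```

In this model a 2M translation never lands in the STLB, so a DTLB miss for a 2M page always goes straight to the page walk, which is the single-level behavior described above.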

For mixed page sizes, Intel's descriptions are sometimes ambiguous (but have been getting better over time).  For the DTLB, it looks like the entries for different page sizes are independent resources.  For the STLB, the description for SKX looks like 1536 entries, each of which can hold translation information for either one 4KiB page or one 2MiB page.  For 1GiB pages, the STLB either has 16 independent entries or overlaps the entries for 4KiB/2MiB pages -- in the latter case the amount of contention would be so small that it would be hard to measure the decrease in effective STLB size when using 1GiB pages in addition to 4KiB/2MiB pages.
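
As a back-of-the-envelope check on why that contention would be hard to measure, here is a small Python calculation. It assumes the worst case of the "overlapping" reading, where all 16 1GiB translations displace entries from the shared 1536-entry pool; whether the hardware actually works this way is not something the manual states, and the variable names are just for illustration.

```python
STLB_SHARED_ENTRIES = 1536  # 4KiB/2MiB entries, from the table above
STLB_1G_ENTRIES = 16        # 1GiB entries, from the table above

# Worst case: every 1GiB translation occupies a slot that could otherwise
# have held a 4KiB or 2MiB translation.
lost_fraction = STLB_1G_ENTRIES / STLB_SHARED_ENTRIES
remaining = STLB_SHARED_ENTRIES - STLB_1G_ENTRIES

print(f"lost: {lost_fraction:.2%}, remaining entries: {remaining}")
# prints "lost: 1.04%, remaining entries: 1520"
```

A change of about one percent in effective STLB capacity would be hard to separate from run-to-run noise in most workloads.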

Things get weirder with the Ice Lake core -- the DTLB is split into separate structures for loads and stores, with very different sizes.  Fortunately the sharing is clearly described in Table 2-8 of the most recent version of the optimization reference manual (Revision 045, February 2022).

 

All of the Intel SW Developer manuals are available at:

https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html 
