TLB snooping

Salame__David · ‎01-28-2018

Hi Everybody,

I have a question that I have found hard to answer, and I would greatly appreciate your help.
Assumption: the latest Intel CPUs, desktop parts and/or the latest Xeons (running either x64 or IA64).

Preface:
In such a multicore environment,
if the OS needs to update the page tables for a process currently running on a core,
it will of course have to invalidate that VA's entry in the TLB.

And so the question is, what happens with the other cores?
Does the CPU automatically snoop that invalidation to the other cores using the interconnect / MESI (or equivalent) protocol?
Or must the OS halt the other cores, plant a "tlb_invalidate_va" instruction as the next one to execute, and continue after that (which might be how things were done in the past, but seems extremely inefficient to me)?

Surely the OS can optimize this a bit and check with the scheduler whether any of the other cores is currently running this context (or ran it in the past, which for a TLB with tagging support means an outdated entry might still be there). But for any app with threads this will still generate the behavior described above, which seems extremely inefficient for such modern processors (i.e., does this really happen for every mmap call?!).

* Another small, less urgent question: half of the sources on the net say that, unlike ARM, x86 only has a "flush all TLB" command, and the other half say that modern Intel CPUs can flush the TLB per address. Does anyone know whether such an option is available in modern CPUs?

Thank you all; I greatly appreciate any answer,
David.

McCalpinJohn · ‎01-31-2018

As far as I can tell, all of this is well documented in the Intel 64 and IA-32 Architectures Software Developer's Manual set, available at https://software.intel.com/en-us/articles/intel-sdm

Volume 1 ("Basic Architecture") only contains a few references to TLB invalidation, but Section 5.20 lists the "System Instructions", including the instructions to invalidate TLBs and PCIDs. Section 5.22 mentions the TLB-management instructions when using Intel's Virtual Machine Extensions (VMX). Section 16.3.8.1 notes a potential interaction between TLB management instructions and Intel's Transactional Synchronization Extensions (TSX).

Volume 2 ("Instruction Set Reference") lists all of the supported instructions in current implementations. The description of the INVLPG (Invalidate TLB Entries) instruction points to other important material, including the "MOV (Move to/from control registers)" instruction in the same volume and an important section in Volume 3.

Volume 3 ("System Programming Guide") contains the detailed descriptions. Section 4.10 discusses "Caching Translation Information" and 4.10.4 covers invalidation of TLBs and paging-structure caches, including (4.10.4.1) a list of all operations that can invalidate TLB and paging-structure entries, (4.10.4.2) recommended procedures for invalidation in various cases, (4.10.4.3) cases in which software may choose not to invalidate, and (4.10.4.4) cases in which invalidation can be delayed.

Volume 4 ("Model-Specific Registers") does not contain anything directly related to your questions, but it is a very interesting read :-)

Salame__David · ‎01-31-2018

Thanks John !
I'll go over these for sure.

But the question is how the instructions and conditions mentioned above (including those in the references you listed) interact in a multi-core setting. All of these are instructions that the OS issues on each core, so does that mean that for a coherent page table update, the OS must do a "shootdown" of the other cores by raising an interrupt on each of them, in which each core runs the invalidation for that TLB entry and then resumes as usual?

I was hoping for some automatic HW handling of this, but the more I read, the more it seems that the HW lacks such support?!

Thanks in advance,

David.

McCalpinJohn · ‎01-31-2018

The most common method to flush the TLBs is as part of the process context switch code. The "MOV to CR3" instruction changes the CR3 register, which points to the top of the process's private page tables. As a side effect, the TLB (and other page translation caches) are invalidated by the change to CR3. This only needs to be done on the Logical Processor where the context switch is occurring -- if other Logical Processors are running with the same CR3, they can continue using those page translations. If other Logical Processors are running with different CR3 values, they are not affected by the context switch.
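
To make the "local only" behavior concrete, here is a minimal toy model in Python (not kernel code; the real instruction is privileged). The class `LogicalProcessor`, its method `mov_to_cr3`, and the dictionary-based page tables are hypothetical names invented for this sketch. It assumes no PCIDs and no global pages, so a CR3 load drops every cached translation on the processor that executes it and nothing anywhere else.

```python
class LogicalProcessor:
    """Toy model of one logical processor with a private TLB (hypothetical)."""

    def __init__(self, cr3):
        self.cr3 = cr3   # which set of page tables this processor uses
        self.tlb = {}    # virtual page number -> physical frame number

    def translate(self, vpage, page_tables):
        # On a TLB miss, walk the page tables selected by CR3 and cache the result.
        if vpage not in self.tlb:
            self.tlb[vpage] = page_tables[self.cr3][vpage]
        return self.tlb[vpage]

    def mov_to_cr3(self, new_cr3):
        # Assumption: no PCIDs, no global pages, so the whole local TLB is dropped.
        self.cr3 = new_cr3
        self.tlb.clear()


# Two address spaces that map the same virtual page to different frames.
page_tables = {"proc_a": {0x10: 0x100}, "proc_b": {0x10: 0x200}}

lp0 = LogicalProcessor("proc_a")
lp1 = LogicalProcessor("proc_a")
lp0.translate(0x10, page_tables)
lp1.translate(0x10, page_tables)

# Context switch on lp0 only: lp1 keeps its cached translation.
lp0.mov_to_cr3("proc_b")
```

After the switch, `lp0.tlb` is empty while `lp1.tlb` still holds the `proc_a` translation, which is the point of the paragraph above.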

For the case of changing a single page translation (e.g., to swap the page to disk), then the INVLPG instruction is used. On a multi-processor system, the "brute force" approach would require setting up an inter-processor interrupt to every Logical Processor in the system to get them to execute the instruction. There are a number of conditions under which it may be possible to prove that this is not necessary. For example, if the page is "private", only Logical Processors running with the same CR3 value would need to run the INVLPG. It is not clear to me whether Linux does any of these sorts of optimizations.... When the Linux kernel needs to invalidate multiple pages, it has code to decide whether it is faster to perform the sequence of INVLPG instructions or use a MOV to CR3 to just flush everything.
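
The two decisions described above (which logical processors need an interrupt, and whether to invalidate page by page or flush everything) can be sketched roughly as follows. This is only an illustration under stated assumptions: `FLUSH_CEILING`, `plan_flush`, and `ipi_targets` are hypothetical names, the threshold value is arbitrary, the page size is assumed to be 4 KiB, and `start` is assumed to be page-aligned. It is not taken from any particular kernel.

```python
PAGE_SIZE = 4096     # assumption: 4 KiB pages only
FLUSH_CEILING = 32   # hypothetical threshold, chosen arbitrarily for this sketch


def plan_flush(start, end, ceiling=FLUSH_CEILING):
    """Return the page addresses to invalidate one by one,
    or None to mean 'cheaper to flush the whole TLB'."""
    npages = (end - start + PAGE_SIZE - 1) // PAGE_SIZE
    if npages > ceiling:
        return None
    return [start + i * PAGE_SIZE for i in range(npages)]


def ipi_targets(cr3_by_cpu, target_cr3, sender):
    """For a private page, only processors currently running with the
    same CR3 (other than the sender itself) need an interrupt."""
    return sorted(cpu for cpu, cr3 in cr3_by_cpu.items()
                  if cr3 == target_cr3 and cpu != sender)


small = plan_flush(0x1000, 0x3000)          # two pages: invalidate individually
large = plan_flush(0x0, 64 * PAGE_SIZE)     # 64 pages: flush everything instead
targets = ipi_targets({0: "A", 1: "B", 2: "A", 3: "C"}, "A", sender=0)
```

Here `small` is a list of two page addresses, `large` is `None`, and `targets` contains only CPU 2. A real OS would also have to account for processors that ran the address space earlier and may still hold tagged entries.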

Salame__David · ‎01-31-2018

I see.

But when you change CR3, you change it only for the current processor (core), right?
So if, for example, I have 2 threads running in the same process, each on a different core,
and I change a page table entry from one of these cores and then flush everything (on that specific core) by modifying CR3, the second core, which still runs the original process, is not flushed by the CR3 change on the first core, right? So its TLB is not cleared and it will use the old mappings.

Is that correct?

If so, it appears to me that indeed, originating a sw-interrupt on the other relevant cores is the only way to go.
Unless I'm missing something here.

McCalpinJohn · ‎02-01-2018

You are correct -- changing a page table entry requires that the OS set up an interprocessor interrupt to each logical processor that may have a valid copy of the target Page Table Entry, so that each logical processor runs the required INVLPG instruction locally.

This is the minimal requirement -- flushing any superset of the target TLB entry on the target logical processors obviously satisfies the same requirement. In the most extreme case, this could include flushing all TLBs on all logical processors -- though that would obviously invite performance problems....
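
A toy Python model of this sequence may help; it is a sketch, not how any OS implements it. The names `Cpu`, `pending_ipis`, `handle_ipis`, and `shootdown` are all hypothetical, and the interprocessor interrupt is modeled as a simple per-CPU list that the target drains synchronously, whereas a real kernel would send the interrupt and wait for an acknowledgement.

```python
class Cpu:
    """Toy logical processor with a private TLB and an IPI mailbox (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.tlb = {}            # virtual page -> physical frame
        self.pending_ipis = []   # pages this CPU has been asked to invalidate

    def translate(self, vpage, page_table):
        if vpage not in self.tlb:          # TLB miss: walk the page table
            self.tlb[vpage] = page_table[vpage]
        return self.tlb[vpage]

    def invlpg(self, vpage):
        self.tlb.pop(vpage, None)          # local invalidation only

    def handle_ipis(self):
        while self.pending_ipis:
            self.invlpg(self.pending_ipis.pop(0))


def shootdown(initiator, others, vpage):
    initiator.invlpg(vpage)
    for cpu in others:
        cpu.pending_ipis.append(vpage)     # models sending the IPI
    for cpu in others:
        cpu.handle_ipis()                  # models the remote handler running


page_table = {0x10: 0x100}
cpu0, cpu1 = Cpu("cpu0"), Cpu("cpu1")
cpu0.translate(0x10, page_table)
cpu1.translate(0x10, page_table)

page_table[0x10] = 0x999                   # OS on cpu0 changes the mapping
cpu0.invlpg(0x10)                          # ...and invalidates only locally
stale = cpu1.translate(0x10, page_table)   # cpu1 still sees the old frame

shootdown(cpu0, [cpu1], 0x10)
fresh = cpu1.translate(0x10, page_table)   # cpu1 now re-walks the page table
```

In this model `stale` is the old frame and `fresh` is the new one. Replacing `invlpg` in the handler with a full `tlb.clear()` would satisfy the same requirement, at a higher cost.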

This is discussed in Section 4.10.5 "Propagation of Paging-Structure Changes to Multiple Processors" in Volume 3 of the Intel 64 and IA-32 Architectures Software Developer's Manual.

Other architectures have different methods for invalidation of page tables, including snooping the page table addresses (forcing invalidation using the normal cache coherence mechanisms), or specialized coherence transactions that target page translations. Both of these approaches are external invalidations that don't require that each core run a "local invalidate" instruction.

The x86 approach also requires the normal cache coherence mechanisms to invalidate any copies of the page table entries that are in the data caches. Different levels of the 4-level page translation hierarchy can have different caching rules, but my understanding is that the most common configuration accesses the bottom two levels of the hierarchy (Page Table Entries and Page Directory Entries) via normal cached accesses, so when any of those addresses are written, the normal cache coherence mechanisms will invalidate the corresponding cached entries. Obviously one has to be careful of ordering when writing the page table update code -- you need to be sure that the stale PTEs are invalidated from the cache before you execute the INVLPG instruction, or on return from the interprocessor interrupt, the process might immediately miss in the TLB and find the stale PTE in the cache. My recollection is that the two upper levels of the translation (PDPTE and PML4E entries) are typically accessed using uncached loads, so they will not be saved in the data caches. Instead, they are saved in the specialized "paging-structure caches" that are referred to in Intel's documentation (e.g., Section 4.10.4 "Invalidation of TLBs and Paging-Structure Caches"). Section 4.10.3 of Volume 3 of the SDM says that processors may also have a "paging-structure cache" for PDE entries, in which case those would probably not be loaded using cached accesses.
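
The ordering concern can be shown with a very small Python model. This is a deliberately simplified sketch: it ignores the data-cache layer entirely and only models the order of "update the PTE" versus "invalidate the TLB entry", with a memory access that may land in between. The function `run` and the step names are hypothetical.

```python
def run(order):
    """Apply the steps in 'order' to a one-entry toy system and return
    the frame that a final access to virtual page 0x10 observes."""
    page_table = {0x10: 0x100}   # old mapping
    tlb = {0x10: 0x100}          # the old mapping is also cached in the TLB

    def access():
        if 0x10 not in tlb:      # TLB miss: refill from the page table
            tlb[0x10] = page_table[0x10]
        return tlb[0x10]

    for step in order:
        if step == "write_pte":
            page_table[0x10] = 0x999   # install the new mapping
        elif step == "invalidate":
            tlb.pop(0x10, None)        # models INVLPG on this processor
        elif step == "access":
            access()                   # the process touches the page
        else:
            raise ValueError(step)
    return access()


# Invalidating first lets an intervening access re-cache the stale entry.
wrong = run(["invalidate", "access", "write_pte"])
# Updating the PTE first means any refill after the invalidation is fresh.
right = run(["write_pte", "invalidate", "access"])
```

With the first ordering the final access still returns the old frame, because the TLB was refilled from the not-yet-updated page table. With the second ordering it returns the new frame.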

When reading the documentation one can easily become confused by all the possible configurations. I have found it helpful to have one specific OS version in mind, so when I come to a fork in the configuration space in the documentation, I can log in to the OS I am studying and try to figure out which branch of the documentation I should follow. This is not always easy, but between runtime configuration checks and source code perusal, it is usually possible to figure out what features the OS is trying to use.