On-chip cache-to-cache intervention latency for modified data is generally going up as core counts increase and frequencies remain flat.
I don't see any reason for this to change in the next generation Xeon processors -- the number of cores is expected to increase and the uncore frequency is expected to stay flat or decrease.
In general, implementation changes that allow further increases in throughput (e.g., Cluster on Die) can give slightly better intervention latencies within the "cluster" (i.e., half of the cores in a package), but at the cost of significantly higher cache-to-cache intervention latencies outside of the "cluster".
On a Xeon E5-2660 v4 (Broadwell EP, 10-core, 2.0 GHz nominal) the Intel Memory Latency Checker tool reports L2 to L2 modified intervention latency (same socket) of 40.3-42.3 ns, depending on the snoop mode of the processor and the frequency of the uncore. Since only 2 threads are needed to run this test, the cores are probably running at their max Turbo frequency of 3.0 GHz, making this correspond to 121-127 core cycles. If I switch to "Cluster on Die" mode, the value is reduced slightly to 37.6 ns, or about 113 core cycles.
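For anyone who wants to get a rough number like this without the Intel Memory Latency Checker, a minimal core-to-core "ping-pong" sketch is shown below. It is not the MLC methodology -- the pinning to cores 0 and 1, the single shared flag, and the iteration count are only illustrative assumptions -- but halving the measured round-trip time gives a rough estimate of the modified-line intervention latency.

```c
/* Rough core-to-core "ping-pong" latency sketch (not the Intel MLC method).
 * Two threads, pinned to cores 0 and 1 (assumed to be distinct physical
 * cores), bounce a flag back and forth; the elapsed time divided by
 * 2*ITERS approximates the modified-line intervention latency.
 * Build: gcc -O2 -pthread pingpong.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000

static _Alignas(64) atomic_int flag = 0;   /* the one shared cache line */

static void pin(int cpu) {
    cpu_set_t s;
    CPU_ZERO(&s);
    CPU_SET(cpu, &s);
    pthread_setaffinity_np(pthread_self(), sizeof(s), &s);
}

static void *pong(void *arg) {
    (void)arg;
    pin(1);
    for (int i = 0; i < ITERS; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 1) ;
        atomic_store_explicit(&flag, 0, memory_order_release);
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    struct timespec t0, t1;
    pin(0);
    pthread_create(&t, NULL, pong, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        atomic_store_explicit(&flag, 1, memory_order_release);
        while (atomic_load_explicit(&flag, memory_order_acquire) != 0) ;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("approx one-way transfer latency: %.1f ns\n", ns / (2.0 * ITERS));
    return 0;
}
```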
The producer/consumer use case is one that can be helped by forcing the uncore frequency to stay at the maximum value. For high-bandwidth workloads (like STREAM), the "energy efficient Turbo" mechanisms will quickly ramp the uncore frequency to the maximum, but a producer/consumer code may not generate enough traffic to cause the hardware to boost the uncore frequency.
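As a hedged illustration of pinning the uncore: on Haswell/Broadwell EP the uncore ratio limits are exposed through MSR 0x620 (UNCORE_RATIO_LIMIT), with (as I understand the layout) bits 6:0 holding the maximum ratio and bits 14:8 the minimum ratio, in units of 100 MHz. The sketch below writes min == max through the Linux msr driver; verify the MSR address and field layout against Intel's documentation for your specific part before using it.

```c
/* Sketch: pin the uncore frequency on a Haswell/Broadwell EP by writing
 * MSR 0x620 (UNCORE_RATIO_LIMIT) so that the minimum ratio equals the
 * maximum ratio.  Assumed layout: bits 6:0 = max ratio, bits 14:8 = min
 * ratio, in units of 100 MHz.  The MSR is per-package, so write it via
 * one core on each socket.  Requires root and the 'msr' kernel module.
 * Build: gcc -O2 pin_uncore.c ; run: sudo ./a.out 27   (27 => 2.7 GHz) */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define UNCORE_RATIO_LIMIT 0x620

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <ratio>\n", argv[0]);
        return 1;
    }
    uint64_t ratio = strtoul(argv[1], NULL, 0) & 0x7f;
    uint64_t val = (ratio << 8) | ratio;          /* min = max = ratio */
    int fd = open("/dev/cpu/0/msr", O_WRONLY);    /* core 0 -> package 0 */
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }
    if (pwrite(fd, &val, sizeof(val), UNCORE_RATIO_LIMIT) != sizeof(val)) {
        perror("pwrite"); return 1;
    }
    close(fd);
    printf("uncore min/max ratio set to %lu (%.1f GHz)\n",
           (unsigned long)ratio, ratio / 10.0);
    return 0;
}
```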
There are lots of ways to make producer/consumer latency better, but to get the best results this needs to be associated with a new, explicitly visible hardware mechanism for low-latency interprocessor communication. I personally think it is inevitable that such mechanisms will be developed, but it is not yet clear whether this will ever happen in the x86 architecture.
John,
This is a rather esoteric question relating to your last post.
I am assuming that the latencies you reported were for data written to one cache to be read from the other cache. The write-to-read is conceptually one operation, but in fact it is several. The write to one L2 posts an invalidate to the other L2(s). In the single-producer, single-consumer case, say the producer is bursting in a batch of entries (push interval less than the L2-to-L2 modified intervention latency): does each subsequent push prevent the other core from performing the read (e.g., of the fill pointer) until the burst is complete? If so, then the latency could be much worse.
Jim Dempsey
I usually implement a producer/consumer code using "data" and "flag" in separate cache lines. This enables the consumer to spin on the flag while the producer updates the data. When the data is ready, the producer writes the flag variable. At a low level, this handoff is not a single operation -- it is a sequence of coherence transactions (RFOs, invalidations, and interventions) between the two cores' caches.
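Purely as an illustration (not necessarily the exact code behind any of the numbers discussed in this thread), a minimal C11 version of the two-cache-line handoff might look like the sketch below, with alignas(64) keeping "data" and "flag" in separate lines and release/acquire ordering making the flag visible only after the data:

```c
/* Minimal sketch of the "data" and "flag" in separate cache lines pattern
 * (single producer, single consumer; illustration only).
 * alignas(64) keeps the two variables in different cache lines, so the
 * consumer's polling does not steal the line the producer is updating.
 * Build: gcc -O2 -pthread handoff.c */
#include <stdalign.h>
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static alignas(64) long data;            /* payload cache line   */
static alignas(64) atomic_int flag = 0;  /* flag in its own line */

static void *producer(void *arg) {
    (void)arg;
    data = 42;                                              /* update the data line  */
    atomic_store_explicit(&flag, 1, memory_order_release);  /* then publish the flag */
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                                   /* spin on the flag line */
    printf("consumer saw data = %ld\n", data);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```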
There are many variants and many details that can be non-intuitive in an actual implementation. These often involve extra round trips required to ensure ordering in ugly corner cases. A common example is maintaining global ordering across stores that are handled by different coherence controllers. This can be different L3 slices (and/or Home Agents) in a single package, or the more difficult case of stores that alternate across independent packages.
There are fewer steps in the case where the "data" and "flag" are in the same cache line, but extra care needs to be taken in that case because it is easier for the polling activity of the consumer to take the cache line away from the producer before it has finished doing the updates to the data part of the cache line. This can result in more performance variability and reduced total performance, especially in cases with multiple producers and multiple consumers (with locks, etc.).
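For comparison, a sketch of the single-cache-line variant is below (again purely illustrative). Packing both fields into one 64-byte struct saves a line transfer, and the _mm_pause() in the consumer's spin loop is one common way to make the polling less aggressive about pulling the line away from the producer mid-update:

```c
/* Sketch of the "data" and "flag" in the SAME cache line variant.
 * The pause in the spin loop reduces how often the consumer's polling
 * pulls the line away from the producer before it finishes its updates.
 * Build: gcc -O2 -pthread same_line.c */
#include <stdalign.h>
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>
#include <immintrin.h>                   /* _mm_pause() */

struct msg { long data; atomic_int flag; };
static alignas(64) struct msg m = { 0, 0 };   /* both fields in one 64-byte line */

static void *producer(void *arg) {
    (void)arg;
    m.data = 42;                                              /* same line as the flag */
    atomic_store_explicit(&m.flag, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&m.flag, memory_order_acquire) == 0)
        _mm_pause();                     /* back off between polls */
    printf("consumer saw data = %ld\n", m.data);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```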
Excellent !!
Your discussion provides critical information for use in developing high(er)-throughput queue systems. And it indicates that a simplified coding of a queue will likely yield a lower-performing function.
Thanks for the information.
Jim Dempsey
@"Dr. Bandwidth"'s excellent #4 response above - answers my question at:
https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/509417#com...
So am I to believe that the cache invalidation strategy described in Comment #4 above is the only way for a program to explicitly and atomically transfer a 64-bit value from one physical (or logical/hyperthreaded) core to another without going out to RAM and back, on a Haswell-MB 4910 chip that does not have QPI?
I believe I am correct to assume my mobile Haswell chip does not have a QPI ring bus?
If I am wrong, is there no way for code to explicitly use the QPI ring bus (beyond arranging that a consumer core is trying to access an address that has just been written by a producer core to the L2/L3 caches, not flushed, and the consumer is prefetching/reading that address)?
The Core i7-4910MQ does have an internal ring that connects the 4 cores with each other and with the 4 slices of the L3. See, for example, slide 7 of Intel's presentation at https://www.hotchips.org/wp-content/uploads/hc_archives/hc25/HC25.80-Processors2-epub/HC25.27.820-Ha...
The sequence of events for a two-cache-line producer/consumer setup is similar to what I outlined above, except that there is no case where you might need to send invalidate commands to another chip....
Hi Dr. Bandwidth,
Thank you for these extremely detailed explanations. I was exploring this area recently, and I found them very informative. I'm very confused about Intel's underlying cache coherence implementation; I find many mismatches among online resources.
For the above case, I still have a couple of doubts:
In general, I also have a couple of doubts about the coherence transaction flow:
The reason I'm asking is that I found the following descriptions in (https://software.intel.com/en-us/download/intel-xeon-processor-e5-and-e7-v3-family-uncore-performance-monitoring-reference-manual) -- especially the part regarding the Cbo -- and they don't seem to be very clear:
I've also read documents like the QPI spec and the MESIF paper. The protocols just seem to differ.
Thank you for your time.
The interaction between the L3 slices and the Home Agents varies by processor family. For processors up to Broadwell, a chip typically had one or two home agents, while Skylake Xeon distributes the Home Agent's responsibility to the "CHA" units -- one of which is co-located with each of the L3 slices. On SKX/CLX, the L3 is no longer inclusive, but the CHA includes a "Snoop Filter" that contains the location information that the L3 directory would have provided if the L3 were inclusive. SKX also enables "memory directories" by default, which can be used to filter remote snoops or otherwise change the timing properties of the local and remote portions of the coherence transaction.
If the memory belongs to a different socket, the entire flow is different. On the initial cache miss, the read request (or RFO request) is not sent to the CHA+L3, but is sent instead to the QPI (or UPI) interface to be sent to the address's "home" socket. At the "home" socket, the read request is sent to the L3 slice that owns the address for processing, etc.
Address to L3 (or CHA+L3) hashing is usually per-socket. (In some processors a socket can be split into multiple NUMA nodes that can be hashed independently, but address hashing is never global.)
For local memory, the "traditional" approach is to send a snoop request to the other socket(s) in parallel with the local L3 lookup. This provides the most overlap and the lowest latency. Unfortunately as processors have gotten faster, the snoop traffic can use a large fraction of the chip-to-chip interconnect bandwidth. So SKX enables "memory directories" by default. If the UPI links are busy, the processor can wait to get the cache line from local memory before deciding on the remote snoop. If the directory bit says that no other chip has permission to have a dirty copy of the line, then the data from memory is valid, and (for reads) no remote snoop is required. This reduces the link utilization at the cost of higher latency for loads that do need to snoop the other socket.
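A purely conceptual sketch of that decision (invented structure and function names; the real hardware state machine and policies are not documented at this level) is:

```c
/* Conceptual model of the "memory directory" decision at the home socket
 * for a read that missed in the local L3.  The directory bit stored with
 * the line in DRAM says whether any OTHER socket could hold a dirty copy.
 * (Invented names; not a real hardware interface.) */
#include <stdbool.h>
#include <stdio.h>

struct mem_line {
    unsigned char data[64];
    bool remote_may_be_dirty;   /* the "memory directory" bit (conceptual) */
};

/* Returns true if a snoop of the other socket(s) is still required. */
static bool read_needs_remote_snoop(const struct mem_line *line, bool upi_links_busy)
{
    if (upi_links_busy) {
        /* Wait for the DRAM read, then consult the directory bit: if no
         * other socket can hold a dirty copy, the memory data is valid. */
        return line->remote_may_be_dirty;
    }
    /* Links lightly loaded: snoop in parallel with the DRAM read
     * ("traditional" flow) for the lowest latency. */
    return true;
}

int main(void)
{
    struct mem_line clean = { {0}, false };
    printf("busy links, clean directory bit -> remote snoop needed: %d\n",
           read_needs_remote_snoop(&clean, true));
    printf("idle links                      -> remote snoop needed: %d\n",
           read_needs_remote_snoop(&clean, false));
    return 0;
}
```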
It is certainly true that the documentation is very vague.... The most interesting "documentation" is implicit in the uncore performance counter events -- particularly the opcode matching functions.
Hi Dr. Bandwidth,
Thank you for your reply.
McCalpin, John (Blackbelt) wrote:If the memory belongs to a different socket, the entire flow is different. On the initial cache miss, the read request (or RFO request) is not sent to the CHA+L3, but is sent instead to the QPI (or UPI) interface to be sent to the address's "home" socket. At the "home" socket, the read request is sent to the L3 slice that owns the address for processing, etc.
Address to L3 (or CHA+L3) hashing is usually per-socket. (In some processors a socket can be split into multiple NUMA nodes that can be hashed independently, but address hashing is never global.)
I'm a little confused about the above two paragraphs. The first paragraph seems to indicate that an initial L2 miss is able to send the read request (or RFO) directly to QPI without sending the request to the local socket's L3 slice. That implies the address-to-L3 hashing is global. For example, the physical address range [0-256MB] would map only to an L3 slice on socket 1, and not to any slice on socket 0. But your second paragraph says the hashing is per-socket. Which one is correct?
McCalpin, John (Blackbelt) wrote:For local memory, the "traditional" approach is to send a snoop request to the other socket(s) in parallel with the local L3 lookup. This provides the most overlap and the lowest latency. Unfortunately as processors have gotten faster, the snoop traffic can use a large fraction of the chip-to-chip interconnect bandwidth.
Are you referring to the "Source Snoop" model here? Can we assume "source snoop" has the following general steps: 1) the L2 cache controller broadcasts snoop messages to all other cores' L2 cache controllers, and also sends snoops to the local L3 and to QPI (for other sockets); 2) the other parties reply to the snoop, sending the reply to either the L3 slice or the home agent (which one is unclear)?
McCalpin, John (Blackbelt) wrote:It is certainly true that the documentation is very vague.... The most interesting "documentation" is implicit in the uncore performance counter events -- particularly the opcode matching functions
Indeed. I've probed around the PMU manuals and found a lot of related events/opcodes, etc., but it only makes me more confused. For models before Skylake Scalable, the L3 caching agent and the home agent seem to provide similar functionality, in that both of them have a directory (is this correct?), which means both of them can serialize transactions when there are conflicts.
I'm leaning more towards believing the following: in the home snoop model, the L2 sends requests to the L3 caching agent (either on the local socket or on a remote socket via QPI). Since the L3 caching agent can have a directory, it is able to send the minimum snoop traffic, while the home agent holds the final decision on whether a coherence transaction is complete or not. That means the home agent will send the final ACK to the requesting core. Do you think this is correct?
Thank you.
Agents within the processor that need to route address-related requests all have a copy of the "Source Address Decoder" (SAD) registers. These allow the agent to determine the NUMA node responsible for any physical address. So on an L2 miss, the interface to the ring uses the SAD registers to decide how to process the request:
- If the address is local, it is hashed to determine which L3 (or L3+CHA) needs to handle the request.
- If the SAD registers indicate that the address belongs to another NUMA node, the request is sent to the appropriate QPI/UPI interface, where it is transported to the "home" node for the address.
The "Target Address Decoder" (TAD) registers are used are used within each NUMA node to determine the mapping of physical addresses to memory controllers and memory channels.
There is a nice report at https://publikationen.bibliothek.kit.edu/1000073678 that provides an overview of the address translation process on Xeon processors.
The "snoop modes" that are available on some processors (notably Haswell/Broadwell) change some of the transactions.
Intel processors don't broadcast snoops to the other L2 caches on L2 misses -- there is nowhere near enough L2 tag bandwidth to handle all these requests. This is where the address hash is critical! Each physical address is owned by exactly one L3 slice (and one Home Agent), so on an L2 miss (to a local address), the request is sent to the L3 slice that owns the address. For processors with inclusive L3 caches, that L3 slice is guaranteed to have tracking information for the cache line address if any local core has a copy of the line in its private L1 or L2 caches. For processors without inclusive L3 caches (Skylake Xeon, Cascade Lake Xeon), there is a "snoop filter" in each CHA that is guaranteed to have the same tracking information for all lines in (local) L1 and/or L2 caches. For dirty lines, the L3 (or snoop filter) sends an intervention request to the core that has permission to have a dirty copy of the line.
For widely shared cache lines there are several options for processing. Some systems track all sharers and send a separate invalidate message to each. Some systems send a full broadcast invalidation. It is also possible to send a broadcast invalidate with a bit mask of processor numbers who need to process the invalidation (so if a processor receives such an invalidation it checks its bit in the mask and only has to snoop its cache tags if its bit is set). It is also possible to have a hybrid system that changes behavior depending on system load and/or energy efficiency requirements. It is also possible to combine such a system with other history+predictor mechanisms to handle coherence differently for certain classes of traffic. (One example is producer-consumer traffic -- when a Load hits a line in M state in another cache, most protocols return the data in Shared state and downgrade the data to Shared state. For producer/consumer "ping-pong" patterns, it is more efficient to transfer the cache line in M state (invalidating it at the previous home), so that the "consumer" can immediately update it.)
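A conceptual sketch of the bit-mask variant (invented function names, nothing hardware-specific):

```c
/* Conceptual sketch of a broadcast invalidation that carries a sharer bit
 * mask: each core checks its own bit and only spends tag-lookup bandwidth
 * if that bit is set. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Does this core need to look up its cache tags for the invalidation? */
static bool must_snoop(uint64_t sharer_mask, int my_core_id)
{
    return (sharer_mask & (1ULL << my_core_id)) != 0;
}

int main(void)
{
    uint64_t mask = (1ULL << 3) | (1ULL << 17);   /* cores 3 and 17 may share the line */
    for (int core = 0; core < 24; core++) {
        if (must_snoop(mask, core))
            printf("core %d snoops its tags and invalidates\n", core);
        /* all other cores drop the message without touching their tags */
    }
    return 0;
}
```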
As you note, there are also multiple possible paths for the return traffic. Maintaining consistency is easier if the data is sent to the L3 first, and then to the requesting core, but it is also possible to send to both at the same time (e.g., "Direct2Core"). In recent processors, these return paths are chosen dynamically based on undocumented states and settings of the processor.
McCalpin, John (Blackbelt) wrote:Agents within the processor that need to route address-related requests all have a copy of the "Source Address Decoder" (SAD) registers. These allow the agent to determine the NUMA node responsible for any physical address. So on an L2 miss, the interface to the ring uses the SAD registers to decide how to process the request:
- If the address is local, it is hashed to determine which L3 (or L3+CHA) needs to handle the request.
- If the SAD registers indicate that the address belongs to another NUMA node, the request is sent to the appropriate QPI/UPI interface, where it is transported to the "home" node for the address.
Got it. This further explains why, even when data is cached in the L3, using the local socket's memory is faster than using the remote socket's: the latter requires all coherence traffic to go through QPI/UPI.
McCalpin, John (Blackbelt) wrote:The "snoop modes" that are available on some processors (notably Haswell/Broadwell) change some of the transactions.
- In "Early Snoop" mode, an L3 miss from a core in socket 0 to an address in socket 1 will send a "snoop" request from socket 0 to socket 1.
- In "Home Snoop" mode, an L3 miss from a core in socket 0 to an address in socket 1 will send a "read request" from socket 0 to socket 1 -- then socket 1 will generate any required snoops.
- In a 2-socket system, no additional snoops are required -- the request already missed in socket 0, and in socket 1 the read request will be processed by the L3/CHA slice that owns the address.
- In a four-socket system, socket 1 would be responsible for sending snoops to sockets 2 and 3 (unless a directory indicated that these snoops were not necessary).
- Many variations are possible in both the flow and the timing. Some of the details can be reverse-engineered using the uncore performance counters, but it is important to note that Intel's recent processors have dynamically adaptive behavior. The transaction sequence that you see in a microbenchmark may change for processors with different core counts, for the same processor with higher or lower loading on the ring -- or potentially even for different settings of the energy-performance bias....
Clear enough. No matter which model we are using, the L2 won't be the initiator of any coherence traffic. An L2 miss will go to its corresponding L3 first, which will act on the L2's behalf. Knowing that the L3 is always the initiator makes things much clearer.
According to the uncore PMU manual, "the R3QPI agent implements a latency-reducing optimization for dual sockets which issues snoops within the socket for incoming requests as well as a latency reducing optimization to return data satisfying Direct2Core (D2C) requests." Because it will issue snoops for incoming requests, I guess it is only used in "Home Snoop" mode, right?
McCalpin, John (Blackbelt) wrote:It is also possible to send a broadcast invalidate with a bit mask of processor numbers who need to process the invalidation (so if a processor receives such an invalidation it checks its bit in the mask and only has to snoop its cache tags if its bit is set). It is also possible to have a hybrid system that changes behavior depending on system load and/or energy efficiency requirements. It is also possible to combine such a system with other history+predictor mechanisms to handle coherence differently for certain classes of traffic.
Reminds me of: 1) Multicast snoop: https://ieeexplore.ieee.org/document/765959, 2) Destination-set prediction, https://www.cis.upenn.edu/~milom/papers/isca03_destination_set_prediction.pdf
McCalpin, John (Blackbelt) wrote:(One example is producer-consumer traffic -- when a Load hits a line in M state in another cache, most protocols return the data in Shared state and downgrade the data to Shared state. For producer/consumer "ping-pong" patterns, it is more efficient to transfer the cache line in M state (invalidating it at the previous home), so that the "consumer" can immediately update it.)
This one is really interesting!! The latter seems like a specific optimization for the case where a line will be modified by both the producer and consumer cores. If only one party will modify the line, returning the cache line to both the L3 and the requesting core in parallel would be a good-enough optimization.
Thank you for spending time explaining, Dr. Bandwidth. It really helps!
According to the uncore PMU manual, "the R3QPI agent implements a latency-reducing optimization for dual sockets which issues snoops within the socket for incoming requests as well as a latency reducing optimization to return data satisfying Direct2Core (D2C) requests." Because it will issue snoops for incoming requests, I guess it is only used in "Home Snoop" mode, right?
SKX only supports "Home Snoop" mode, but includes dynamically adaptive behavior to get most of the latency reduction benefit of "Early Snoop" mode as long as UPI utilization remains low.
I just noticed that the Xeon Scalable Memory Family Uncore Performance Monitoring Reference Manual (336274-001) Section 2.2.10 includes CHA events that distinguish between "directed" and "broadcast" snoops (and between local and remote sources of these snoops). For cache lines in E or M state, only one core (L1+L2) can have the cache line, so a "directed" snoop makes the most sense -- the other caches should not waste tag bandwidth on the lookup. For lines in S state, any number of private caches can have a copy of the line, so a broadcast (on-chip) makes sense. (For coherence over a small number of NUMA nodes (2-4), global broadcasting of snoops on shared lines may be tolerable, but for large NUMA node counts (32++) it is beneficial to track the nodes that (might) have a copy of a shared line, and restrict the snoops to those nodes. The snoop may (or may not) be broadcast *within* each NUMA node, but is typically not broadcast across *all* NUMA nodes.)