On-chip cache-to-cache intervention latency for modified data is generally going up as core counts increase and frequencies remain flat.
I don't see any reason for this to change in the next generation Xeon processors -- the number of cores is expected to increase and the uncore frequency is expected to stay flat or decrease.
In general, implementation changes that allow further increases in throughput (e.g., Cluster on Die) can give slightly better intervention latencies within the "cluster" (i.e., half of the cores in a package), but at the cost of significantly higher cache-to-cache intervention latencies outside of the "cluster".
On a Xeon E5-2660 v4 (Broadwell EP, 10-core, 2.0 GHz nominal) the Intel Memory Latency Checker tool reports L2 to L2 modified intervention latency (same socket) of 40.3-42.3 ns, depending on the snoop mode of the processor and the frequency of the uncore. Since only 2 threads are needed to run this test, the cores are probably running at their max Turbo frequency of 3.0 GHz, making this correspond to 121-127 core cycles. If I switch to "Cluster on Die" mode, the value is reduced slightly to 37.6 ns, or about 113 core cycles.
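For anyone who wants to get a rough number like this without the Intel Memory Latency Checker, a minimal core-to-core "ping-pong" sketch is shown below. It is not the MLC methodology -- the pinning to cores 0 and 1, the single shared flag, and the iteration count are only illustrative assumptions -- but halving the measured round-trip time gives a rough estimate of the modified-line intervention latency.

```c
/* Rough core-to-core "ping-pong" latency sketch (not the Intel MLC method).
 * Two threads, pinned to cores 0 and 1 (assumed to be distinct physical
 * cores), bounce a flag back and forth; the elapsed time divided by
 * 2*ITERS approximates the modified-line intervention latency.
 * Build: gcc -O2 -pthread pingpong.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000

static _Alignas(64) atomic_int flag = 0;   /* the one shared cache line */

static void pin(int cpu) {
    cpu_set_t s;
    CPU_ZERO(&s);
    CPU_SET(cpu, &s);
    pthread_setaffinity_np(pthread_self(), sizeof(s), &s);
}

static void *pong(void *arg) {
    (void)arg;
    pin(1);
    for (int i = 0; i < ITERS; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 1) ;
        atomic_store_explicit(&flag, 0, memory_order_release);
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    struct timespec t0, t1;
    pin(0);
    pthread_create(&t, NULL, pong, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        atomic_store_explicit(&flag, 1, memory_order_release);
        while (atomic_load_explicit(&flag, memory_order_acquire) != 0) ;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("approx one-way transfer latency: %.1f ns\n", ns / (2.0 * ITERS));
    return 0;
}
```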
The producer/consumer use case is one that can be helped by forcing the uncore frequency to stay at the maximum value. For high-bandwidth workloads (like STREAM), the "energy efficient Turbo" mechanisms will quickly ramp the uncore frequency to the maximum, but a producer/consumer code may not generate enough traffic to cause the hardware to boost the uncore frequency.
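As a hedged illustration of pinning the uncore: on Haswell/Broadwell EP the uncore ratio limits are exposed through MSR 0x620 (UNCORE_RATIO_LIMIT), with (as I understand the layout) bits 6:0 holding the maximum ratio and bits 14:8 the minimum ratio, in units of 100 MHz. The sketch below writes min == max through the Linux msr driver; verify the MSR address and field layout against Intel's documentation for your specific part before using it.

```c
/* Sketch: pin the uncore frequency on a Haswell/Broadwell EP by writing
 * MSR 0x620 (UNCORE_RATIO_LIMIT) so that the minimum ratio equals the
 * maximum ratio.  Assumed layout: bits 6:0 = max ratio, bits 14:8 = min
 * ratio, in units of 100 MHz.  The MSR is per-package, so write it via
 * one core on each socket.  Requires root and the 'msr' kernel module.
 * Build: gcc -O2 pin_uncore.c ; run: sudo ./a.out 27   (27 => 2.7 GHz) */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define UNCORE_RATIO_LIMIT 0x620

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <ratio>\n", argv[0]);
        return 1;
    }
    uint64_t ratio = strtoul(argv[1], NULL, 0) & 0x7f;
    uint64_t val = (ratio << 8) | ratio;          /* min = max = ratio */
    int fd = open("/dev/cpu/0/msr", O_WRONLY);    /* core 0 -> package 0 */
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }
    if (pwrite(fd, &val, sizeof(val), UNCORE_RATIO_LIMIT) != sizeof(val)) {
        perror("pwrite"); return 1;
    }
    close(fd);
    printf("uncore min/max ratio set to %lu (%.1f GHz)\n",
           (unsigned long)ratio, ratio / 10.0);
    return 0;
}
```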
There are lots of ways to make producer/consumer latency better, but to get the best results this needs to be associated with a new, explicitly visible hardware mechanism for low-latency interprocessor communication. I personally think it is inevitable that such mechanisms will be developed, but it is not yet clear whether this will ever happen in the x86 architecture.
John,
This is a rather esoteric question relating to your last post.
I am assuming that the latencies you reported were for data written to one cache to be read from the other cache. The write-to-read is conceptually one operation, but in fact it is several. The write to one L2 posts an invalidate to the other L2(s). In the single-producer, single-consumer case, say the producer is bursting in a batch of entries (push interval less than the L2-to-L2 modified intervention latency): does each subsequent push prevent the other core from performing the read (e.g., of the fill pointer) until the burst is complete? If so, then the latency could be much worse.
Jim Dempsey
I usually implement a producer/consumer code using "data" and "flag" in separate cache lines. This enables the consumer to spin on the flag while the producer updates the data. When the data is ready, the producer writes the flag variable. At a low level, this handoff is not a single operation -- it is a sequence of coherence transactions (RFOs, invalidations, and interventions) between the two cores' caches.
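Purely as an illustration (not necessarily the exact code behind any of the numbers discussed in this thread), a minimal C11 version of the two-cache-line handoff might look like the sketch below, with alignas(64) keeping "data" and "flag" in separate lines and release/acquire ordering making the flag visible only after the data:

```c
/* Minimal sketch of the "data" and "flag" in separate cache lines pattern
 * (single producer, single consumer; illustration only).
 * alignas(64) keeps the two variables in different cache lines, so the
 * consumer's polling does not steal the line the producer is updating.
 * Build: gcc -O2 -pthread handoff.c */
#include <stdalign.h>
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static alignas(64) long data;            /* payload cache line   */
static alignas(64) atomic_int flag = 0;  /* flag in its own line */

static void *producer(void *arg) {
    (void)arg;
    data = 42;                                              /* update the data line  */
    atomic_store_explicit(&flag, 1, memory_order_release);  /* then publish the flag */
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                                   /* spin on the flag line */
    printf("consumer saw data = %ld\n", data);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```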
There are many variants and many details that can be non-intuitive in an actual implementation. These often involve extra round trips required to ensure ordering in ugly corner cases. A common example is maintaining global ordering across stores that are handled by different coherence controllers. This can be different L3 slices (and/or Home Agents) in a single package, or the more difficult case of stores that alternate across independent packages.
There are fewer steps in the case where the "data" and "flag" are in the same cache line, but extra care needs to be taken in that case because it is easier for the polling activity of the consumer to take the cache line away from the producer before it has finished doing the updates to the data part of the cache line. This can result in more performance variability and reduced total performance, especially in cases with multiple producers and multiple consumers (with locks, etc.).
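For comparison, a sketch of the single-cache-line variant is below (again purely illustrative). Packing both fields into one 64-byte struct saves a line transfer, and the _mm_pause() in the consumer's spin loop is one common way to make the polling less aggressive about pulling the line away from the producer mid-update:

```c
/* Sketch of the "data" and "flag" in the SAME cache line variant.
 * The pause in the spin loop reduces how often the consumer's polling
 * pulls the line away from the producer before it finishes its updates.
 * Build: gcc -O2 -pthread same_line.c */
#include <stdalign.h>
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>
#include <immintrin.h>                   /* _mm_pause() */

struct msg { long data; atomic_int flag; };
static alignas(64) struct msg m = { 0, 0 };   /* both fields in one 64-byte line */

static void *producer(void *arg) {
    (void)arg;
    m.data = 42;                                              /* same line as the flag */
    atomic_store_explicit(&m.flag, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&m.flag, memory_order_acquire) == 0)
        _mm_pause();                     /* back off between polls */
    printf("consumer saw data = %ld\n", m.data);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```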
Excellent !!
Your discussion provides critical information for use in developing high(er)-throughput queue systems. And it indicates that a simplified coding of a queue will likely yield a lower-performing function.
Thanks for the information.
Jim Dempsey
@"Dr. Bandwidth"'s excellent #4 response above - answers my question at:
https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/509417#com...
So am I to believe that the cache invalidation strategy described in Comment #4 above is the only way for a program to explicitly and atomically transfer a 64-bit value from one physical (or logical/hyperthreaded) core to another without going out to RAM and back, on a Haswell-MB 4910 chip that does not have QPI?
I believe I am correct to assume my mobile Haswell chip does not have a QPI ring bus?
If I am wrong, is there no way for code to explicitly use the QPI ring bus (beyond arranging that a consumer core is trying to access an address that has just been written by a producer core to the L2/L3 caches, not flushed, and the consumer is prefetching/reading that address)?
The Core i7-4910MQ does have an internal ring that connects the 4 cores with each other and with the 4 slices of the L3. See, for example, slide 7 of Intel's presentation at https://www.hotchips.org/wp-content/uploads/hc_archives/hc25/HC25.80-Processors2-epub/HC25.27.820-Ha...
The sequence of events for a two-cache-line producer/consumer setup is similar to what I outlined above, except that there is no case where you might need to send invalidate commands to another chip....
Hi Dr. Bandwidth,
Thank you for these extremely detailed explanations. I was exploring this area recently, and I found them very informative. I'm very confused about Intel's underlying cache coherence implementation; I find many mismatches among online resources.
For the above case, I still have a couple of doubts:
In general, I also have a couple of doubts about the coherence transaction flow:
The reason I'm asking is that I found the following descriptions in (https://software.intel.com/en-us/download/intel-xeon-processor-e5-and-e7-v3-family-uncore-performance-monitoring-reference-manual) -- especially the part regarding the Cbo -- and they don't seem to be very clear:
I've also read documents like the QPI spec and the MESIF paper. The protocols just seem to differ.
Thank you for your time.
The interaction between the L3 slices and the Home Agents varies by processor family. For processors up to Broadwell, a chip typically had one or two home agents, while Skylake Xeon distributes the Home Agent's responsibility to the "CHA" units -- one of which is co-located with each of the L3 slices. On SKX/CLX, the L3 is no longer inclusive, but the CHA includes a "Snoop Filter" that contains the location information that the L3 directory would have provided if the L3 were inclusive. SKX also enables "memory directories" by default, which can be used to filter remote snoops or otherwise change the timing properties of the local and remote portions of the coherence transaction.
If the memory belongs to a different socket, the entire flow is different. On the initial cache miss, the read request (or RFO request) is not sent to the CHA+L3, but is sent instead to the QPI (or UPI) interface to be sent to the address's "home" socket. At the "home" socket, the read request is sent to the L3 slice that owns the address for processing, etc.
Address to L3 (or CHA+L3) hashing is usually per-socket. (In some processors a socket can be split into multiple NUMA nodes that can be hashed independently, but address hashing is never global.)
For local memory, the "traditional" approach is to send a snoop request to the other socket(s) in parallel with the local L3 lookup. This provides the most overlap and the lowest latency. Unfortunately as processors have gotten faster, the snoop traffic can use a large fraction of the chip-to-chip interconnect bandwidth. So SKX enables "memory directories" by default. If the UPI links are busy, the processor can wait to get the cache line from local memory before deciding on the remote snoop. If the directory bit says that no other chip has permission to have a dirty copy of the line, then the data from memory is valid, and (for reads) no remote snoop is required. This reduces the link utilization at the cost of higher latency for loads that do need to snoop the other socket.
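A purely conceptual sketch of that decision (invented structure and function names; the real hardware state machine and policies are not documented at this level) is:

```c
/* Conceptual model of the "memory directory" decision at the home socket
 * for a read that missed in the local L3.  The directory bit stored with
 * the line in DRAM says whether any OTHER socket could hold a dirty copy.
 * (Invented names; not a real hardware interface.) */
#include <stdbool.h>
#include <stdio.h>

struct mem_line {
    unsigned char data[64];
    bool remote_may_be_dirty;   /* the "memory directory" bit (conceptual) */
};

/* Returns true if a snoop of the other socket(s) is still required. */
static bool read_needs_remote_snoop(const struct mem_line *line, bool upi_links_busy)
{
    if (upi_links_busy) {
        /* Wait for the DRAM read, then consult the directory bit: if no
         * other socket can hold a dirty copy, the memory data is valid. */
        return line->remote_may_be_dirty;
    }
    /* Links lightly loaded: snoop in parallel with the DRAM read
     * ("traditional" flow) for the lowest latency. */
    return true;
}

int main(void)
{
    struct mem_line clean = { {0}, false };
    printf("busy links, clean directory bit -> remote snoop needed: %d\n",
           read_needs_remote_snoop(&clean, true));
    printf("idle links                      -> remote snoop needed: %d\n",
           read_needs_remote_snoop(&clean, false));
    return 0;
}
```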
It is certainly true that the documentation is very vague.... The most interesting "documentation" is implicit in the uncore performance counter events -- particularly the opcode matching functions.
Hi Dr. Bandwidth,
Thank you for your reply.
McCalpin, John (Blackbelt) wrote:If the memory belongs to a different socket, the entire flow is different. On the initial cache miss, the read request (or RFO request) is not sent to the CHA+L3, but is sent instead to the QPI (or UPI) interface to be sent to the address's "home" socket. At the "home" socket, the read request is sent to the L3 slice that owns the address for processing, etc.
Address to L3 (or CHA+L3) hashing is usually per-socket. (In some processors a socket can be split into multiple NUMA nodes that can be hashed independently, but address hashing is never global.)
I'm a little confused about the above two paragraphs. The first paragraph seems to indicate that an initial L2 miss is able to send the read request (or RFO) directly to QPI without sending the request to the local socket's L3 slice. That implies the address-to-L3 hashing is global. For example, the physical address range [0-256MB] would map only to an L3 slice on socket 1, and not to any slice on socket 0. But your second paragraph says the hashing is per-socket. Which one is correct?
McCalpin, John (Blackbelt) wrote:For local memory, the "traditional" approach is to send a snoop request to the other socket(s) in parallel with the local L3 lookup. This provides the most overlap and the lowest latency. Unfortunately as processors have gotten faster, the snoop traffic can use a large fraction of the chip-to-chip interconnect bandwidth.
Are you referring to the "Source Snoop" model here? Can we assume "source snoop" has the following general steps: 1) the L2 cache controller broadcasts snoop messages to all other cores' L2 cache controllers, and also sends snoops to the local L3 and to QPI (for other sockets); 2) the other parties reply to the snoop, sending the reply to either the L3 slice or the home agent (which one is unclear)?
McCalpin, John (Blackbelt) wrote:It is certainly true that the documentation is very vague.... The most interesting "documentation" is implicit in the uncore performance counter events -- particularly the opcode matching functions
Indeed. I've probed around the PMU manuals and found a lot of related events/opcodes, etc., but it only makes me more confused. For models before Skylake Scalable, the L3 caching agent and the home agent seem to provide similar functionality, in that both of them have a directory (is this correct?), which means both of them can serialize transactions when there are conflicts.
I'm leaning more towards believing the following: in the home snoop model, the L2 sends requests to the L3 caching agent (either on the local socket or on a remote socket via QPI). Since the L3 caching agent can have a directory, it is able to send the minimum snoop traffic, while the home agent holds the final decision on whether a coherence transaction is complete or not. That means the home agent will send the final ACK to the requesting core. Do you think this is correct?
Thank you.
Agents within the processor that need to route address-related requests all have a copy of the "Source Address Decoder" (SAD) registers. These allow the agent to determine the NUMA node responsible for any physical address. So on an L2 miss, the interface to the ring uses the SAD registers to decide how to process the request:
- If the address is local, it is hashed to determine which L3 (or L3+CHA) needs to handle the request.
- If the SAD registers indicate that the address belongs to another NUMA node, the request is sent to the appropriate QPI/UPI interface, where it is transported to the "home" node for the address.
The "Target Address Decoder" (TAD) registers are used are used within each NUMA node to determine the mapping of physical addresses to memory controllers and memory channels.
There is a nice report at https://publikationen.bibliothek.kit.edu/1000073678 that provides an overview of the address translation process on Xeon processors.
The "snoop modes" that are available on some processors (notably Haswell/Broadwell) change some of the transactions.
Intel processors don't broadcast snoops to the other L2 caches on L2 misses -- there is nowhere near enough L2 tag bandwidth to handle all these requests. This is where the address hash is critical! Each physical address is owned by exactly one L3 slice (and one Home Agent), so on an L2 miss (to a local address), the request is sent to the L3 slice that owns the address. For processors with inclusive L3 caches, that L3 slice is guaranteed to have tracking information for the cache line address if any local core has a copy of the line in its private L1 or L2 caches. For processors without inclusive L3 caches (Skylake Xeon, Cascade Lake Xeon), there is a "snoop filter" in each CHA that is guaranteed to have the same tracking information for all lines in (local) L1 and/or L2 caches. For dirty lines, the L3 (or snoop filter) sends an intervention request to the core that has permission to have a dirty copy of the line.
For widely shared cache lines there are several options for processing. Some systems track all sharers and send a separate invalidate message to each. Some systems send a full broadcast invalidation. It is also possible to send a broadcast invalidate with a bit mask of processor numbers who need to process the invalidation (so if a processor receives such an invalidation it checks its bit in the mask and only has to snoop its cache tags if its bit is set). It is also possible to have a hybrid system that changes behavior depending on system load and/or energy efficiency requirements. It is also possible to combine such a system with other history+predictor mechanisms to handle coherence differently for certain classes of traffic. (One example is producer-consumer traffic -- when a Load hits a line in M state in another cache, most protocols return the data in Shared state and downgrade the data to Shared state. For producer/consumer "ping-pong" patterns, it is more efficient to transfer the cache line in M state (invalidating it at the previous home), so that the "consumer" can immediately update it.)
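A conceptual sketch of the bit-mask variant (invented function names, nothing hardware-specific):

```c
/* Conceptual sketch of a broadcast invalidation that carries a sharer bit
 * mask: each core checks its own bit and only spends tag-lookup bandwidth
 * if that bit is set. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Does this core need to look up its cache tags for the invalidation? */
static bool must_snoop(uint64_t sharer_mask, int my_core_id)
{
    return (sharer_mask & (1ULL << my_core_id)) != 0;
}

int main(void)
{
    uint64_t mask = (1ULL << 3) | (1ULL << 17);   /* cores 3 and 17 may share the line */
    for (int core = 0; core < 24; core++) {
        if (must_snoop(mask, core))
            printf("core %d snoops its tags and invalidates\n", core);
        /* all other cores drop the message without touching their tags */
    }
    return 0;
}
```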
As you note, there are also multiple possible paths for the return traffic. Maintaining consistency is easier if the data is sent to the L3 first, and then to the requesting core, but it is also possible to send to both at the same time (e.g., "Direct2Core"). In recent processors, these return paths are chosen dynamically based on undocumented states and settings of the processor.
McCalpin, John (Blackbelt) wrote:Agents within the processor that need to route address-related requests all have a copy of the "Source Address Decoder" (SAD) registers. These allow the agent to determine the NUMA node responsible for any physical address. So on an L2 miss, the interface to the ring uses the SAD registers to decide how to process the request:
- If the address is local, it is hashed to determine which L3 (or L3+CHA) needs to handle the request.
- If the SAD registers indicate that the address belongs to another NUMA node, the request is sent to the appropriate QPI/UPI interface, where it is transported to the "home" node for the address.
Got it. This further explains why, even when data is cached in the L3, using the local socket's memory is faster than using the remote socket's: the latter requires all coherence traffic to go through QPI/UPI.
McCalpin, John (Blackbelt) wrote:The "snoop modes" that are available on some processors (notably Haswell/Broadwell) change some of the transactions.
- In "Early Snoop" mode, an L3 miss from a core in socket 0 to an address in socket 1 will send a "snoop" request from socket 0 to socket 1.
- In "Home Snoop" mode, an L3 miss from a core in socket 0 to an address in socket 1 will send a "read request" from socket 0 to socket 1 -- then socket 1 will generate any required snoops.
- In a 2-socket system, no additional snoops are required -- the request already missed in socket 0, and in socket 1 the read request will be processed by the L3/CHA slice that owns the address.
- In a four-socket system, socket 1 would be responsible for sending snoops to sockets 2 and 3 (unless a directory indicated that these snoops were not necessary).
- Many variations are possible in both the flow and the timing. Some of the details can be reverse-engineered using the uncore performance counters, but it is important to note that Intel's recent processors have dynamically adaptive behavior. The transaction sequence that you see in a microbenchmark may change for processors with different core counts, for the same processor with higher or lower loading on the ring -- or potentially even for different settings of the energy-performance bias....
Clear enough. No matter which model we are using, the L2 won't be the initiator of any coherence traffic. An L2 miss will go to its corresponding L3 first, which will act on the L2's behalf. Knowing that the L3 is always the initiator makes things much clearer.
According to the uncore PMU manual, "the R3QPI agent implements a latency-reducing optimization for dual sockets which issues snoops within the socket for incoming requests as well as a latency reducing optimization to return data satisfying Direct2Core (D2C) requests." Because it will issue snoops for incoming requests, I guess it is only used in "Home Snoop" mode, right?
McCalpin, John (Blackbelt) wrote:It is also possible to send a broadcast invalidate with a bit mask of processor numbers who need to process the invalidation (so if a processor receives such an invalidation it checks its bit in the mask and only has to snoop its cache tags if its bit is set). It is also possible to have a hybrid system that changes behavior depending on system load and/or energy efficiency requirements. It is also possible to combine such a system with other history+predictor mechanisms to handle coherence differently for certain classes of traffic.
Reminds me of: 1) Multicast snoop: https://ieeexplore.ieee.org/document/765959, 2) Destination-set prediction, https://www.cis.upenn.edu/~milom/papers/isca03_destination_set_prediction.pdf
McCalpin, John (Blackbelt) wrote:(One example is producer-consumer traffic -- when a Load hits a line in M state in another cache, most protocols return the data in Shared state and downgrade the data to Shared state. For producer/consumer "ping-pong" patterns, it is more efficient to transfer the cache line in M state (invalidating it at the previous home), so that the "consumer" can immediately update it.)
This one is really interesting!! The latter seems like a specific optimization for the case where a line will be modified by both the producer and consumer cores. If only one party will modify the line, returning the cache line to both the L3 and the requesting core in parallel would be a good-enough optimization.
Thank you for spending time explaining, Dr. Bandwidth. It really helps!
According to the uncore PMU manual, "the R3QPI agent implements a latency-reducing optimization for dual sockets which issues snoops within the socket for incoming requests as well as a latency reducing optimization to return data satisfying Direct2Core (D2C) requests." Because it will issue snoops for incoming requests, I guess it is only used in "Home Snoop" mode, right?
SKX only supports "Home Snoop" mode, but includes dynamically adaptive behavior to get most of the latency reduction benefit of "Early Snoop" mode as long as UPI utilization remains low.
I just noticed that the Xeon Scalable Memory Family Uncore Performance Monitoring Reference Manual (336274-001) Section 2.2.10 includes CHA events that distinguish between "directed" and "broadcast" snoops (and between local and remote sources of these snoops). For cache lines in E or M state, only one core (L1+L2) can have the cache line, so a "directed" snoop makes the most sense -- the other caches should not waste tag bandwidth on the lookup. For lines in S state, any number of private caches can have a copy of the line, so a broadcast (on-chip) makes sense. (For coherence over a small number of NUMA nodes (2-4), global broadcasting of snoops on shared lines may be tolerable, but for large NUMA node counts (32++) it is beneficial to track the nodes that (might) have a copy of a shared line, and restrict the snoops to those nodes. The snoop may (or may not) be broadcast *within* each NUMA node, but is typically not broadcast across *all* NUMA nodes.)