I have some questions related to DRAM management and the IMC:
Can someone describe (or point me to documentation about) what exactly happens when a memory access to a remote NUMA node occurs?
Secondly, is the refresh rate of DRAM handled by the IMC of the corresponding CPU?
What happens when I have a machine with 2 NUMA nodes and one CPU (I know this is strange)? For example, say I boot my Linux box forcing it to see only the cores of the first CPU. In this scenario, how do the answers to the above two questions change (if they change at all)?
Thanks a lot for your time, in advance.
As far as I can tell, the very low-level details have never been published, but the overall flow is straightforward.
Recent Intel processors can operate in several different snooping modes, which change the transactions involved in remote accesses. The end result is the same, but the transactions are different (and therefore show up in different performance counter events), the ordering of the transactions is different, and the latency and throughput are different.
For Ivy Bridge or Haswell 2-socket systems operating in "Home Snoop" mode, the procedure for a load from an address that is mapped to a remote NUMA node is (approximately):
- Look for the data in your own L1 and L2 caches.
- Put a load request on the ring in your chip to look for the data in the L3 slice that is responsible for that physical address.
- Note that the hash that determines which L3 slice is responsible for a physical address is not published.
- The L3 cache indicates a "miss".
- Some unit on the ring looks up the address in a table to figure out which NUMA node owns the address, and determines which QPI port should be used to send the request to the Home Node of that memory location.
- A Read transaction is sent on the QPI link to the Home Node for the requested memory location.
- This may require multiple "hops", routing through intermediate nodes or switches on the way to the destination.
- The Read transaction arrives at the Home node and is placed on the Home node's ring to make its way to the Home Agent in that chip that owns the address being requested.
- The Home Agent on the Home node sends out a snoop request to the slice of the local L3 cache that is responsible for the requested address, and (probably concurrently) sends out a snoop request to every other NUMA node in the system (except the one that sent the Read request, since the existence of the Read request means that the data was not in the L3 cache there).
- In a 2-node system, this means that there will be no snoop requests on the QPI interface, since there are no other NUMA nodes -- just the Home and the Requester.
- In most cases, the Home Agent on the Home node will also request the cache line from the Memory Controller (iMC) that owns the cache line -- in parallel with the local (and possibly remote) L3 snoop requests.
- There are lots of implementation choices available here for exactly what goes in parallel and how or whether speculative transactions that have not yet executed (e.g., due to queuing delays in the iMC) are cancelled.
- The L3 cache slice on the Home node indicates a "miss", as do any remote NUMA nodes.
- The Read response from the DRAM (including the data) is sent to the Requesting node.
- Some protocols combine the Snoop Response with the Read Response, so that the Requesting node gets only one response. That response will include the data and an indication that the data is valid.
- Some protocols send the Read Response from DRAM to the Requester without waiting for the Snoop Response. The Snoop Response(s) are sent to the Requester separately (one for each NUMA node outside the Requesting node). The Requester must wait for all Snoop Responses to indicate "miss" or "clean" before using the data.
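The sequence above can be sketched as a toy model. This is only an illustration of the ordering of the steps, not of the real hardware: the `Node` class, the address ranges, and the step strings are all made up for the example, and the lookup in `find_home` merely stands in for the unpublished address-to-node table.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy NUMA node: which addresses it homes and what its caches hold."""
    name: str
    home_range: range                         # addresses backed by this node's DRAM
    l1_l2: set = field(default_factory=set)   # lines in the cores' private caches
    l3: set = field(default_factory=set)      # lines in any L3 slice


def find_home(addr, nodes):
    """Stand-in for the table that maps a physical address to its Home node."""
    for node in nodes:
        if addr in node.home_range:
            return node
    raise ValueError(f"address {addr:#x} is not mapped to any node")


def load(addr, requester, nodes):
    """Return the ordered steps for a load issued by a core on `requester`."""
    if addr in requester.l1_l2:
        return ["hit in requester L1/L2"]
    if addr in requester.l3:
        return ["miss in requester L1/L2", "hit in requester L3"]
    steps = ["miss in requester L1/L2", "miss in requester L3"]

    home = find_home(addr, nodes)
    if home is requester:
        return steps + ["local DRAM read via local Home Agent and iMC"]

    steps.append(f"Read sent over QPI to Home node {home.name}")
    steps.append(f"Home Agent snoops L3 on {home.name} and asks the iMC for the line")
    # Every node other than Home and Requester gets a snoop over QPI.
    for other in nodes:
        if other is not home and other is not requester:
            steps.append(f"Home Agent snoops node {other.name} over QPI")

    cached_elsewhere = any(
        addr in n.l3 or addr in n.l1_l2 for n in nodes if n is not requester
    )
    if cached_elsewhere:
        # This outcome is not walked through in the text above.
        steps.append("a snoop hits: data comes from that cache")
    else:
        steps.append("all snoops miss: DRAM data returned to requester")
    return steps


if __name__ == "__main__":
    n0 = Node("N0", range(0x0000, 0x1000))
    n1 = Node("N1", range(0x1000, 0x2000))
    for step in load(0x1800, n0, [n0, n1]):
        print(step)
```

With only two nodes the loop over "other" nodes emits nothing, which matches the remark that a 2-node system sends no snoop requests over QPI.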
The second question is much easier. DRAM refresh is handled by the DRAM controller attached to the DRAM channel. In Intel platforms the DRAM channel controllers are considered part of the iMC. (Each iMC controls 2, 3, or 4 DRAM channels, depending on the platform and the DRAM configuration. This is described in the "Uncore Performance Monitoring Guide" for each platform.)
The third question probably depends on exactly which platform you are using and what mechanism was used to disable the processors on one NUMA node. Assuming that these are ordinary Intel processors, the behavior should follow the outline above. If you do not disable "C1E state", the performance for the remote memory access is likely to be low. In "C1E state", if none of the cores on a chip are being used, the cores are all clocked down to the minimum speed. Most systems run the "uncore" at a speed no faster than the fastest core, so it will also be at the minimum speed. Since the "uncore" contains the ring, the L3, the Home Agent, and the iMCs, this can significantly increase latency and reduce bandwidth. For Haswell (Xeon E5 v3) you can set the uncore frequency to "maximum", which decouples it from the core frequencies. In this mode I have not seen the same slowdown due to C1E state that I saw on earlier (Sandy Bridge) processors.
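Before measuring anything, it may help to check what the kernel actually sees after such a boot. The sketch below assumes a Linux system that exposes the usual sysfs NUMA directory (`/sys/devices/system/node/nodeN/cpulist`); the function names are made up for the example. A node that still has memory but no usable cores would be expected to show up with an empty CPU list.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8,10-11' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means the node has no CPUs
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def node_cpus(base="/sys/devices/system/node"):
    """Map NUMA node number -> CPUs the kernel associates with that node."""
    result = {}
    for path in glob.glob(os.path.join(base, "node[0-9]*")):
        node = int(re.search(r"node(\d+)$", path).group(1))
        with open(os.path.join(path, "cpulist")) as f:
            result[node] = parse_cpulist(f.read())
    return result


if __name__ == "__main__":
    for node, cpus in sorted(node_cpus().items()):
        print(f"node {node}: {len(cpus)} CPUs {cpus}")
```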
Thanks for all the great info. There are a few cases that still confuse me. I'm going to use SKX CHA/SF/LLC terminology, but I believe the same logic applies to the HA/LLC of HSW except where noted.
(1) Parallel main memory accesses and snoops
I'm confused by the following...
In most cases, the Home Agent on the Home node will also request the cache line from the Memory Controller (iMC) that owns the cache line -- in parallel with the local (and possibly remote) L3 snoop requests.
I understand why the memory access would occur in parallel with a *remote* snoop, assuming sufficient QPI/UPI bandwidth is available. However, why would a memory access ever occur in parallel with a *local* snoop? I would think that the CHA would only initiate a memory access after/if the request misses in the local SF/LLC, regardless of whether the request stems from a local or remote core.
(2) HitME cache and memory accesses (assuming SKX coherence mechanisms, e.g., home snoop w/ memory directory)
Does a hit in the HitME cache *prevent* the memory directory read (and just result in a remote snoop) or does the directory read still occur (with the hit just resulting in a "quicker" snoop)? I'm not sure how entries in the HitME cache are invalidated/kept coherent, so if the remote snoop isn't guaranteed to return the data, I could see the case for still accessing main memory.
On that note, if anyone has any info on the size/distribution of the HitME cache(s) given the distributed CHA network on SKX, that would be much appreciated...
(3) L2 miss, address maps to CHA on a remote socket
Text from this thread indicates that a miss is only sent across the QPI link (to the remote CHA) *after* the local SF/LLC for the address has also confirmed a miss. This makes sense to me, as otherwise, the remote CHA would potentially have to send a snoop right back to the local socket. However, text from a more recent thread...
If the address is local, it is hashed to determine which L3 (or L3+CHA) needs to handle the request... If the SAD registers indicate that the address belongs to another NUMA node, the request is sent to the appropriate QPI/UPI interface, where it is transported to the "home" node for the address.
...seems to indicate that the miss goes to the remote CHA *before* checking the local SF/LLC slice. Am I misunderstanding?
I'm confused by the following...
In most cases, the Home Agent on the Home node will also request the cache line from the Memory Controller (iMC) that owns the cache line -- in parallel with the local (and possibly remote) L3 snoop requests.
I understand why the memory access would occur in parallel with a *remote* snoop, assuming sufficient QPI/UPI bandwidth is available. However, why would a memory access ever occur in parallel with a *local* snoop? I would think that the CHA would only initiate a memory access after/if the request misses in the local SF/LLC, regardless of whether the request stems from a local or remote core.
My original comments were in the context of Haswell, for which a Home Agent is associated with each Memory Controller (rather than being co-located with the LLC slices, as in SKX).
I would guess that recent mainstream Intel processors don't speculatively begin setting up for a memory access before the LLC is probed, but this choice depends on the size of the LLC, the LLC latency, the available memory bandwidth, the protocol support for cancelling a speculative memory request, etc. If the LLC tags are off-chip, for example, waiting to confirm an LLC miss could result in excessive latency to memory. Speculative overlap of memory access could be implemented at several different "levels", such as:
- The CHA might speculatively reserve an outbound (to IMC) buffer before knowing whether it will be needed.
- This mechanism could be dynamic in several ways, e.g., using a history-based predictor of the likelihood of a miss, and/or behaving more aggressively at low levels of buffer utilization.
- The CHA might speculatively send the memory controller a "hint" that a request for this address is likely to arrive soon.
- This would help the memory controller decide which open pages to close and which to keep open.
- An aggressive memory controller might use such hints as inputs to its own autonomous prefetcher -- speculatively reading neighboring cache lines into its own read buffers while the page is open. (This is similar to "region-based prefetching" in the literature.)
- This is in some ways analogous to the "LLC replacement hints" that Intel has used in the past to help prevent the inclusive LLC from evicting lines that are actively in use in the L1 and/or L2 caches.
- The prioritization of such speculative reads can be counterintuitive -- an LLC miss in the middle of a stream of LLC hits may result in a significant (potentially avoidable) stall because the prefetchers may be predicting a hit and not generating a prefetch early enough. This can be a relatively common occurrence with Snoop Filter conflicts, for example.
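To make the "history-based predictor" idea concrete, here is a minimal sketch using a saturating counter. It is purely hypothetical: nothing here is claimed to match what any Intel CHA does, and the class name, the counter width, and the `low_util` threshold are assumptions chosen for the example.

```python
class MissPredictor:
    """Saturating counter that tracks whether recent LLC lookups missed."""

    def __init__(self, bits=2):
        self.max = (1 << bits) - 1
        self.counter = self.max // 2  # start out weakly predicting "hit"

    def predict_miss(self):
        return self.counter > self.max // 2

    def update(self, was_miss):
        """Move toward 'miss' after a miss and toward 'hit' after a hit."""
        if was_miss:
            self.counter = min(self.max, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


def should_reserve_buffer(predictor, buffers_in_use, buffers_total, low_util=0.5):
    """Reserve speculatively if a miss is predicted or buffers are mostly idle."""
    mostly_idle = buffers_in_use / buffers_total < low_util
    return predictor.predict_miss() or mostly_idle


if __name__ == "__main__":
    p = MissPredictor()
    for was_miss in [False, False, False, True]:
        print("predict miss:", p.predict_miss(), "| actual miss:", was_miss)
        p.update(was_miss)
```

Running the example shows the weakness mentioned in the last bullet: after a run of hits, the isolated miss is predicted as a hit, so no speculative work is started for it.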
For SKX/CLX, getting the local DRAM transaction started as early as possible is important because of the use of "memory directories" in all two-socket and larger configurations. One or more "memory directory" bits are hidden in the ECC bits in DRAM to indicate whether it is possible for an address to have a dirty copy in another socket.
Under low UPI utilization, the snoop can be sent to the other socket in parallel with the local DRAM request.
- If the memory directory bit says that the line cannot be dirty elsewhere, then there is no need to wait for the snoop response, and the value from local memory can be used.
- If the memory directory bit says the line might be dirty elsewhere, the line cannot be used until the snoop response has been returned.
Under high UPI utilization, the snoop can be deferred until after the local memory access is complete and the memory directory can be examined.
- If the memory directory bit says that the line cannot be dirty elsewhere, then there is no need to ever send the snoop request.
- This allows more of the UPI bandwidth to be used for data transfers (rather than snoops and responses).
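The two regimes can be summarized as a small decision function. This is a sketch of the reasoning only; the function name and step strings are invented for the example. The branch for a busy UPI link with a "might be dirty" directory bit is an inference rather than something spelled out in the text: the snoop presumably has to be sent at that point and its response awaited.

```python
def plan_read(upi_busy, may_be_dirty_remote):
    """Return the ordered steps for reading a locally-homed line."""
    steps = []
    if not upi_busy:
        # Low UPI utilization: overlap the snoop with the DRAM access.
        steps.append("send snoop in parallel with DRAM read")
        steps.append("read DRAM and directory bit")
        if may_be_dirty_remote:
            steps.append("wait for snoop response")
        else:
            steps.append("use DRAM data without waiting for snoop response")
    else:
        # High UPI utilization: look at the directory bit before snooping.
        steps.append("read DRAM and directory bit")
        if may_be_dirty_remote:
            # Inferred case, not stated explicitly in the text.
            steps.append("send snoop")
            steps.append("wait for snoop response")
        else:
            steps.append("use DRAM data, never send snoop")
    return steps


if __name__ == "__main__":
    for busy in (False, True):
        for dirty in (False, True):
            print(f"upi_busy={busy} may_be_dirty={dirty}: {plan_read(busy, dirty)}")
```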
For L2 misses to remotely-homed addresses, there are again several possible orderings of the transactions that could be chosen based on how busy the caches and UPI links happen to be.
Thanks a lot for the thorough reply.
We work on a Sandy Bridge machine (Xeon E5-2600). I don't know if that changes the behavior you described. Also, we disable CPU1 using the nr_cpus boot parameter on the Linux kernel command line.
In this scenario, if we want to change the refresh rate on NUMA node 1, do we still access the IMC of CPU1 (even though the cores there are not used by Linux)?
Thanks a lot,
The Sandy Bridge generation uses a slightly different coherence protocol, but the overall behavior is similar.
I am not sure how configuration accesses are going to work if all of the cores on socket 1 are disabled. PCI configuration space is probably OK -- it is accessed by physical address, so it is probably possible to access those locations using a CPU core on socket 0. MSR configuration space can only be accessed by cores running in the same socket, so if the kernel does not know about any cores on socket 1, then it might not be able to access MSRs in socket 1. On the other hand, it might know about the cores, but know that it is not supposed to use them for scheduling ordinary tasks. In this case it might still be able to use them for MSR accesses.
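As an illustration of why MSR access depends on the kernel knowing about a core: on Linux, the `msr` driver exposes one device file per logical CPU (`/dev/cpu/N/msr`), and a read at file offset R returns the 8-byte value of MSR R as seen by that CPU. The sketch below assumes that driver is loaded and that the script runs with sufficient privileges; `read_msr` and the `path_template` parameter are names made up for the example. If the kernel has no logical CPU on socket 1, there is no device file to open for it.

```python
import os
import struct


def read_msr(cpu, reg, path_template="/dev/cpu/{cpu}/msr"):
    """Read the 64-bit MSR `reg` as seen by logical CPU `cpu`."""
    fd = os.open(path_template.format(cpu=cpu), os.O_RDONLY)
    try:
        # The MSR number is used as the file offset; each MSR is 8 bytes.
        data = os.pread(fd, 8, reg)
    finally:
        os.close(fd)
    if len(data) != 8:
        raise OSError(f"short read from MSR {reg:#x} on CPU {cpu}")
    return struct.unpack("<Q", data)[0]


if __name__ == "__main__":
    try:
        # CPU 0 and register 0x10 are example values only.
        print(hex(read_msr(0, 0x10)))
    except OSError as exc:
        print("MSR read failed:", exc)
```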