Hi all,
I have some questions related to DRAM management and the IMC:
Can someone describe (or point me to documentation) about what exactly happens when a memory access to a remote NUMA node happens?
Secondly, is the refresh rate of DRAM handled by the IMC of the corresponding CPU?
What happens when I have a machine with 2 NUMA nodes but only one CPU in use (I know this is strange)? For example, let's say I boot my Linux box forcing it to see only the cores of the first CPU. In this scenario, how do the answers to the above two questions change (if they change at all)?
Thanks a lot for your time, in advance.
Cheers,
Babis
As far as I can tell, the very low-level details have never been published, but the overall flow is straightforward.
Recent Intel processors can operate in several different snooping modes, which changes the transactions involved in remote accesses. Obviously the same end result is obtained, but the transactions are different (and therefore show up in different performance counter events), the ordering of the transactions is different, and the latency and throughput are different.
For Ivy Bridge or Haswell 2-socket systems operating in "Home Snoop" mode, the procedure for a load from an address that is mapped to a remote NUMA node is (approximately):
1. The load misses in the core's L1 and L2 caches, and the request is sent to the local L3 slice responsible for the address.
2. The local L3 also misses, and the SAD (Source Address Decoder) indicates that the address is homed on the other socket, so the request is forwarded over QPI to the Home Agent on the home node.
3. The Home Agent snoops the L3 on the home node (and any other caches that may hold the line). In most cases, the Home Agent on the Home node will also request the cache line from the Memory Controller (iMC) that owns the cache line -- in parallel with the local (and possibly remote) L3 snoop requests.
4. The data is returned to the requesting core (from DRAM, or from whichever cache held a modified copy), and the coherence state is updated.
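If you want to see the latency cost of that remote path in practice, here is a minimal sketch (my own, not from the documentation) that pointer-chases a buffer allocated on the local vs. the remote node with libnuma. The buffer size, hop count, and the assumption that node 0 is local to the pinned core are arbitrary choices for illustration:

```c
/* Sketch: compare average load latency to memory homed on the local vs.
 * the remote NUMA node by pointer-chasing a libnuma-allocated buffer.
 * Build with:  gcc -O2 numa_latency.c -lnuma   (assumes a 2-node system) */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_SIZE (64UL * 1024 * 1024)  /* 64 MiB: much larger than the LLC */
#define STRIDE   64                    /* one cache line per element       */
#define HOPS     (20 * 1000 * 1000L)

static void * volatile sink;           /* keeps the chase loop from being optimized away */

static double chase_ns(int node)
{
    size_t n = BUF_SIZE / STRIDE;
    char *buf = numa_alloc_onnode(BUF_SIZE, node);   /* pages backed on 'node' */
    if (!buf) { perror("numa_alloc_onnode"); exit(1); }

    /* Link the cache lines into one random cycle so hardware prefetchers
     * cannot hide the memory latency.                                     */
    size_t *perm = malloc(n * sizeof *perm);
    for (size_t i = 0; i < n; i++) perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        *(void **)(buf + perm[i] * STRIDE) = buf + perm[(i + 1) % n] * STRIDE;
    free(perm);

    struct timespec t0, t1;
    void **p = (void **)buf;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < HOPS; i++)
        p = *p;                        /* each load depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    sink = p;

    numa_free(buf, BUF_SIZE);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / HOPS;
}

int main(void)
{
    if (numa_available() < 0) { fprintf(stderr, "libnuma: no NUMA support\n"); return 1; }
    numa_run_on_node(0);               /* pin to node 0, so node 0 is local and node 1 is remote */
    printf("local  (node 0): %.1f ns/load\n", chase_ns(0));
    printf("remote (node 1): %.1f ns/load\n", chase_ns(1));
    return 0;
}
```

On a typical 2-socket box the remote number should come out noticeably higher than the local one; the exact gap depends on the snoop mode and uncore frequency discussed above.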
The second question is much easier. DRAM refresh is handled by the DRAM controller attached to the DRAM channel. In Intel platforms the DRAM channel controllers are considered part of the iMC. (Each iMC controls 2, 3, or 4 DRAM channels, depending on the platform and the DRAM configuration. This is described in the "Uncore Performance Monitoring Guide" for each platform.)
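To make "refresh rate" concrete, here is a back-of-the-envelope sketch using the usual JEDEC DDR3/DDR4 baseline figures (64 ms retention window, 8192 refresh commands per window, doubled refresh rate above 85 C); the controller's actual settings come from the DIMM's SPD and the iMC configuration, so treat these numbers as typical rather than definitive:

```c
/* Illustrative only: derive the average refresh command interval (tREFI)
 * from the JEDEC baseline parameters assumed above.                      */
#include <stdio.h>

int main(void)
{
    const double retention_ms = 64.0;   /* all rows refreshed within 64 ms   */
    const int    refreshes    = 8192;   /* REF commands per retention window */

    double trefi_us = retention_ms * 1000.0 / refreshes;
    printf("tREFI (normal temperature):   %.3f us\n", trefi_us);        /* ~7.813 us */
    printf("tREFI (extended temperature): %.3f us\n", trefi_us / 2.0);  /* ~3.906 us */
    return 0;
}
```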
The third question probably depends on exactly which platform you are using and what mechanism was used to disable the processors on one NUMA node. Assuming that these are ordinary Intel processors, the behavior should follow the outline above. If you do not disable "C1E state", the performance for the remote memory access is likely to be low. In "C1E state", if none of the cores on a chip are being used, the cores are all clocked down to the minimum speed. Most systems run the "uncore" at a speed no faster than the fastest core, so it will also be at the minimum speed. Since the "uncore" contains the ring, the L3, the Home Agent, and the iMCs, this can significantly increase latency and reduce bandwidth. For Haswell (Xeon E5 v3) you can set the uncore frequency to "maximum", which decouples it from the core frequencies. In this mode I have not seen the same slowdown due to C1E state that I saw on earlier (Sandy Bridge) processors.
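If it helps, here is a minimal sketch of checking whether C1E is enabled by reading MSR_POWER_CTL (0x1FC) through Linux's msr driver. On recent Intel parts bit 1 of that MSR is the "C1E Enable" bit, but verify the MSR layout for your specific processor in the Intel SDM before relying on it:

```c
/* Sketch: read MSR_POWER_CTL (0x1FC) on CPU 0 via the Linux msr driver and
 * report the C1E-enable bit. Assumes the msr module is loaded
 * ("modprobe msr") and that the program runs as root.                      */
#include <fcntl.h>
#include <inttypes.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_POWER_CTL 0x1FC

int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    uint64_t val;
    /* The msr driver reads the MSR whose address equals the file offset. */
    if (pread(fd, &val, sizeof(val), MSR_POWER_CTL) != sizeof(val)) {
        perror("pread");
        return 1;
    }
    close(fd);

    printf("MSR_POWER_CTL = 0x%" PRIx64 "\n", val);
    printf("C1E enable bit: %s\n", ((val >> 1) & 1) ? "set" : "clear");
    return 0;
}
```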
Thanks a lot for the thorough reply.
We work on a Sandy Bridge machine (Xeon E5-2600). I don't know if that changes the behavior you described before. Also, we disable CPU 1 using the nr_cpus command-line boot parameter of the Linux kernel.
In the above-mentioned scenario, if we want to change the refresh rate on NUMA node 1, do we still access the IMC of CPU 1 (even if its cores are not used by Linux)?
Thanks a lot,
Babis
The Sandy Bridge generation uses a slightly different coherence protocol, but the overall behavior is similar.
I am not sure how configuration accesses are going to work if all of the cores on socket 1 are disabled. PCI configuration space is probably OK -- it is accessed by physical address, so it should be possible to reach those registers from a core on socket 0. MSR configuration space, on the other hand, can only be accessed by cores in the same socket, so if the kernel does not know about any cores on socket 1, it might not be able to access MSRs on socket 1. Alternatively, the kernel might know about those cores but simply not use them for scheduling ordinary tasks, in which case it might still be able to use them for MSR accesses.
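As a concrete (and hedged) illustration of the difference: PCI configuration space for the uncore devices on either socket can be read from any online core through sysfs, whereas the msr driver needs an online CPU in the target socket before /dev/cpu/<n>/msr even exists. The bus/device/function below is a placeholder -- look up the actual uncore/iMC device addresses for your platform with lspci and the Uncore Performance Monitoring Guide:

```c
/* Sketch: read a 32-bit register from a PCI device's configuration space
 * via sysfs. Because config space is addressed by bus/device/function,
 * this works from a core on either socket. The device path is a
 * PLACEHOLDER; replace it with the real iMC device on your system.        */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *dev = "/sys/bus/pci/devices/0000:ff:10.0/config";  /* placeholder BDF */
    const off_t offset = 0x00;          /* 0x00 = vendor/device ID register */

    int fd = open(dev, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    uint32_t reg;
    if (pread(fd, &reg, sizeof(reg), offset) != sizeof(reg)) {
        perror("pread");
        return 1;
    }
    close(fd);

    printf("%s @ 0x%02lx = 0x%08x\n", dev, (long)offset, reg);
    return 0;
}
```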
That was very useful. Thanks John.
Thanks for all the great info. There are a few cases that still confuse me. I'm going to use SKX CHA/SF/LLC terminology, but I believe the same logic applies to the HA/LLC of HSW except where noted.
(1) Parallel main memory accesses and snoops
I'm confused by the following...
In most cases, the Home Agent on the Home node will also request the cache line from the Memory Controller (iMC) that owns the cache line -- in parallel with the local (and possibly remote) L3 snoop requests.
I understand why the memory access would occur in parallel with a *remote* snoop, assuming sufficient QPI/UPI bandwidth is available. However, why would a memory access ever occur in parallel with a *local* snoop? I would think that the CHA would only initiate a memory access after/if the request misses in the local SF/LLC, regardless of whether the request stems from a local or remote core.
(2) HitME cache and memory accesses (assuming SKX coherence mechanisms, e.g., home snoop w/ memory directory)
Does a hit in the HitME cache *prevent* the memory directory read (and just result in a remote snoop) or does the directory read still occur (with the hit just resulting in a "quicker" snoop)? I'm not sure how entries in the HitME cache are invalidated/kept coherent, so if the remote snoop isn't guaranteed to return the data, I could see the case for still accessing main memory.
On that note, if anyone has any info on the size/distribution of the HitME cache(s) given the distributed CHA network on SKX, that would be much appreciated...
(3) L2 miss, address maps to CHA on a remote socket
Text from this thread indicates that a miss is only sent across the QPI link (to the remote CHA) *after* the local SF/LLC for the address has also confirmed a miss. This makes sense to me, as otherwise, the remote CHA would potentially have to send a snoop right back to the local socket. However, text from a more recent thread...
If the address is local, it is hashed to determine which L3 (or L3+CHA) needs to handle the request... If the SAD registers indicate that the address belongs to another NUMA node, the request is sent to the appropriate QPI/UPI interface, where it is transported to the "home" node for the address.
...seems to indicate that the miss goes to the remote CHA *before* checking the local SF/LLC slice. Am I misunderstanding?
I'm confused by the following...
In most cases, the Home Agent on the Home node will also request the cache line from the Memory Controller (iMC) that owns the cache line -- in parallel with the local (and possibly remote) L3 snoop requests.

I understand why the memory access would occur in parallel with a *remote* snoop, assuming sufficient QPI/UPI bandwidth is available. However, why would a memory access ever occur in parallel with a *local* snoop? I would think that the CHA would only initiate a memory access after/if the request misses in the local SF/LLC, regardless of whether the request stems from a local or remote core.
My original comments were in the context of Haswell, for which a Home Agent is associated with each Memory Controller (rather than being co-located with the LLC slices, as in SKX).
I would guess that recent mainstream Intel processors don't speculatively begin setting up for a memory access before the LLC is probed, but this choice depends on the size of the LLC, the LLC latency, the available memory bandwidth, the protocol support for cancelling a speculative memory request, etc. If the LLC tags are off-chip, for example, waiting to confirm an LLC miss could result in excessive latency to memory. Speculative overlap of the memory access could be implemented at several different "levels", depending on how far the speculative request is allowed to proceed before the LLC miss is confirmed.
For SKX/CLX, getting the local DRAM transaction started as early as possible is important because of the use of "memory directories" in all 2-socket and larger configurations. One or more "memory directory" bits are hidden in the ECC bits in DRAM to indicate whether it is possible for another socket to hold a dirty copy of the cache line.
Under low UPI utilization, the snoop can be sent to the other socket in parallel with the local DRAM request.
Under high UPI utilization, the snoop can be deferred until after the local memory access is complete and the memory directory can be examined.
For L2 misses to remotely-homed addresses, there are again several possible orderings of the transactions that could be chosen based on how busy the caches and UPI links happen to be.