I have some questions related to DRAM management and the IMC:
Can someone describe (or point me to documentation on) what exactly happens when a memory access goes to a remote NUMA node?
Secondly, is the refresh rate of DRAM handled by the IMC of the corresponding CPU?
What happens when I have a machine with 2 NUMA nodes but only one CPU in use (I know this is strange)? E.g., let's say I boot my Linux box forcing it to see only the cores of the first CPU. In this scenario, how do the answers to the above two questions change (if they change at all)?
Thanks a lot in advance for your time.
As far as I can tell, the very low-level details have never been published, but the overall flow is straightforward.
Recent Intel processors can operate in several different snooping modes, and the choice of mode changes the transactions involved in remote accesses. Obviously the same end result is obtained, but the transactions are different (and therefore show up in different performance counter events), the ordering of the transactions is different, and the latency and throughput are different.
For Ivy Bridge or Haswell 2-socket systems operating in "Home Snoop" mode, the procedure for a load from an address that is mapped to a remote NUMA node is (approximately):
1. The load misses in the core's L1 and L2 caches and is forwarded to the L3 slice (CBo) responsible for that address.
2. The L3 lookup also misses, and the address decoders determine that the line is homed on the other socket.
3. The read request is sent across QPI to the Home Agent on the remote (home) socket.
4. The Home Agent issues the read to the iMC that owns the address and, since this is Home Snoop mode, sends out the required snoops to the caching agents.
5. The iMC reads the line from DRAM and returns it to the Home Agent, which collects the snoop responses and resolves any conflicts.
6. The Home Agent sends the data back across QPI to the requesting socket, where it is installed in the caches and delivered to the core.
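To make the local-versus-remote distinction concrete, here is a minimal C sketch (my own illustration, not part of the original discussion) that uses libnuma to place a buffer on a chosen NUMA node and time strided reads through it. It assumes a 2-node system with libnuma installed (link with -lnuma); the buffer size, the stride, and the choice to pin the thread to node 0 are arbitrary, and hardware prefetching means the result is only a rough indication of the latency difference.

// remote_read.c -- time reads from a buffer placed on a chosen NUMA node.
// Build: gcc -O2 remote_read.c -o remote_read -lnuma
// Usage: ./remote_read <node>   (the thread is pinned to node 0, so node 1 is the remote case)
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>
#include <numa.h>

#define BUF_SIZE (256UL * 1024 * 1024)   /* 256 MiB: large enough to defeat the L3 */
#define STRIDE   256                     /* larger than a cache line */

int main(int argc, char **argv)
{
    if (argc < 2 || numa_available() < 0) {
        fprintf(stderr, "usage: %s <numa-node>  (needs libnuma and a NUMA kernel)\n", argv[0]);
        return 1;
    }
    int node = atoi(argv[1]);

    numa_run_on_node(0);                 /* keep the thread on node 0 */

    volatile char *buf = numa_alloc_onnode(BUF_SIZE, node);
    if (!buf) { perror("numa_alloc_onnode"); return 1; }
    for (size_t i = 0; i < BUF_SIZE; i += 4096)   /* touch pages so they are really placed on 'node' */
        buf[i] = 1;

    struct timespec t0, t1;
    uint64_t sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < BUF_SIZE; i += STRIDE)
        sum += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("node %d: %.1f ns per access (checksum %llu)\n",
           node, secs * 1e9 / (double)(BUF_SIZE / STRIDE), (unsigned long long)sum);

    numa_free((void *)buf, BUF_SIZE);
    return 0;
}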
The second question is much easier. DRAM refresh is handled by the DRAM controller attached to the DRAM channel. In Intel platforms the DRAM channel controllers are considered part of the iMC. (Each iMC controls 2, 3, or 4 DRAM channels, depending on the platform and the DRAM configuration; this is described in the "Uncore Performance Monitoring Guide" for each platform.)
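As a point of reference on what "refresh rate" means here, the sketch below just works out the standard JEDEC DDR3/DDR4 numbers (my addition, not something stated above): every row has to be refreshed within a 64 ms retention window, spread over 8192 refresh commands, which gives the familiar average refresh interval (tREFI) of about 7.8 us that the channel controller schedules.

// refresh_interval.c -- the arithmetic behind the usual DRAM refresh parameters.
#include <stdio.h>

int main(void)
{
    const double retention_ms = 64.0;   /* JEDEC retention window: all rows within 64 ms */
    const int    refresh_cmds = 8192;   /* refresh (REF) commands issued per window      */

    double trefi_us = retention_ms * 1000.0 / refresh_cmds;   /* average interval between REFs */
    printf("tREFI = %.4f us\n", trefi_us);                    /* -> 7.8125 us                  */
    /* At high temperature many controllers double the rate (tREFI / 2, about 3.9 us). */
    return 0;
}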
The third question probably depends on exactly which platform you are using and what mechanism was used to disable the processors on one NUMA node. Assuming that these are ordinary Intel processors, the behavior should follow the outline above. If you do not disable "C1E state", the performance for the remote memory access is likely to be low. In "C1E state", if none of the cores on a chip are being used, the cores are all clocked down to the minimum speed. Most systems run the "uncore" at a speed no faster than the fastest core, so it will also be at the minimum speed. Since the "uncore" contains the ring, the L3, the Home Agent, and the iMCs, this can significantly increase latency and reduce bandwidth. For Haswell (Xeon E5 v3) you can set the uncore frequency to "maximum", which decouples it from the core frequencies. In this mode I have not seen the same slowdown due to C1E state that I saw on earlier (Sandy Bridge) processors.
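If you want to check what the uncore frequency limits are actually set to, a hedged sketch follows: on Haswell-EP I believe the relevant register is MSR 0x620 (UNCORE_RATIO_LIMIT), with the maximum ratio in bits 6:0 and the minimum ratio in bits 14:8, but please verify the MSR number and field layout against Intel's documentation for your exact part. The sketch reads it through the Linux msr driver (modprobe msr, run as root); CPU 0 is just an example.

// read_uncore_ratio.c -- read the uncore min/max ratio limits via /dev/cpu/N/msr.
// Assumes MSR 0x620 (UNCORE_RATIO_LIMIT) as on Haswell-EP; verify for your CPU model.
// Build: gcc -O2 read_uncore_ratio.c -o read_uncore_ratio ; run as root with the msr module loaded.
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

#define UNCORE_RATIO_LIMIT_MSR 0x620   /* assumed MSR number -- check the SDM for your model */

int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_RDONLY);   /* CPU 0 picked arbitrarily */
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    uint64_t val;
    if (pread(fd, &val, sizeof(val), UNCORE_RATIO_LIMIT_MSR) != sizeof(val)) {
        perror("pread");
        return 1;
    }
    close(fd);

    unsigned max_ratio = val & 0x7f;          /* bits 6:0  (assumed layout) */
    unsigned min_ratio = (val >> 8) & 0x7f;   /* bits 14:8 (assumed layout) */
    printf("uncore ratio limits: max %u, min %u (in multiples of 100 MHz)\n", max_ratio, min_ratio);
    return 0;
}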
Thanks a lot for the thorough reply.
We work on a Sandy Bridge machine (Xeon E5-2600). I don't know if that changes the behavior you described before. Also, we disable CPU 1 using the nr_cpus command-line boot parameter of the Linux kernel.
In the above-mentioned scenario, if we want to change the refresh rate on NUMA node 1, do we still access the IMC of CPU 1 (even if its cores are not used by Linux)?
Thanks a lot,
The Sandy Bridge generation uses a slightly different coherence protocol, but the overall behavior is similar.
I am not sure how configuration accesses are going to work if all of the cores on socket 1 are disabled. PCI configuration space is probably OK -- it is accessed by physical address, so it is probably possible to access those locations using a CPU core on socket 0. MSR configuration space can only be accessed by cores running in the same socket, so if the kernel does not know about any cores on socket 1, then it might not be able to access MSRs in socket 1. On the other hand, it might know about the cores, but know that it is not supposed to use them for scheduling ordinary tasks. In this case it might still be able to use them for MSR accesses.
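For what it's worth, here is a small sketch of the "by physical address" path: reading a register out of a device's PCI configuration space through sysfs, which a core on socket 0 can do even if the device (e.g. one of the iMC functions) lives on socket 1's buses. The bus/device/function string and the offset below are placeholders I made up; the real iMC device numbers and register layout are platform-specific and are documented in the datasheet and the Uncore Performance Monitoring Guide.

// pci_config_read.c -- read a 32-bit register from a device's PCI config space via sysfs.
// The BDF and offset are hypothetical placeholders; substitute the real iMC device for your platform.
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* Placeholder bus/device/function -- find the real iMC devices with 'lspci'. */
    const char *path = "/sys/bus/pci/devices/0000:ff:10.0/config";
    const off_t offset = 0x00;   /* offset 0 is just the vendor/device ID; substitute the register you care about */

    int fd = open(path, O_RDONLY);   /* reading the full config space usually requires root */
    if (fd < 0) { perror("open"); return 1; }

    uint32_t reg;
    if (pread(fd, &reg, sizeof(reg), offset) != sizeof(reg)) {
        perror("pread");
        return 1;
    }
    close(fd);

    printf("%s @ 0x%02lx = 0x%08x\n", path, (long)offset, reg);
    return 0;
}

Note that in practice the memory timing registers are often locked after the BIOS programs them, so whether a runtime change of the refresh rate actually takes effect is a separate question from whether you can reach the register.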