Beginner

IMC and remote memory nodes

Hi all,

I have some questions related to DRAM management and the IMC:

Can someone describe (or point me to documentation on) what exactly happens when a memory access goes to a remote NUMA node?

Secondly, is the refresh rate of DRAM handled by the IMC of the corresponding CPU?

What happens when I have a machine with 2 NUMA nodes but only one CPU in use (I know this is strange)? E.g., let's say I boot my Linux box and force it to see only the cores of the first CPU. In this scenario, how do the answers to the two questions above change (if they change at all)?

Thanks a lot in advance for your time.

 

Cheers,

Babis

4 Replies
Black Belt

As far as I can tell, the very low-level details have never been published, but the overall flow is straightforward.

Recent Intel processors can operate in several different snooping modes, which changes the transactions involved in remote accesses.  Obviously the same end result is obtained, but the transactions are different (and therefore show up in different performance counter events), the ordering of the transactions is different, and the latency and throughput are different.

For Ivy Bridge or Haswell 2-socket systems operating in "Home Snoop" mode, the procedure for a load from an address that is mapped to a remote NUMA node is, approximately, the following (a small user-space timing sketch appears after the list):

  • Look for the data in your own L1 and L2 caches.
  • Put a load request on the ring in your chip to look for the data in the L3 slice that is responsible for that physical address.
    • Note that the hash that determines which L3 slice is responsible for a physical address is not published.
  • The L3 cache indicates a "miss".
  • Some unit on the ring looks up the address in a table to figure out which NUMA node owns the address, and determines which QPI port should be used to send the request to the Home Node of that memory location.
  • A Read transaction is sent on the QPI link to the Home Node for the requested memory location.
    • This may require multiple "hops", routing through intermediate nodes or switches on the way to the destination.
  • The Read transaction arrives at the Home node and is placed on the Home node's ring to make its way to the Home Agent in that chip that owns the address being requested.
  • The Home Agent on the Home node sends out a snoop request to the slice of the local L3 cache that is responsible for the requested address, and (probably concurrently) sends out a snoop request to every other NUMA node in the system (except the one that sent the Read request, since the existence of the Read request means that the data was not in the L3 cache there).
    • In a 2-node system, this means that there will be no snoop requests on the QPI interface, since there are no other NUMA nodes -- just the Home and the Requester.
  • In most cases, the Home Agent on the Home node will also request the cache line from the Memory Controller (iMC) that owns the cache line -- in parallel with the local (and possibly remote) L3 snoop requests.
    • There are lots of implementation choices available here for exactly what goes in parallel and how or whether speculative transactions that have not yet executed (e.g., due to queuing delays in the iMC) are cancelled.
  • The L3 cache slice on the Home node indicates a "miss", as do any remote NUMA nodes.
  • The Read response from the DRAM (including the data) is sent to the Requesting node.
    • Some protocols combine the Snoop Response with the Read Response, so that the Requesting node gets only one response.  That response will include the data and an indication that the data is valid.
    • Some protocols send the Read Response from DRAM to the Requester without waiting for the Snoop Response.   The Snoop Response(s) are sent to the Requester separately (one for each NUMA node outside the Requesting node).  The Requester must wait for all Snoop Responses to indicate "miss" or "clean" before using the data. 
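
All of those extra hops are easy to see from user space as added load latency.  Here is a rough, minimal sketch (my own illustration, not anything official) that pointer-chases through a buffer allocated first on the local node and then on the remote node, using libnuma on a 2-node Linux box.  Build with "gcc -O2 remote_latency.c -lnuma"; the file name and the buffer/iteration sizes are arbitrary choices.

// remote_latency.c -- rough local vs. remote load-latency probe (illustration only).
// Build: gcc -O2 remote_latency.c -lnuma
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <numa.h>

#define NPTRS (64UL * 1024 * 1024 / sizeof(void *))   /* 64 MiB buffer, larger than L3 */

/* Pointer-chase through the buffer with one pointer per 4 KiB page, so every
 * load depends on the previous one and (almost) every load misses the caches. */
static double chase_ns(void **buf, size_t n, size_t iters)
{
    size_t stride = 4096 / sizeof(void *);
    for (size_t i = 0; i < n; i += stride)
        buf[i] = &buf[(i + stride) % n];

    void *p = &buf[0];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++)
        p = *(void **)p;                        /* serialized dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (p == (void *)1) puts("unreachable");    /* keep p live */

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}

int main(void)
{
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }
    if (numa_run_on_node(0) != 0) perror("numa_run_on_node");   /* pin thread to socket 0 */

    for (int node = 0; node <= 1; node++) {
        void **buf = numa_alloc_onnode(NPTRS * sizeof(void *), node);
        if (!buf) { perror("numa_alloc_onnode"); return 1; }
        printf("memory on node %d: ~%.1f ns per dependent load\n",
               node, chase_ns(buf, NPTRS, 10 * 1000 * 1000UL));
        numa_free(buf, NPTRS * sizeof(void *));
    }
    return 0;
}

With the thread pinned to node 0, the node-1 number should come out noticeably higher than the node-0 number, which is just the extra QPI and Home Agent round trips described above showing up as latency.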

 

The second question is much easier.  DRAM refresh is handled by the DRAM controller attached to the DRAM channel.  In Intel platforms the DRAM channel controllers are considered part of the iMC.  (Each iMC controls 2, 3, or 4 DRAM channels, depending on the platform and the DRAM configuration; this is described in the "Uncore Performance Monitoring Guide" for each platform.)
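
For completeness: on Sandy Bridge EP the iMC channel controllers show up as ordinary PCI devices, so their configuration registers can be read through sysfs.  A minimal sketch follows; the device address is a placeholder and the offset 0x0 only reads the vendor/device ID as a sanity check -- the actual refresh-timing register offsets have to come from the processor datasheet / uncore documentation for your platform.

/* Sketch: read a 32-bit register from an iMC channel controller's PCI
 * configuration space on Linux.  The bus/device/function below is a
 * PLACEHOLDER, and offset 0x0 just returns the vendor/device ID. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    const char *dev = "/sys/bus/pci/devices/0000:7f:10.0/config";  /* placeholder BDF */
    const off_t reg_offset = 0x0;                                  /* placeholder offset */

    int fd = open(dev, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    uint32_t val;
    if (pread(fd, &val, sizeof(val), reg_offset) != sizeof(val)) {
        perror("pread");
        return 1;
    }
    printf("%s offset 0x%lx = 0x%08x\n", dev, (long)reg_offset, (unsigned)val);
    close(fd);
    return 0;
}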

The third question probably depends on exactly which platform you are using and what mechanism was used to disable the processors on one NUMA node.   Assuming that these are ordinary Intel processors, the behavior should follow the outline above.  If you do not disable "C1E state", the performance for the remote memory access is likely to be low.   In "C1E state", if none of the cores on a chip are being used, the cores are all clocked down to the minimum speed.  Most systems run the "uncore" at a speed no faster than the fastest core, so it will also be at the minimum speed.  Since the "uncore" contains the ring, the L3, the Home Agent, and the iMCs, this can significantly increase latency and reduce bandwidth.  For Haswell (Xeon E5 v3) you can set the uncore frequency to "maximum", which decouples it from the core frequencies.  In this mode I have not seen the same slowdown due to C1E state that I saw on earlier (Sandy Bridge) processors.
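
As an aside, on the Haswell EP parts the uncore ratio limits are exposed in an MSR -- MSR_UNCORE_RATIO_LIMIT, listed as 0x620 in the public documentation and in the Linux kernel headers, as far as I can tell.  A rough sketch for reading it through the Linux msr driver ("modprobe msr", run as root); please verify the MSR number and the bit layout against the documentation for your specific part:

/* Sketch: read the uncore ratio limits through /dev/cpu/<n>/msr.
 * The MSR number (0x620) and the bit fields are my reading of the public
 * documentation -- check them against the SDM for your processor. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

#define MSR_UNCORE_RATIO_LIMIT 0x620

int main(void)
{
    /* Any online core on the target socket works; CPU 0 is used here. */
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    uint64_t v;
    if (pread(fd, &v, sizeof(v), MSR_UNCORE_RATIO_LIMIT) != sizeof(v)) {
        perror("pread");
        return 1;
    }
    /* Assumed layout: bits 6:0 = max ratio, bits 14:8 = min ratio, x100 MHz. */
    printf("uncore max ratio = %llu (x100 MHz), min ratio = %llu (x100 MHz)\n",
           (unsigned long long)(v & 0x7f),
           (unsigned long long)((v >> 8) & 0x7f));
    close(fd);
    return 0;
}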

Beginner

Thanks a lot for the thorough reply.

We work on a Sandy Bridge machine (Xeon E5-2600). I don't know if that changes the behavior you described before. Also, we disable CPU 1 using the nr_cpus command-line boot parameter for the Linux kernel.

In the above-mentioned scenario, if we want to change the refresh rate on NUMA node 1, do we still access the IMC of CPU 1 (even if the cores there are not used by Linux)?

Thanks a lot,

Babis

 

Black Belt

The Sandy Bridge generation uses a slightly different coherence protocol, but the overall behavior is similar. 

I am not sure how configuration accesses are going to work if all of the cores on socket 1 are disabled.  PCI configuration space is probably OK -- it is accessed by physical address, so it should be possible to access those locations using a CPU core on socket 0.  MSR configuration space can only be accessed by cores running in the same socket, so if the kernel does not know about any cores on socket 1, it might not be able to access MSRs in socket 1.  On the other hand, the kernel might know about those cores but simply not use them for scheduling ordinary tasks; in that case it might still be able to use them for MSR accesses.
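
One quick way to see which of those two cases applies is to check which CPUs the kernel enumerated at all, e.g. with a trivial program like the sketch below (nothing platform-specific in it).  My understanding is that with nr_cpus= the extra CPUs are usually not even listed as present, while with maxcpus= they typically show up as present but offline -- worth verifying on your system.

/* Trivial sketch: print which CPUs the kernel considers present/online/offline. */
#include <stdio.h>

static void show(const char *path)
{
    char buf[256];
    FILE *f = fopen(path, "r");
    if (!f) { printf("%-40s (not available)\n", path); return; }
    if (fgets(buf, sizeof(buf), f))
        printf("%-40s %s", path, buf);
    fclose(f);
}

int main(void)
{
    show("/sys/devices/system/cpu/present");
    show("/sys/devices/system/cpu/online");
    show("/sys/devices/system/cpu/offline");
    return 0;
}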

Beginner

That was very useful. Thanks John.
