Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Q on TLB, Cache and Memory Timings

Jonathon_D_
Beginner
631 Views

I'm putting together a lecture on paging for my operating systems class, and our textbook (Silberschatz) gives an overly simplistic example of calculating the effective access time for memory.
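(For concreteness, the textbook-style formula I am working from -- assuming a single level of paging, a TLB hit ratio h, a TLB lookup time t_TLB, and a single memory access time t_mem -- is roughly EAT = h*(t_TLB + t_mem) + (1 - h)*(t_TLB + 2*t_mem).  So the numbers I am missing are t_TLB, h, and t_mem.)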

I have been trying to gather some more up-to-date information so that students will see the relative impact of the different parts of the hardware, but I am not doing a very good job of it.  I have no idea what the average TLB access time on Broadwell is.  I just need something in the ballpark.  With the larger L2 TLB I am guessing that the hit rate is over 99% on average.

Where I really got into trouble (and where I must confess my relative cluelessness) is calculating the memory access time.  I thought I could find some nice DDR3 RAM on Newegg and go from there.  A CAS7 part would require 7 memory clocks.  And I read that DDR3 on the i7 downclocks from 1600 to 1066.  That works out to 13 ns.  Is that correct?  It looks more like a cache speed to me, but I realize the DRAM isn't local to the core.  Are there bus cycles I need to add in?
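To show where my 13 ns comes from (and please correct my assumptions if they are wrong), here is the arithmetic in Python, assuming the CAS latency is paid at the DDR3-1066 memory clock of 533 MHz:

# Rough CAS-only latency for a CAS7 DDR3-1066 part
transfer_rate = 1066e6             # transfers per second
memory_clock = transfer_rate / 2   # DDR: two transfers per memory clock -> 533 MHz
cas_cycles = 7
cas_latency_ns = cas_cycles / memory_clock * 1e9
print(cas_latency_ns)              # ~13.1 ns -- CAS only, not the full load-to-use latency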

I would greatly appreciate it if someone would straighten me out so that I can put together a good worked example.

Jonathon Doran

0 Kudos
2 Replies
McCalpinJohn
Honored Contributor III

I wrote up some notes on this a while back...

http://sites.utexas.edu/jdm4372/2011/03/10/memory-latency-components/

The analysis is still applicable for single-socket systems.

In multi-socket systems the local memory latency is the larger of the time required to get the data and the time required to obtain a snoop response from the other socket(s) in the system, and the latter dominates in all of the 2-socket and larger systems that I know of...
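In other words, to first order: local_latency ≈ max(t_local_DRAM, t_remote_snoop).  With made-up round numbers purely for illustration: if the local DRAM access could return data in ~60 ns but the snoop response from the other socket takes ~75 ns, the observed local latency is ~75 ns.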

Latency is also increasing over time as the number of frequency domains increases, since this results in more asynchronous boundary crossings.

TLB access time is not directly visible -- it is fully overlapped with the L1 Data Cache access.  The Intel Optimization Reference Manual (document 248966) mentions a 7-cycle penalty for DTLB (or ITLB) misses that hit in the STLB.  STLBs are getting rapidly larger: 512 entries for Sandy Bridge, 1024 entries for Haswell, and 1536 entries for Broadwell and Skylake.
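As a very rough illustration of how small that term usually ends up being (all of the rates and the page-walk cost below are assumptions chosen for the example, not measured values -- only the 7-cycle STLB-hit penalty comes from the manual):

# Illustrative average TLB-related cost per load, in core cycles
dtlb_hit = 0.990     # assumed: hits in the first-level DTLB are free (overlapped with L1D access)
stlb_hit = 0.009     # assumed: misses the DTLB but hits the STLB
walk     = 0.001     # assumed: misses the STLB and takes a page walk
stlb_penalty = 7     # cycles, from the Intel Optimization Reference Manual
walk_penalty = 30    # cycles, assumed average page-walk cost (paging-structure caches help a lot)

avg_extra_cycles = dtlb_hit * 0 + stlb_hit * stlb_penalty + walk * walk_penalty
print(avg_extra_cycles)   # ~0.09 extra cycles per load with these assumptions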

0 Kudos
McCalpinJohn
Honored Contributor III

I forgot to add that the combination of stalled (or slightly declining) frequencies with increasing core counts has led to more complex on-chip interconnects and more complex shared cache structures -- both of which increase average latency for shared cache hits and for shared cache misses.

For 2 cores it is easy to implement a monolithic shared cache, and for 4 cores a monolithic shared cache may still be a viable option.  More recently, even 4-core processors have switched to a distributed shared cache -- e.g., http://www.hotchips.org/wp-content/uploads/hc_archives/hc23/HC23.19.9-Desktop-CPUs/HC23.19.911-Sandy-Bridge-Lempel-Intel-Rev%207.pdf shows a 4-slice L3 for the 4-core "Second Generation Core i7/i5/i3" family (based on the Sandy Bridge core).

The higher-core-count server processors (Xeon E5/E7) take the same approach further, distributing the shared L3 into one slice per core around the on-chip ring(s).

For all of these systems, addresses are hashed among the L3 slices using an undocumented hash function.  This makes it extremely difficult to create an application that experiences a "hot spot" by directing too many accesses to a single L3 slice, but it also makes it (effectively) impossible to exploit the lower latency of accesses to the L3 cache slices that are "close to" the core making the request.   
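Just to illustrate the idea -- this is NOT the real hash (Intel has not documented it), only a stand-in to show why software cannot steer accesses toward a particular slice:

# Hypothetical stand-in for the undocumented L3 slice-selection hash
def l3_slice(phys_addr: int, num_slices: int = 8) -> int:
    # fold some higher physical-address bits down over the lower ones,
    # then pick a slice from the bits just above the 64-byte line offset
    folded = phys_addr ^ (phys_addr >> 17) ^ (phys_addr >> 28)
    return (folded >> 6) % num_slices

Because high address bits are mixed into the selection, no simple stride or page-placement trick will keep one thread's data in the slice adjacent to its core.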

Although the Intel marketing material often includes the word "scalable" in describing these ring-based architectures, a ring is clearly not a scalable topology in the usual sense of the word.  The bisection bandwidth is constant, so the available bandwidth per core decreases linearly as the number of cores increases.  The average number of "hops" on the ring also increases linearly with ring size, contributing to a modest increase in latency in recent processors.
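A back-of-the-envelope version of that scaling argument, treating the per-link bandwidth as a free parameter (the 32 GB/s per direction used here is a placeholder, not a measured number):

# Crude scaling model for a bidirectional ring with one stop per core
def ring_model(num_stops, link_bw_gbs=32.0):
    bisection_bw = 4 * link_bw_gbs   # cutting the ring severs 2 links x 2 directions -- constant in N
    bw_per_core = bisection_bw / num_stops
    avg_hops = num_stops / 4.0       # mean distance on a bidirectional ring grows linearly with N
    return bw_per_core, avg_hops

for n in (4, 8, 12, 18):
    print(n, ring_model(n))   # per-core share shrinks as 1/N while the average hop count grows as N/4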

The Xeon Phi processor (Knights Corner) takes the ring topology to an extreme.  This processor has no shared last-level cache, so on an L2 miss the processor must somehow check all the other L2 caches.  The number of cores (62 physical, up to 61 enabled) and the available DRAM bandwidth (up to 352 GB/s peak, up to almost 200 GB/s sustained) make it impractical for all of the L2 caches to be snooped on each L2 miss.  (At ~200 GB/s, each core would have to snoop about 3 addresses per cycle, which is well beyond a practical design point for the L2 cache tags.)  Instead, addresses are hashed across a set of "distributed tag directories" (DTDs) which contain duplicates of the L2 cache tags, so only one DTD needs to be snooped for each L2 miss -- reducing the access rate to ~1 access every 20 cycles at each DTD.  The downside is that, once again, locality is lost: requests and responses to/from the DTDs may have to traverse up to the full circumference of the ring.  The combination of this multi-level coherence protocol with distributed DRAM mapping and some other complexities results in a rather high average memory latency of ~275 ns (~300 cycles).
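The arithmetic behind those snoop-rate numbers, using the ~200 GB/s sustained bandwidth from above and the ~1.1 GHz core clock implied by ~275 ns being ~300 cycles (the DTD count of 64 is my assumption for this sketch):

# Why broadcast snooping is impractical on Knights Corner, while hashed DTDs are not
sustained_bw = 200e9      # bytes/second
line_bytes = 64
core_clock = 1.1e9        # cycles/second, implied by ~300 cycles ~= ~275 ns
num_dtds = 64             # assumed number of distributed tag directories

misses_per_sec = sustained_bw / line_bytes                       # ~3.1e9 cache lines per second
snoops_per_core_cycle = misses_per_sec / core_clock              # ~2.8 lookups/cycle if every L2 saw every miss
lookups_per_dtd_cycle = misses_per_sec / num_dtds / core_clock   # ~0.044, i.e. about 1 every 22 cycles
print(snoops_per_core_cycle, lookups_per_dtd_cycle)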

Some mitigations of these impacts are possible.  Xeon E5 v3 (Haswell EP) supports a "cluster on die" mode that splits the cores and L3 cache slices into two groups.  This reduces the average L3 access latency relative to the default mode (hashing addresses over all L3 slices), and might also reduce the memory latency (that depends on the snoop response time from the other chip, which I have not tested yet).   Intel has also recently disclosed that the next-generation Xeon Phi ("Knights Landing") will support a mode that forces addresses to be mapped to a DTD that is in the same "quadrant" as the memory controller that owns the address, as well as supporting a mode that effectively splits the chip into four NUMA nodes. 

 

0 Kudos
Reply