anthony_b_
Beginner

Difference between cache banks and cache slices


Hi,

I recently read about cache banks in the L1 data cache. It is mentioned that cache banks provide concurrency and increase bandwidth by servicing more requests in parallel. I read this in the context of Intel's scatter-gather implementation, as described in the paper "CacheBleed: A Timing Attack on OpenSSL Constant Time RSA".
My intuition after reading this is: "Each cache line is divided into 8 parts, and each part resides in a separate cache bank. The same applies to all lines of all sets and ways. That is how banks are designed." Please correct me if I am wrong somewhere.

Similarly, I read about cache slices at the LLC level. Using slices rather than a single monolithic chunk of LLC reduces access latency. My understanding is: "Two consecutive lines map to two consecutive sets within a particular slice." But the argument I have heard is: "To increase parallel access, two consecutive lines are mapped to two different slices, so that they can be accessed in parallel." I am aware of the hash function used for mapping addresses to slices. Could you clarify which of the two is right?

I would like to know how the mapping works in these two concepts, "cache banks" and "cache slices", and how they differ with respect to the mapping of consecutive lines and set indices.

Thank you.

1 Solution
McCalpinJohn
Black Belt

This could have a very long answer....

The short answer to the question about "banks" is:   Sandy Bridge and Ivy Bridge have 8 banks in their L1 Data Caches.  As you described above, each of the 8 banks handles an independent aligned 8-Byte field in each cache line.   Banks can only service one access per cycle, so if you want to execute multiple reads (or writes) in a cycle they must be to different banks.   Haswell and newer processors use a completely different L1 implementation, so this concept of "banks" does not apply.
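As a quick illustration (not from the original post), the bank selection described above can be sketched as follows. With 64-byte lines split into eight aligned 8-byte fields, bits 5:3 of the address pick the bank, so two same-cycle accesses conflict exactly when those bits match:

```python
# Sketch of Sandy Bridge / Ivy Bridge L1D bank selection as described above.
# Each 64-byte line holds eight aligned 8-byte fields, one per bank.

def l1d_bank(addr: int) -> int:
    """Bank index of the aligned 8-byte field containing this address."""
    return (addr >> 3) & 0x7

def conflicts(addr_a: int, addr_b: int) -> bool:
    """Two same-cycle accesses conflict iff they target the same bank."""
    return l1d_bank(addr_a) == l1d_bank(addr_b)
```

Note that the same field position in two different lines maps to the same bank, which is why strided accesses 64 bytes apart conflict.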

The short answer to the question about "slices" is:  L3 caches on recent Intel processors are built up of multiple independent slices.  Physical addresses are mapped across the slices using an undocumented hash function with cache line granularity.  I.e., consecutive cache lines will be mapped to different L3 slices.    The hash function was designed to spread the traffic approximately evenly across the slices no matter what the access pattern.  (The obvious exception is all cores accessing a single cache line, since a single cache line has to be mapped to a single L3 slice.)   The use of multiple "slices" is intended to increase bandwidth, not to reduce latency.  Unloaded latency is typically slightly higher with a multi-slice cache, since there is no (practical) way to prevent most of your accesses from being to "remote" L3 slices.   This is more than made up for by the increased throughput of the multi-slice L3.
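To make the cache-line-granularity point concrete, here is a purely illustrative stand-in for the slice hash (the real function is undocumented, and the XOR-fold below is invented for the sketch, not Intel's actual hash):

```python
# Toy slice hash, NOT Intel's: it only illustrates that the hash operates on
# cache-line addresses (bits 6 and up), so consecutive lines land in
# different slices while all bytes of one line land in the same slice.

def toy_slice_hash(phys_addr: int, n_slices: int = 8) -> int:
    line = phys_addr >> 6                # drop the 64-byte line offset
    bits = n_slices.bit_length() - 1     # log2(n_slices), power of two assumed
    h = 0
    while line:
        h ^= line & (n_slices - 1)       # XOR-fold the line address
        line >>= bits
    return h
```

With this toy function, addresses 0x0 and 0x40 (two consecutive lines) hash to different slices, while 0x0 and 0x3F (same line) hash to the same one.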


9 Replies

anthony_b_
Beginner

 

Could you provide a few more clarifications?

So from Haswell onwards there is no concept of banks, then?

Given a physical address, will that line always map to the same slice and the same cache set (in the LLC), or can it vary dynamically (as in S-NUCA vs. D-NUCA)? Secondly, does each slice have its own independent replacement policy, or is there a common policy operating across all slices together?

 

 

McCalpinJohn
Black Belt

The Haswell (and Broadwell, and probably Skylake (client)) L1 Data Cache looks like a single bank with two 64-Byte-wide read ports.  The cache can service any two loads of any size and any alignment in one cycle as long as neither of the loads crosses a cache line boundary.   If a load crosses a cache line boundary, then both ports are needed to service the load and no other load can occur in that cycle.   Stores are harder to characterize....
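The line-crossing condition above is easy to state precisely (a small check of my own, not from the original post): a load crosses a 64-byte boundary when its last byte falls in the next line.

```python
# A load of `size` bytes starting at `addr` crosses a 64-byte line boundary
# when its byte range extends past the end of the line it starts in.
# Per the description above, such a load needs both L1D read ports.

def crosses_line(addr: int, size: int, line: int = 64) -> bool:
    return (addr % line) + size > line
```

For example, an 8-byte load at offset 60 within a line spans two lines, while one at offset 56 does not.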

The mapping used by the sliced LLC is fixed at boot time, but the replacement and prefetch policies are dynamic.   The associativity of the LLC is *within* a slice.  For the standard 2.5 MiB slices, the associativity is 20-way.    Each of the slices handles 2048 different "congruence classes" of cache-line addresses, and each "congruence class" is 20-way associative.  (2048 * 64 * 20 = 2.5 MiB).    Knowing this is not terribly helpful, because we don't know which L3 slice a physical address is mapped to, and we don't know which congruence class within the slice is assigned to handle a particular address.   The mapping of addresses to L3 slices can be determined by using the performance counters, but the mapping of lines to congruence classes within the slice is not so easy.  It is easy to test a guess by loading >20 lines that you think will map to the same place and looking for evictions, but there are a lot of possible mappings....
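The slice geometry quoted above checks out arithmetically (a quick verification, not part of the original answer):

```python
# Standard 2.5 MiB LLC slice: 2048 congruence classes ("sets"), 20-way
# associative, 64-byte lines.  Capacity = sets * ways * line size.

SETS, WAYS, LINE = 2048, 20, 64
slice_bytes = SETS * WAYS * LINE
assert slice_bytes == 2560 * 1024        # 2.5 MiB, as stated

# Working backwards from capacity and associativity gives the set count:
assert 2560 * 1024 // (WAYS * LINE) == SETS
```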

anthony_b_
Beginner

Oh, thank you. Now I understand how the bank concept has been replaced by multiple ports in the cache.

Do you mean that in current Intel architectures "the replacement policy used for a slice can change every time" (at every boot)?

So it seems a physical address always maps to the same slice and the same congruence class. My understanding is that a congruence class is the "set number" indicated by the "set index bits" of the physical address (from bit 6 onwards, since bits 0-5 are the cache line offset). Is my intuition about the term "congruence class" right?
I read in the literature that one can find out which congruence class (set) an address maps to by using LARGE PAGES (huge pages). Can you please comment on this?

 

McCalpinJohn
Black Belt

The mapping used by the sliced LLC is set at boot time because that is when the "snooping mode" is set.   The mapping of physical addresses to L3 cache slices is different in "Home Agent" mode (where the addresses are distributed around all of the slices) and in "Sub-NUMA-Cluster" mode (for which the lower half of the physical address space is mapped to 1/2 of the LLC slices and the upper half of the physical address space is mapped to the other 1/2 of the LLC slices).

The mapping of physical address bits to congruence class is a simple bit select for the L1 and L2 caches, but the multi-sliced L3 complicates this.  For example, if there are 14 L3 slices, then there is probably a division of the upper address bits by 14 (or some approximation to this based on a subset of the bits).    On a 12-core Haswell EP (Xeon E5 v3) part, I used the CBo performance counters to determine the mapping of all of the cache lines in a 2MiB page to the L3 slices.  For the particular 2MiB page I was looking at, 8 of the L3 slices were responsible for 2688 cache lines each, while 4 of the L3 slices were responsible for 2816 cache lines each.  This adds up to the expected 32,768 cache lines, but it is not as uniform as I expected.
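The measured distribution does sum to a full 2 MiB page, as a quick sanity check (my arithmetic, not from the original post) shows:

```python
# On the 12-slice Haswell EP part described above, 8 slices held 2688 lines
# each and 4 slices held 2816 lines each of one 2 MiB page.

lines_per_slice = [2688] * 8 + [2816] * 4
assert sum(lines_per_slice) == 32768          # total lines in the page
assert 2 * 1024 * 1024 // 64 == 32768         # 2 MiB / 64 B per line
```

A perfectly uniform split would have been 32768 / 12 = 2730.67 lines per slice, which is why some slices got 2688 and others 2816.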

For the L1 Data Cache, the virtual to physical mapping does not influence cache placement, since the congruence class is determined by bits 11:6, which are not translated with 4KiB or larger page sizes.   For a (typical) 256 KiB, 8-way-associative L2 cache, there are 512 congruence classes (selecting using bits 14:6).  Of these bits, three (14:12) are translated when using 4KiB pages.   Large (2MiB or larger) pages don't translate any of the address bits used to select the L2, so with large pages the L2 mapping is fully controllable.
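The page-size arithmetic above can be written out explicitly (a small helper of my own for illustration; the cache parameters are the ones quoted in the post):

```python
# How many index bits lie inside the untranslated page offset?
# L1D: index bits 11:6.  256 KiB 8-way L2: 512 sets -> index bits 14:6.

LINE_BITS = 6    # 64-byte lines: bits 5:0 are the line offset

def controllable_index_bits(index_hi: int, page_offset_bits: int) -> int:
    """Count of index bits (bit 6 .. index_hi) not changed by translation."""
    return max(0, min(index_hi, page_offset_bits - 1) - LINE_BITS + 1)

l2_sets = 256 * 1024 // (8 * 64)    # 512 sets -> 9 index bits (14:6)
# 4 KiB pages: offset bits 11:0.  2 MiB pages: offset bits 20:0.
# L1D under 4 KiB pages: all 6 index bits controllable.
# L2 under 4 KiB pages: only 6 of 9; under 2 MiB pages: all 9.
```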

Unfortunately this does not work for the multi-sliced L3.   It appears that the "slice select" is based on all of the bits of the physical address, so even with 1 GiB pages you will not be able to fully control L3 mapping (unless you limit the system to one 1 GiB page of user data -- in that case you will get the same virtual address to L3 mapping every time).

Travis_D_
New Contributor II

McCalpin, John wrote:

The mapping used by the sliced LLC is fixed at boot time, but the replacement and prefetch policies are dynamic.

When you say "prefetch" here, what prefetch are you talking about? Is there a prefetch that is initiated by the L3, rather than the L1 or L2?

McCalpinJohn
Black Belt

Wow, this is a really old post!   In the last 2.5 years, I have learned a lot about Intel's address hashing algorithms....  Trying to get it written up in reports....

In this post, I was talking about the L2 HW prefetchers in SNB through BDW.   Under low loads, the L2 HW prefetcher generates prefetches into the L2 cache, but as the L2 miss buffers get closer to full, the L2 HW prefetcher switches to generating prefetches into the (inclusive) L3 cache.   (I have not reviewed the details of the L1 HW prefetchers -- they generate minimal concurrency (and that concurrency is shared with the L1D LFBs), so I don't really care if their behavior is dynamically adaptive or not....)

On SKX, the L3 is exclusive (rather than inclusive), and (by default on my systems) the L2 HW prefetcher only generates prefetches into the L2 cache.   There is a non-default BIOS option to enable "L3 prefetching", but it had only small impact on performance (sometimes faster, sometimes slower), so I did not perform any analysis of exactly what it was doing.    A curious side effect of the lack of "L3 prefetching" is that access to *remote* memory can have higher bandwidth than accesses to *local* memory.   There is a separate prefetcher associated with the UPI interface to the other socket(s), and this prefetcher is enabled by default (on the systems I tested).   This prefetcher is aggressive enough to provide higher bandwidth for remote accesses than for local accesses -- at least for a single thread.   It only takes a few threads to saturate the UPI interface, so multi-threaded bandwidth definitely favors local accesses.

Travis_D_
New Contributor II

McCalpin, John wrote:

Wow, this is a really old post!   In the last 2.5 years, I have learned a lot about Intel's address hashing algorithms....  Trying to get it written up in reports....

Yes, it's an oldie but a goodie...

Out of curiosity, are these reports available to the public to read? I would be very interested in seeing them.

In this post, I was talking about the L2 HW prefetchers in SNB through BDW.   Under low loads, the L2 HW prefetcher generates prefetches into the L2 cache, but as the L2 miss buffers get closer to full, the L2 HW prefetcher switches to generating prefetches into the (inclusive) L3 cache.   (I have not reviewed the details of the L1 HW prefetchers -- they generate minimal concurrency (and that concurrency is shared with the L1D LFBs), so I don't really care if their behavior is dynamically adaptive or not....)

Right, so in these architectures we think there is only an L2 prefetcher (we can target the L3) and no dedicated L3 prefetcher, right? Does the UPI PF you mention below exist on these arches as well?

 

On SKX, the L3 is exclusive (rather than inclusive), and (by default on my systems) the L2 HW prefetcher only generates prefetches into the L2 cache.

It is not immediately obvious to me how the change from inclusive to exclusive would affect prefetching into the L3. It still seems that you could prefetch into the L3. Of course, you don't want to prefetch into the L3 if that would evict a line already in the L2, but that does not seem to be the usual case, since the prefetching is driven by the L2 and checks the L2 first, so it wouldn't issue requests for such lines. Of course, there may be factors that I am not considering.

A bigger factor, I think, is the increased latency and reduced size of the L3: both would seem to tip the scales in favor of L2 prefetching.

There is a non-default BIOS option to enable "L3 prefetching", but it had only small impact on performance (sometimes faster, sometimes slower), so I did not perform any analysis of exactly what it was doing.    A curious side effect of the lack of "L3 prefetching" is that access to *remote* memory can have higher bandwidth than accesses to *local* memory.   There is a separate prefetcher associated with the UPI interface to the other socket(s), and this prefetcher is enabled by default (on the systems I tested).   This prefetcher is aggressive enough to provide higher bandwidth for remote accesses than for local accesses -- at least for a single thread.   It only takes a few threads to saturate the UPI interface, so multi-threaded bandwidth definitely favors local accesses.

That is very interesting, and probably a real confounding factor in some performance investigations since it reverses the usual behavior of remote vs local. Are you aware of any other discussions of this effect? It's almost worth a paper...

 

McCalpinJohn
Black Belt

I am working on 2-3 reports to try to finish out the work that I did in 2018.

I will probably post something in these forums as I complete the reports, and will also post on my blog https://sites.utexas.edu/jdm4372/

I am not aware of a "QPI prefetcher" in systems through Broadwell -- this appears to be new in SKX.

Dropping the "prefetch to L3" and increasing the "prefetch to L2" makes sense due to the size and associativity changes:

  • SKX quadrupled the size of the L2 (0.25 MiB -> 1.0 MiB) and doubled its associativity (8-way -> 16-way)
  • SKX shrank the L3 (2.5 MiB per core -> 1.375 MiB per core) and reduced its associativity (20-way per slice -> 11-way per slice)
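The numbers in those bullets are consistent with the cache geometry (a quick check of my own, not from the original reply):

```python
# Sets = capacity / (ways * line size), 64-byte lines throughout.

LINE = 64
skx_l2_sets = 1 * 1024 * 1024 // (16 * LINE)   # 1 MiB, 16-way -> 1024 sets
bdw_l2_sets = 256 * 1024 // (8 * LINE)         # 256 KiB, 8-way -> 512 sets
skx_l3_sets = 1408 * 1024 // (11 * LINE)       # 1.375 MiB/slice, 11-way -> 2048
bdw_l3_sets = 2560 * 1024 // (20 * LINE)       # 2.5 MiB/slice, 20-way -> 2048
```

Note that both L3 generations keep 2048 congruence classes per slice; the capacity change comes entirely from the associativity (20-way to 11-way), while the L2 both doubles its set count and doubles its associativity.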

With a fully-exclusive L3 cache, prefetching to L3 would invalidate or downgrade lines in other L2 caches -- a very different behavior than in prior generations.  The prefetch could be dropped if the address hits in the snoop filter, but that requires extra logic (and may not always be the right answer for performance).  

Looking back at my original results, I see that the higher remote bandwidth only occurs when the LLC-prefetch is enabled in the BIOS.  With LLC-prefetch enabled, the remote BW increases by ~20% and is about 17% higher than the local BW for the same test.  Local BW is almost unchanged by the LLC prefetch.   Like Sandy Bridge EP, single-threaded BW on SKX is slightly higher with allocating stores than with streaming stores.  Unlike Sandy Bridge EP, vectorization increases single-threaded BW, but the amount depends on the ISA used.

 
