Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Question about address translation on the Xeon 5600's L3 cache

zhangyihere
Beginner
Hi, all!
I am using a Xeon 5650 processor. Its last-level cache is listed as 12MB, shared among 6 cores.

What I am wondering is: 12 is not a power of 2, so if an address falls in the 12MB to 16MB range, how is an L3 cache position allocated for it?
Hussam_Mousa__Intel_
New Contributor II
Hello,

The last level cache is actually split evenly across the 6 cores. So while each core can access (load) from the entire 12MB range, its requests will only be cached into its own slice.

Most caches, including the LLC, are set associative. This means that when an address is mapped into the cache, there are several locations it can be written to, as opposed to direct mapping, which has only one location per cache line.

You can read more about cache associativity on Wikipedia: CPU cache
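For concreteness, here is a minimal C sketch of how a set-associative lookup splits an address into an offset, a set index, and a tag. The line size, set count, and way count below are made-up round numbers for illustration, not the actual Xeon 5600 LLC parameters or hash:

/* Generic illustration of set-associative lookup.  The sizes are
 * made up for the example, not the Xeon 5600 LLC parameters. */
#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE 64          /* bytes per cache line            */
#define NUM_SETS  1024        /* number of sets (positions)      */
#define NUM_WAYS  8           /* ways (slots) per set            */

int main(void)
{
    uint64_t addr = 0x12345680;

    uint64_t line = addr / LINE_SIZE;   /* drop the byte offset         */
    uint64_t set  = line % NUM_SETS;    /* which position in the cache  */
    uint64_t tag  = line / NUM_SETS;    /* identifies the line within   */
                                        /* any of the NUM_WAYS slots    */

    printf("address 0x%llx -> set %llu, tag 0x%llx (one of %d ways)\n",
           (unsigned long long)addr, (unsigned long long)set,
           (unsigned long long)tag, NUM_WAYS);
    return 0;
}

In a direct-mapped cache NUM_WAYS would be 1, so a new line mapping to an occupied set always evicts the previous occupant; with several ways the hardware can choose among the slots in that set.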

I hope this helps,
Hussam
zhangyihere
Beginner
Thank you Hussam,

I am sorry, I don't think I stated my question clearly enough. Let me try again.

If the last level cache can be entirely accessed by all 6 cores, it also means that on each core the whole address space (neglecting physical address holes) can use the last level cache.

BUT on the Xeon 5650, because the size of the last level cache is not a power of 2, if we simply divide an address by the cache size, not every address divides evenly. Those addresses should still use the last level cache, so here is my question: how are they mapped into the last level cache? If the remainder is used directly as the cache index, some cache sets will unavoidably service more accesses than others, so accesses to the last level cache would not be evenly distributed.

Am I correct? Or is there some additional design in the cache?

Thank you in advance!

Yi Zhang
suhailinternational
Thank you Hussam, your answer is very helpful for me. I personally thank you.
Hussam_Mousa__Intel_
New Contributor II
Hello Yi Zhang,

Let me make some clarifications below.

Quoting zhangyihere
If the last level cache can be entirely accessed by all 6 cores, it also means that on each core the whole address space (neglecting physical address holes) can use the last level cache.

Each core's access to the portions of the last level cache is different. Each core "owns" a part of the LLC into which its reads are brought, i.e. if a line is not in the LLC it will be read from memory, sent to the core, AND written into the portion of the LLC assigned to that particular core. If another core then reads this same line, it will be able to access it in the first core's LLC portion.

This means that if a request from a core needs to replace a line in the LLC, it will only replace lines in the portion of the LLC allocated to this core and not to another core.


Quoting zhangyihere
BUT on the Xeon 5650, because the size of the last level cache is not a power of 2, if we simply divide an address by the cache size, not every address divides evenly. Those addresses should still use the last level cache, so here is my question: how are they mapped into the last level cache? If the remainder is used directly as the cache index, some cache sets will unavoidably service more accesses than others, so accesses to the last level cache would not be evenly distributed.
The LLC is set associative. What this means is that each address from the physical address space will map into exactly one position in the LLC; however, each position has several slots that can store several memory lines that have all mapped to this same position.

For example, imagine addresses X1, X2, X3 all map to position A in the LLC, and imagine position A has 2 slots. A read of X1 will bring it into [position A, slot 1]; a later read of X2 will bring X2 into [position A, slot 2]. If X3 is read later, then the LLC will need to decide whether to evict X1 or X2, since X3 can only be written to position A (slot 1 or 2).

In this example this is called a 2-way set associative cache. In general you can have an N-way set associative cache. The number of physical addresses that can map to each position is equal to (ADDRESS_SPACE / (CACHE_SIZE / SET_ASSOCIATIVITY_DEGREE)).

The operator is a DIV, so they don't need to be perfect multiples in general, although in practice the value of (CACHE_SIZE / SET_ASSOCIATIVITY_DEGREE) is what needs to be a perfect divisor of the ADDRESS_SPACE.
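Here is a toy C model of the 2-way example above. The LRU replacement choice and the tag values are assumptions made purely for illustration; the real LLC replacement policy is not specified in this thread:

/* Toy model of a single 2-way set ("position A").  LRU replacement is
 * assumed only to make the example concrete. */
#include <stdio.h>
#include <stdint.h>

#define WAYS 2

static uint64_t slot[WAYS];   /* tags currently stored in position A */
static int      age[WAYS];    /* larger age = touched longer ago     */
static int      used;         /* how many slots are filled           */

static void access_line(uint64_t tag)
{
    int hit = -1;
    for (int i = 0; i < used; i++)
        if (slot[i] == tag) hit = i;

    for (int i = 0; i < used; i++) age[i]++;    /* everyone gets older */

    if (hit >= 0) {
        printf("hit   tag 0x%llx in slot %d\n", (unsigned long long)tag, hit);
        age[hit] = 0;
        return;
    }

    int victim = 0;
    if (used < WAYS) {
        victim = used++;                        /* fill an empty slot  */
    } else {
        for (int i = 1; i < WAYS; i++)          /* evict the oldest    */
            if (age[i] > age[victim]) victim = i;
        printf("evict tag 0x%llx from slot %d\n",
               (unsigned long long)slot[victim], victim);
    }
    slot[victim] = tag;
    age[victim]  = 0;
    printf("fill  tag 0x%llx into slot %d\n", (unsigned long long)tag, victim);
}

int main(void)
{
    access_line(0x1);   /* X1 -> slot 0                          */
    access_line(0x2);   /* X2 -> slot 1                          */
    access_line(0x3);   /* X3 -> evicts X1, the least recent one */
    return 0;
}

Running it shows X1 and X2 filling the two slots and X3 forcing the eviction of whichever was touched least recently.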

Regarding the segregation of the LLC across the cores, the key lies in understanding how the sets are distributed across the cores. Each portion of the LLC allocated to a core will have slots that represent all the possible positions that a physical address can map to.

I hope this clarifies things some more.
-Hussam
David_O_4
Beginner

Hi,  I need to model the behavior of the Intel L5630 L3 12MB 16-way cache.  I don't understand your explanation but maybe a more detailed example would help.  Specifically, I don't understand how the set is selected - which bits in the physical address are used and how, and which bits are used as the tag.

It is my understanding that this cache has 12,288 sets (12MB/16 way/64B cache line). For a 40 bit physical address (BE, right to left, LSB=0 to MSB=39), bits[2-0] select the bytes in each quad word (these bits are not referenced by the cache HW itself) and bits[5-3] select the quad word in the cache line. Then the set would typically be selected by several of the next higher bits, depending on the # of sets, which is typically a power of two. For example, if the number of sets is 64 then bits[11-6] would select the sets 0-63.  

But 12,288 is not a power of 2, and bits[19-6] would select sets 0-16383 and bits[18-6] would select sets 0-8191. So how would this work?

If bits[19-6] are used, then both a value of 16383 and 4095 map into the same set, 4095, if you compute the set as the remainder, i.e. bits[19-6] % 12288. What would the tag be?
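A minimal sketch of the bits[19-6] % 12288 indexing hypothesized here, purely to make the arithmetic concrete (it is not a documented mapping):

/* Sketch of the hypothesized indexing: take physical address bits
 * [19:6] and reduce them modulo 12,288 sets.  This is only the
 * hypothesis in the question, not the documented L5630 hash. */
#include <stdio.h>
#include <stdint.h>

#define NUM_SETS 12288u                 /* 12 MiB / 16 ways / 64 B */

int main(void)
{
    uint64_t addrs[] = { 16383ull << 6, 4095ull << 6 };  /* bits[19:6] = 16383 and 4095 */
    for (int i = 0; i < 2; i++) {
        uint64_t field = (addrs[i] >> 6) & 0x3FFF;       /* extract bits [19:6] */
        printf("bits[19:6] = %5llu -> set %llu\n",
               (unsigned long long)field,
               (unsigned long long)(field % NUM_SETS));  /* both print set 4095 */
    }
    return 0;
}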

When you mentioned that the L3 is split across cores, are you referring to the sets (12,288 sets / 4 cores), or to the ways (16 ways / 4 cores)?

Thanks!

McCalpinJohn
Honored Contributor III

If I recall correctly the Westmere EP supports modulo-3 address mapping across the three memory channels using all physical address bits above the cache line boundary.   Of course latency is a bit less critical when going to memory than when going to L3.
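A one-line illustration of that idea, assuming the simplest possible mod-3 reduction of the line address (the actual channel hash may differ):

/* Illustration of a modulo-3 channel interleave: all physical address
 * bits above the 64-byte line offset feed a mod-3 reduction.  This is
 * a sketch of the idea, not Intel's documented mapping. */
#include <stdio.h>
#include <stdint.h>

static unsigned channel_of(uint64_t paddr)
{
    return (unsigned)((paddr >> 6) % 3);   /* drop the line offset, mod 3 */
}

int main(void)
{
    for (uint64_t a = 0; a < 8 * 64; a += 64)
        printf("line 0x%04llx -> channel %u\n",
               (unsigned long long)a, channel_of(a));
    return 0;
}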

The Hot Chips presentation on Westmere EP (http://www.hotchips.org/wp-content/uploads/hc_archives/hc22/HC22.24.620-Hill-Intel-WSM-EP-print.pdf) says (slide 12):

L3 still 16-way shared.  SET address arithmetic changed.

For Westmere EX, the Hot Chips presentation (http://www.hotchips.org/wp-content/uploads/hc_archives/hc22/HC22.24.610-Nagara-Intel-6-Westmere-EX.pdf) says (slide 8):

Distributed 10 slice, shared LLC (L3 cache)

10 way Physical Address hashing to avoid hot spots

For Sandy Bridge EP and newer processors, Intel has been clear that the mapping of physical addresses to L3 slices is undocumented.  This also means that the mapping of physical addresses to sets within each L3 slice is also undocumented.  On the other hand, these newer processors have "per slice" uncore performance counters, so it is straightforward to build test codes to determine the mappings.  The physical address to slice mapping is easy -- just load/flush/load repeatedly on a single address and see which L3 CBo records the most accesses.  Increment the address and repeat.  The hard part is translating the results into a formula, but it can sometimes be done.   Once you have a set of addresses that map to the same L3 slice, you can look for associativity conflicts by repeatedly loading a set of addresses (where the set size is larger than the associativity) and checking the L3 victim counts.  This can be a very slow process, but in principle it is straightforward.  There is no way to know how these sets are internally numbered (just as there is no way to know how the "ways" are internally numbered), but it is possible to discover sets of addresses that map to the same set in the same slice.
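Here is a rough sketch of that load/flush/load probe. The CBo counter read is only a placeholder stub: programming and reading the per-slice uncore events (e.g. via raw MSRs or the perf uncore interface) is omitted, and the slice count and probe count are assumptions for illustration:

/* Sketch of the slice-mapping probe: repeatedly load and flush one
 * cache line, then see which L3 CBo counted the most lookups. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>                    /* _mm_clflush, _mm_mfence */

#define NUM_CBO 8                         /* assume 8 slices (8-core Sandy Bridge EP) */
#define PROBES  100000

/* Placeholder: on real hardware this would program and read the CBo
 * lookup events through the uncore PMU.  A stub so the sketch compiles. */
static uint64_t read_cbo_lookup_count(int cbo) { (void)cbo; return 0; }

static int probe_slice(volatile char *line)
{
    uint64_t before[NUM_CBO], after[NUM_CBO];

    for (int c = 0; c < NUM_CBO; c++) before[c] = read_cbo_lookup_count(c);

    for (int i = 0; i < PROBES; i++) {
        volatile char tmp = *line;        /* load: the line lands in one L3 slice */
        (void)tmp;
        _mm_mfence();
        _mm_clflush((const void *)line);  /* flush so the next load misses again  */
        _mm_mfence();
    }

    for (int c = 0; c < NUM_CBO; c++) after[c] = read_cbo_lookup_count(c);

    int best = 0;                         /* the slice that saw the most lookups */
    for (int c = 1; c < NUM_CBO; c++)
        if (after[c] - before[c] > after[best] - before[best]) best = c;
    return best;
}

int main(void)
{
    static char buf[64] __attribute__((aligned(64)));   /* one cache line */
    printf("line at %p appears to map to CBo %d\n", (void *)buf, probe_slice(buf));
    return 0;
}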

I have not seen any documentation on Westmere EP that would provide information about the mapping, but I have not looked very hard.  There are some sneaky tricks that can be done with page coloring to partition a cache, but other than that I have not seen a lot of reason to worry about the details.   On my Sandy Bridge EP (8-core) processors, L3 bandwidth for contiguous accesses scales extremely well (7x or more using 8 threads), so the hash works effectively for the cases that interest me.

McCalpinJohn
Honored Contributor III

A few more comments....

An introduction to the Xeon E5 2600 family (Sandy Bridge EP) at https://software.intel.com/en-us/articles/intel-xeon-processor-e5-26004600-product-family-technical-overview includes the comment (slightly reorganized for clarity):

In the Intel Xeon processor 5600 series there was a single cache pipeline and queue [used by] all cores.   In the Intel Xeon processor 2600/4600 product family [each L3 cache slice has a full cache pipeline].

The mapping of physical addresses to L3 slices in the Xeon E5 2600/4600 families is probably a hash on a relatively large number of bits, but I have not tried to document it. One reason not to use a simple 3-bit index (for the 8-core/8-slice version) is that accesses tend to be concentrated at the bottom of 4KiB pages, so you don't want the "zero offset" line of all 4KiB pages to map to the same slice. Typically selection algorithms would XOR some higher-order address bits so that randomly chosen 4KiB pages will have their "zero offset" lines mapping to different L3 slices. Of course any product with a non-power-of-two number of L3 slices will need a more complex hash as well.
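For illustration only, here is the kind of XOR folding meant here; the specific bit groups are arbitrary and are not the actual E5 2600/4600 hash:

/* XOR-fold several higher-order physical address bit groups into a
 * 3-bit slice index so the "zero offset" lines of different 4 KiB
 * pages spread across slices.  The bit groups are arbitrary. */
#include <stdio.h>
#include <stdint.h>

static unsigned slice_of(uint64_t paddr)
{
    uint64_t line = paddr >> 6;                    /* drop the 64 B offset */
    unsigned s = 0;
    for (int shift = 0; shift < 30; shift += 3)    /* XOR 3-bit groups     */
        s ^= (unsigned)((line >> shift) & 0x7);
    return s;                                      /* 0..7 for 8 slices    */
}

int main(void)
{
    /* The first line of several (arbitrary) 4 KiB pages lands on different slices. */
    uint64_t pages[] = { 0x0000, 0x1000, 0x2000, 0x7000, 0x40000, 0x81000 };
    for (int i = 0; i < 6; i++)
        printf("page 0x%06llx, line 0 -> slice %u\n",
               (unsigned long long)pages[i], slice_of(pages[i]));
    return 0;
}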

Internal to each L3 slice, the mapping might not be as complex as I speculated in my previous note.  The Xeon 5600 L3 is 2 MiB per core, with 16-way associativity.  This is 128 KiB/way, which allows simple binary set selection.  Similarly the Xeon E5 2600 (Sandy Bridge EP) L3 is 2.5 MiB per core, with 20-way associativity.  This is also 128 KiB/way, which allows simple binary set selection.   I will need to do some experiments, but if the mapping is as simple as I hope, then I will be able to use a specialized page allocator to vary the effective size of the L3 cache.  I.e., each L3 slice is 128 KiB/way, or 32 4KiB pages "tall".  It should be possible to select physical pages that map to any subset of those 32 page-sized set indices, thus limiting the effective size of the L3 cache to N/32 times the "normal" size, where 1<=N<=32.  (Note that since the L3 is inclusive, this might also reduce the effective L2 size -- I need more coffee to work out the mappings.)
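A sketch of that page-coloring arithmetic, under the (unverified) assumption of simple binary set selection within each slice:

/* Assuming simple binary set selection: 128 KiB/way means 32 4 KiB
 * "colors" per slice, selected by physical address bits [16:12].  An
 * allocator that only hands out N of the 32 colors would limit the
 * usable L3 to N/32 of its nominal size.  Speculation to be verified. */
#include <stdio.h>
#include <stdint.h>

#define COLORS 32                          /* 128 KiB per way / 4 KiB pages */

static unsigned color_of(uint64_t paddr)
{
    return (unsigned)((paddr >> 12) & (COLORS - 1));   /* bits [16:12] */
}

int main(void)
{
    unsigned allowed = 8;                  /* allocator hands out colors 0..7 only    */
    uint64_t page = 0x3F5000;              /* arbitrary physical page for the example */
    unsigned c = color_of(page);

    printf("page 0x%llx has color %u -> %s under an %u/32 allocator\n",
           (unsigned long long)page, c,
           (c < allowed) ? "usable" : "rejected", allowed);
    printf("effective L3 size would be %u/32 of nominal\n", allowed);
    return 0;
}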
