Access to the portions of the Last Level Cache is not the same for every core. Each core "owns" a part of the LLC into which its reads are brought, i.e. if a line is not in the LLC it will be read from memory, sent to the core, AND written into the portion of the LLC assigned to this particular core. If another core then reads this same line, it will be able to access it in the first core's LLC portion.
This means that when a request from a core needs to replace a line in the LLC, it will only replace lines in the portion of the LLC allocated to that core, not lines in another core's portion.
Hi, I need to model the behavior of the Intel L5630 L3 12MB 16-way cache. I don't understand your explanation but maybe a more detailed example would help. Specifically, I don't understand how the set is selected - which bits in the physical address are used and how, and which bits are used as the tag.
It is my understanding that this cache has 12,288 sets (12 MB / 16 ways / 64 B cache lines). For a 40-bit physical address (BE, right to left, LSB=0 to MSB=39), bits[2-0] select the byte within each quad word (these bits are not referenced by the cache HW itself) and bits[5-3] select the quad word in the cache line. The set would then normally be selected by several of the next higher bits, depending on the # of sets, which is typically a power of two. For example, if the number of sets is 64, then bits[11-6] would select sets 0-63.
But 12,288 is not a power of 2: bits[19-6] would select sets 0-16383, while bits[18-6] would only select sets 0-8191. So how would this work?
If bits[19-6] are used, then both 16383 and 4095 map into the same set, 4095, if you compute the set as the remainder, i.e. bits[19-6] % 12288. What would the tag be?
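To make my mental model concrete, here is how I would compute the set and tag under a pure remainder scheme (entirely hypothetical -- I have no idea whether the hardware does anything like this), using the two colliding line addresses from above:

    #include <stdio.h>
    #include <stdint.h>

    #define NUM_SETS 12288ULL              /* 12 MB / 16 ways / 64 B lines */

    int main(void)
    {
        /* Line addresses = physical address >> 6 (drop the line offset). */
        uint64_t lines[2] = { 4095ULL, 16383ULL };
        for (int i = 0; i < 2; i++) {
            uint64_t set = lines[i] % NUM_SETS;  /* remainder picks the set  */
            uint64_t tag = lines[i] / NUM_SETS;  /* quotient would be the tag */
            printf("line %5llu -> set %llu, tag %llu\n",
                   (unsigned long long)lines[i],
                   (unsigned long long)set,
                   (unsigned long long)tag);
        }
        return 0;
    }

Both lines land in set 4095, but with tags 0 and 1; under a mod-N index the tag would have to be the quotient (not just the high address bits), since set and tag together must still reconstruct the full line address.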
When you mentioned that the L3 is split across cores, were you referring to the sets (12,288 sets / 4 cores), or to the columns, i.e. the ways (16-way / 4 cores)?
If I recall correctly, Westmere EP supports modulo-3 address mapping across the three memory channels using all physical address bits above the cache line boundary. Of course latency is a bit less critical when going to memory than when going to L3.
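A minimal sketch of what such a channel selector could look like, assuming a plain remainder over the line address (the real implementation is presumably cleverer):

    #include <stdint.h>

    /* Hypothetical modulo-3 memory channel selection using all physical
     * address bits above the 64 B cache line boundary -- illustration only. */
    static unsigned channel(uint64_t paddr)
    {
        return (unsigned)((paddr >> 6) % 3);
    }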
The Hot Chips presentation on Westmere EP (http://www.hotchips.org/wp-content/uploads/hc_archives/hc22/HC22.24.620-Hill-Intel-WSM-EP-print.pdf) says (slide 12):
L3 still 16-way shared. SET address arithmetic changed.
For Westmere EX, the Hot Chips presentation (http://www.hotchips.org/wp-content/uploads/hc_archives/hc22/HC22.24.610-Nagara-Intel-6-Westmere-EX.pdf) says (slide 8):
Distributed 10 slice, shared LLC (L3 cache)
10 way Physical Address hashing to avoid hot spots
For Sandy Bridge EP and newer processors, Intel has been clear that the mapping of physical addresses to L3 slices is undocumented. This also means that the mapping of physical addresses to sets within each L3 slice is undocumented. On the other hand, these newer processors have "per slice" uncore performance counters, so it is straightforward to build test codes to determine the mappings. The physical address to slice mapping is easy -- just load/flush/load repeatedly on a single address and see which L3 CBo records the most accesses. Increment the address and repeat. The hard part is translating the results into a formula, but it can sometimes be done.

Once you have a set of addresses that map to the same L3 slice, you can look for associativity conflicts by repeatedly loading a set of addresses (where the set size is larger than the associativity) and checking the L3 victim counts. This can be a very slow process, but in principle it is straightforward. There is no way to know how these sets are internally numbered (just as there is no way to know how the "ways" are internally numbered), but it is possible to discover sets of addresses that map to the same set in the same slice.
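To make the load/flush/load probe concrete, here is a minimal sketch of the inner loop, assuming x86 with CLFLUSH available. Reading the per-slice CBo counters (e.g., the uncore LLC lookup events via Linux perf) happens outside this loop and is omitted here:

    #include <stdint.h>
    #include <emmintrin.h>   /* _mm_clflush, _mm_mfence (SSE2) */

    /* Hammer one cache line so that the CBo owning its slice records far
     * more LLC events than the others.  Run this between two readings of
     * the per-slice uncore counters, then advance the address by 64 and
     * repeat.                                                            */
    static void probe_line(volatile uint64_t *p, long reps)
    {
        for (long i = 0; i < reps; i++) {
            (void)*p;                        /* load -> line fills its slice  */
            _mm_mfence();
            _mm_clflush((const void *)p);    /* evict so the next load misses */
            _mm_mfence();
        }
    }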
I have not seen any documentation on Westmere EP that would provide information about the mapping, but I have not looked very hard. There are some sneaky tricks that can be done with page coloring to partition a cache, but other than that I have not seen a lot of reason to worry about the details. On my Sandy Bridge EP (8-core) processors, L3 bandwidth for contiguous accesses scales extremely well (7x or more using 8 threads), so the hash works effectively for the cases that interest me.
A few more comments...
An introduction to the Xeon E5 2600 family (Sandy Bridge EP) at https://software.intel.com/en-us/articles/intel-xeon-processor-e5-26004600-product-family-technical-... includes the comment (slightly reorganized for clarity):
In the Intel Xeon processor 5600 series there was a single cache pipeline and queue [used by] all cores. In the Intel Xeon processor 2600/4600 product family [each L3 cache slice has a full cache pipeline].
The mapping of physical addresses to L3 slices in the Xeon E5 2600/4600 families is probably a hash on a relatively large number of bits, but I have not tried to document it. One reason not to use a simple 3-bit index (for the 8-core/8-slice version) is that accesses tend to be concentrated at the bottom of 4KiB pages, so you don't want the "zero offset" line of every 4KiB page to map to the same slice. Typical selection algorithms would XOR in some higher-order address bits so that randomly chosen 4KiB pages will have their "zero offset" lines mapping to different L3 slices. Of course any product with a non-power-of-two number of L3 slices will need a more complex hash as well.
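As an illustration of the kind of hash I mean (the bit positions here are invented, not Intel's actual function), an 8-slice part could XOR-fold higher line-address bits into a 3-bit slice index:

    #include <stdint.h>

    /* Invented 3-bit slice selector for an 8-slice LLC.  XOR-folding in
     * higher address bits keeps the "zero offset" lines of different
     * 4 KiB pages from all landing in the same slice.  The real Sandy
     * Bridge EP hash is undocumented; these bit choices are made up.    */
    static unsigned slice(uint64_t paddr)
    {
        uint64_t line = paddr >> 6;                 /* 64 B line address */
        return (unsigned)((line ^ (line >> 6) ^ (line >> 13)) & 0x7);
    }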
Internal to each L3 slice, the mapping might not be as complex as I speculated in my previous note. The Xeon 5600 L3 is 2 MiB per core, with 16-way associativity. This is 128 KiB/way, which allows simple binary set selection. Similarly the Xeon E5 2600 (Sandy Bridge EP) L3 is 2.5 MiB per core, with 20-way associativity. This is also 128 KiB/way, which allows simple binary set selection. I will need to do some experiments, but if the mapping is as simple as I hope, then I will be able to use a specialized page allocator to vary the effective size of the L3 cache. I.e., each L3 slice is 128 KiB/way, or 32 4KiB pages "tall". It should be possible to select physical pages that map to any subset of those 32 page-sized set indices, thus limiting the effective size of the L3 cache to N/32 times the "normal" size, where 1<=N<=32. (Note that since the L3 is inclusive, this might also reduce the effective L2 size -- I need more coffee to work out the mappings.)
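For reference, if that simple binary set selection holds, the page "color" such an allocator would need is just the physical page number modulo 32 (a sketch under that unverified assumption):

    #include <stdint.h>

    /* Under simple binary set selection, each way spans 128 KiB = 32
     * pages, so a 4 KiB physical page's set-index "color" is its page
     * number mod 32.  Restricting an allocator to N of the 32 colors
     * would shrink the usable (inclusive) L3 to N/32 of its capacity.   */
    static unsigned page_color(uint64_t paddr)
    {
        return (unsigned)((paddr >> 12) & 0x1F);
    }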