
Mapping when the cache size is not a power of 2

anthony_b_
Beginner

Hi, Greetings,

I am working on an Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, with 4 cores/socket, 2 sockets, and 12MB Smart Cache.

http://ark.intel.com/products/47925/Intel-Xeon-Processor-E5620-12M-Cache-2_40-GHz-5_86-GTs-Intel-QPI

> cat /sys/devices/system/cpu/cpu0/cache/index3/number_of_sets
12288

> cat /sys/devices/system/cpu/cpu0/cache/index3/ways_of_associativity
16

I am confused about two concepts:

1) What does the above output mean? How should I interpret it?
I have two interpretations in mind; please help me decide which is correct.
case (i): 12MB of LLC shared across the 4 cores within a single socket; a separate LLC per socket, with no sharing of LLC from one socket to another. Each slice (the LLC chunk associated with a single core) is 3MB, with 12288/4 = 3072 sets per slice and 16 ways per set within a slice. (Each set has 64 ways in total: S-NUCA.)

case (ii): 12MB of LLC shared across the 8 cores of both sockets together, with no separate LLC per socket. Each slice is 1.5MB, with 12288/8 = 1536 sets per slice and 16 ways per set within a slice. (Each set has 128 ways (16*8) in total: S-NUCA.)

Which case is correct? I guess case (i) is. Please do correct me if I am wrong.

2) How is the addressing done if the cache size / number of sets is not a power of 2?

Explanation of the question:
In the example above, the LLC size is 12MB, shared across 8 cores (so 8 slices), each of size 1.5MB with 1536 sets. This is the typical case for 6MB/12MB/24MB caches (not powers of 2).

In this case, how will the mapping be done? Certain bit combinations are never used in the cache set index field. Does this not create a problem?
To be more clear: 1536 sets means 11 bits of set index (into the L3 cache) are needed to identify a single set, say B6 to B16 (B0 to B5 are the line offset bits). The range of the set index is 000 0000 0000 to 101 1111 1111, so bits B16 and B15 cannot both be 1; that would give a set index greater than the maximum set number available. Can you kindly explain how this problem is resolved in caches whose size is not a power of 2? How does the mapping of a physical address into the LLC work? When a process is allocated memory, the physical addresses will be contiguous; there is no gap at bits 15 and 16, so set indices of the form 11X XXXX XXXX will occur.
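For illustration, here is a small C check of what I mean, assuming a naive index of bits B16:B6 into a 1536-set slice (the indexing scheme is my assumption, which is exactly what I am asking about):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Walk consecutive cache-line addresses and extract bits B16:B6 as
         * a naive set index for a 1536-set slice: indices >= 1536 occur. */
        for (uint64_t paddr = 0; paddr < (1ULL << 18); paddr += 64) {
            unsigned idx = (paddr >> 6) & 0x7FF;   /* 11-bit field B16:B6 */
            if (idx >= 1536) {
                printf("paddr 0x%05llx -> naive set index %u (out of range)\n",
                       (unsigned long long)paddr, idx);
                break;  /* first out-of-range index: 0x18000 -> 1536 */
            }
        }
        return 0;
    }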

 

3) How is cache coherence maintained across the two sockets using QPI?

Thank you for your kind attention and time.

McCalpinJohn
Honored Contributor III

Intel has definitely not released all the details of how they perform the cache mapping -- especially in newer systems.

There are two places where non-power-of-two sizes occur.

  1. Recent Xeon processors generally have 2.5 MiB of LLC per "slice", which corresponds to 128KiB of addresses (2048 cache-line addresses) with 20-way associativity for each location.
  2. Recent Xeon processors perform a hash on many high-order address bits to determine which LLC "slice" a line should map to.  The detailed hash has been reverse-engineered for Xeon E5 v1, v2, and v3 systems with 8 cores; in this case the hash can be expressed as an XOR.  For non-power-of-two slice counts the hash is harder to understand, and I am not aware of any closed-form expressions that have been discovered.
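For a power-of-two slice count, an XOR hash of this kind computes each bit of the slice ID as the parity of a selected subset of physical address bits. A minimal sketch in C -- the masks below are illustrative placeholders only, not the reverse-engineered values (those are in the published work):

    #include <stdint.h>

    /* Parity (XOR reduction) of the address bits selected by 'mask'. */
    static unsigned parity(uint64_t paddr, uint64_t mask)
    {
        return (unsigned)__builtin_parityll(paddr & mask);  /* GCC/Clang builtin */
    }

    /* Hypothetical 8-slice hash: each of the 3 slice-ID bits is the XOR
     * of a different subset of high-order physical address bits.  The
     * masks here are made up for illustration. */
    unsigned llc_slice(uint64_t paddr)
    {
        const uint64_t mask0 = 0x000000052c240000ULL;  /* placeholder */
        const uint64_t mask1 = 0x00000002e1a80000ULL;  /* placeholder */
        const uint64_t mask2 = 0x0000000135700000ULL;  /* placeholder */

        return  parity(paddr, mask0)
             | (parity(paddr, mask1) << 1)
             | (parity(paddr, mask2) << 2);
    }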

 

anthony_b_
Beginner

Thanks for your kind response.

Yes, I am aware of some of the reverse-engineering work on the hash used to identify the slice; the hash uses the set index and tag bits together. But my question is not about identifying the slice. It is only about the mapping of the set index bits onto the corresponding set within a slice, in the case of non-power-of-two-sized caches.
As you already mentioned,

 "Intel has definitely not released all the details of how they perform the cache mapping -- especially in newer systems."

it need not be a newer machine. An explanation for any older configuration with a non-power-of-two cache size would also be helpful.
McCalpinJohn
Honored Contributor III

Slide 8 of the presentation at http://www.hotchips.org/wp-content/uploads/hc_archives/hc22/HC22.24.620-Hill-Intel-WSM-EP-print.pdf shows that the L3 cache on the Xeon E56xx processors is built up from 6 slices, each 2MiB and 16-way associative (so 2048 cache-line-sized sets per L3 slice).

I have not done any experimentation to see what hash is used to select the set.   L3 latencies are probably high enough that a full divide by 6 on the high-order address bits could be implemented.   Then the remainder could be used to select the slice, and the low-order 11 bits of the quotient could be used to select from the 2048 sets in the slice (just like in an ordinary power-of-two cache).
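A minimal sketch of that scheme in C (the sizes match the Westmere-EP slide above, but the divide-based mapping itself is my speculation, not a disclosed Intel algorithm):

    #include <stdint.h>

    #define N_SLICES        6      /* six L3 slices on Xeon E56xx       */
    #define SETS_PER_SLICE  2048   /* 2 MiB / (64 B lines * 16 ways)    */

    /* Divide the cache-line address by the slice count: the remainder
     * selects the slice, and the low-order 11 bits of the quotient
     * select one of the 2048 sets within that slice. */
    void map_line(uint64_t paddr, unsigned *slice, unsigned *set)
    {
        uint64_t line = paddr >> 6;          /* strip 64-byte line offset */
        *slice = (unsigned)(line % N_SLICES);
        *set   = (unsigned)((line / N_SLICES) % SETS_PER_SLICE);
    }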

Potentially applicable anecdotes:

  • If I recall correctly, IBM's POWER5 processor used a full divide-by-three on the physical address to determine the mapping into the 3 slices of the L2 cache on that system, so it is certainly plausible to do it that way.
  • On the other hand, in POWER4 we did not have enough time to perform the full divide by three, so we used an XOR-type hash over a smaller number of bits.   This did not give perfect uniformity of access across the three L2 slices, but it was close enough that very careful testing was required to observe the difference.
anthony_b_
Beginner

Can you tell me, regarding the specification at the URL,

http://ark.intel.com/products/47925/Intel-Xeon-Processor-E5620-12M-Cache-2_40-GHz-5_86-GTs-Intel-QPI

what "12MB Smart Cache" means: is it for one socket, or is it the collective cache size for the two sockets together (each socket having 6MB)?

McCalpinJohn
Honored Contributor III

Everything on the ark web page above is for a single "processor" in a single socket -- so each socket has 12 MiB of L3 cache.   The L3 caches act like a single cache within each socket, but act as independent caches across sockets.   So it is possible for both sockets to have their L3 cache full of the same 12MiB of data, or it is possible for each socket to be holding 12MiB of different data --- or any combination in between.

In most Intel processors there is one active L3 slice for each active core, but this may not be required.  Some models have been configured with 2 MiB of L3 per core (16-way set associativity instead of 20-way), but this was presumably a cost-reduction measure with minimal architectural implications.

Whether the L3 caches on the two chips can be used as a single cache depends on the use model.  Data requests from socket 0 may hit a cache line in the other (remote) L3 cache.  Whether the data will be serviced from the remote cache or from local memory depends on the cache states of the corresponding lines.  There are a few cases where local L2 misses can be satisfied from the remote chip's cache (particularly if the data is dirty); otherwise the data will come from memory.

anthony_b_
Beginner

Thank you for the explanation; it clarified some of the assumptions I made.

Earlier I thought the set computation was straightforward, using bits 17:6, but you mentioned it can also be done through a hash.
Secondly, I assumed that each core must be associated with a slice. That is also not true, though I wonder how such a configuration works.

Some models have been configured with 2 MB of L3 per core (16-way set associativity instead of 20-way)

I understood this statement to mean slices of varying sizes.

McCalpinJohn
Honored Contributor III

From the die photo on slide 8 of the Hot Chips 22 presentation linked above, it is clear that the six 2MiB L3 slices are *adjacent* to the six cores, but a review of the Intel product listings for this processor family at http://ark.intel.com/products/series/47915/Intel-Xeon-Processor-5600-Series#@Server shows that almost all of the parts have a 12MiB L3 cache (6 slices), whether they have 4 or 6 cores.  At the very bottom of the list there is one product (Xeon E5603) that has only 4 MiB L3 cache (2 slices), despite having 4 cores enabled.  So it is pretty clear that the cores and slices can be enabled independently.

For newer processors the block diagrams produce a stronger impression of "coupled" core+L3 slice, but a careful review of the product offerings shows that cores and L3 slices can be enabled independently.  For example, the Xeon E5-2643 v3 has 6 cores and 20MiB L3 cache (8 slices), while the Xeon E5-2643 v2 has 6 cores and 25MiB L3 cache (10 slices).  There are lots of other examples.

In an earlier note, I speculated that some of the newer processors might have 2MiB slices (obtained by reducing the associativity from 20-way to 16-way).   Looking back over the available products, I don't see any that require this explanation -- all of the configurations that I have reviewed today can be built from 2.5MiB L3 slices, as long as the number of L3 slices does not need to match the number of cores....  (UPDATE 2016-11-02 -- I found some examples that can't be built with 2.5 MiB cache slices -- see note below).

McCalpinJohn
Honored Contributor III

I looked over the Xeon E5-2xxx processors (v1/v2/v3/v4) this morning and did not find any that could not be built with an integral number of 2.5 MiB L3 slices.  

But I could not shake the feeling that I had seen at least one processor that looked like it had to be built with 2.0 MiB L3 slices.   I kept looking and finally found two products that are clearly in the product lines that use multiple 2.5 MiB L3 slices, but which have a reported L3 size that is not divisible by 2.5 MiB:

  • Xeon E5-1650 (v1)    6 cores with 12 MiB L3
  • Xeon E5-1650 v2      6 cores with 12 MiB L3

These are the two that got me thinking that Intel had to have a way to reduce the associativity of an L3 slice from 20-way to 16-way to get a 2 MiB slice.

I also noticed that the Xeon E5-1650 v3 and v4 remain at 6 cores, but both have 15 MiB L3 -- suggesting that they have six 2.5 MiB L3 slices.

anthony_b_
Beginner

Thank you for the references.

For newer processors the block diagrams produce a stronger impression of "coupled" core+L3 slice, but a careful review of the product offerings shows that cores and L3 slices can be enabled independently.  For example, the Xeon E5-2643 v3 has 6 cores and 20MiB L3 cache (8 slices), while the Xeon E5-2643 v2 has 6 cores and 25MiB L3 cache (10 slices).  There are lots of other examples.

When I looked into the product specifications, for example at http://ark.intel.com/products/81900/Intel-Xeon-Processor-E5-2643-v3-20M-Cache-3_40-GHz, I could not find details of the number of slices (or the slice size, 2.5MB vs. 2MB).

From all the information you have provided, it seems the slice size is 2.5MB or 2MB. Can't the slice size be larger (like 3MB or 4MB) or smaller (like 1.5MB)? For example, at http://ark.intel.com/products/47925/Intel-Xeon-Processor-E5620-12M-Cache-2_40-GHz-5_86-GTs-Intel-QPI, the machine above has 4 cores but 12MB of LLC, so I am assuming the slice size would be 3MB.

Are other slice sizes (1.5/3/4MB) not feasible? Can you kindly provide any reference for identifying the slice size / number of slices?
anthony_b_
Beginner

...that cores and L3 slices can be enabled independently.

Can you kindly shed some light on how such configurations work with unequal numbers of slices and cores? From the above statement, can I infer that slices can also be disabled, as cores can, when not required (to save power)?

McCalpinJohn
Honored Contributor III

My comments about slice sizes were only in reference to the newer processors -- Xeon E5 v1, v2, v3, v4 -- all of which have 2.5MiB L3 slices in the most common configurations.   This can be seen in many places -- for example Figure 8 of https://software.intel.com/en-us/articles/intel-performance-counter-monitor, and in any of the "Hot Chips" presentations on the Xeon E5 processors, such as http://www.hotchips.org/wp-content/uploads/hc_archives/hc26/HC26-12-day2-epub/HC26.12-8-Big-Iron-Servers-epub/HC26.13.832-IvyBridge-Esmer-Intel-IVB%20Server%20Hotchips2014.pdf

Your Xeon E5620 is an older "Westmere" processor that has six 2MiB L3 slices.  This is clear from the text and the die photo on the right side of slide 8 of the Hot Chips presentation at http://www.hotchips.org/wp-content/uploads/hc_archives/hc22/HC22.24.620-Hill-Intel-WSM-EP-print.pdf

For any of these processors, the L3 slices are based on the physical layout of the die, so they can't be made bigger -- the only issue is whether there is evidence that some products can be configured to use less than the full 2.5 MiB.     In the Xeon E5 processors the ring-based architecture allows processors and L3 slices to be enabled independently.  The hardware has to know how many L3 slices are enabled so that it can set up the address hash, but this is completely independent of the number of processor cores enabled -- the ring provides the same functionality as a general shared bus on earlier multiprocessor systems.

anthony_b_
Beginner

Thank you for your kind response.

The hardware has to know how many L3 slices are enabled so that it can set up the address hash, but this is completely independent of the number of processor cores enabled -- the ring provides the same functionality as a general shared bus on earlier multiprocessor systems.

1) The latter part of the statement makes complete sense, but I would like to know more about the first part. As you mentioned earlier, the slice size is limited by die-area constraints (though the overall LLC size remains the same). Bigger caches always improve performance, but SRAM cost and area are the constraint here. Given that, one would not want to waste cache capacity available on the chip (in view of performance), so what is the point of disabling slices? I cannot find any reason why certain slices would be disabled. If that is the case, then the number of slices for a given machine always remains the same, so my opinion is that there is no such thing as enabling/disabling slices. The number of active cores can of course vary, as you rightly mentioned (I found BIOS settings to disable certain cores).

As you mentioned in the previous comments,

the low-order 11 bits of the quotient could be used to select from the 2048 sets in the slice (just like in an ordinary power-of-two cache).

2) Can you share some thoughts on how a set is selected when the cache/slice size is not a power of two (like 2.5MB)? 11 bits seem insufficient, and 12 bits would be more than needed (with a gap: the 2 MSBs cannot both be one).

3) I would like to know one more point: how can one find out the number of slices in the LLC of a given machine?

McCalpinJohn
Honored Contributor III

(1) Intel has disclosed die photos and layouts for some of their processors, but they clearly state that they may make other layouts that they don't feel any need to disclose.   Reduced core-count parts typically have one L3 slice enabled for each core that is enabled, but as I mentioned before there are a number of exceptions.   The decision about what configurations to turn into products is a complicated set of tradeoffs related to the yield of the cores, the yield of the caches, the price points that the company wants to address, the continuity of product characteristics from older to current to future processors, and many other factors.   There is no reason to believe that it is possible to develop a detailed understanding of the motivations behind the final product decisions without having participated in the company's internal decision-making process.

As a specific example of products with disabled L3 slices, Volume 2 of the datasheet for Xeon E5 v2, Xeon E5 v3, and Xeon E5 v4 includes documentation of a register that shows (using a bit map) which L3 slices are disabled.   The names are a bit different from one generation to another, but Xeon E5 v2 calls this bit map LLC_SLICE_IA_CORE_EN, while Xeon E5 v3 and v4 call it LLC_SLICE_EN.

(2) The 2.5 MiB slices are 20-way set-associative, so they can be internally addressed using 11 bits above the cache line address: 2048*64*20 = 2.5*1024*1024.    The selection of which slice to use is not so easy, since there may be 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, or 22 slices enabled.   For the case of 2, 4, or 8 enabled slices, the mapping has been reverse-engineered and described in https://cmaurice.fr/pdf/raid15_maurice.pdf. For less convenient numbers, the slice select could start with a modulo operation, or the modulo operation could be approximated by an XOR-based hash across a lot of bits.   I have a suspicion that Intel uses the latter because I found that, of the 32768 cache lines in a 2MiB page, the 12-core Intel Xeon E5 v3 processor assigned 2688 cache lines to each of 8 of the 12 slices and 2816 cache lines to each of the remaining 4 slices.  I can't think of any reason why a modulo-based hash would produce such a large difference (almost 5%) in the distribution (but I suppose it is possible that high-order bits used for randomization might account for this much deviation even with a perfect modulo-12 as a starting point).
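The modulo part of that argument is easy to check with a few lines of C: over the 32768 consecutive cache lines of a 2 MiB page (assumed aligned), a plain modulo-12 can only produce a 2731-versus-2730 split, nowhere near the measured 2816/2688:

    #include <stdio.h>

    int main(void)
    {
        /* Count how a pure modulo-12 hash distributes the 32768
         * consecutive cache lines of one aligned 2 MiB page. */
        int count[12] = {0};
        for (int line = 0; line < 32768; line++)
            count[line % 12]++;

        for (int s = 0; s < 12; s++)
            printf("slice %2d: %d lines\n", s, count[s]);  /* 8x 2731, 4x 2730 */
        return 0;
    }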

(3) For the Xeon E5 v2, v3, and v4 processors, the number of bits set in the LLC_SLICE_EN field indicates the number of LLC slices.  For Xeon E5 v3 and v4, MSR 0x702 "U_MSR_PMON_GLOBAL_CONFIG" returns the number of CBos enabled.
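As a sketch of how one might read that MSR on Linux through the msr driver (as root, after "modprobe msr"); note that the exact bit field holding the CBo count is an assumption here -- check the uncore performance monitoring guide for the real layout:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* MSR 0x702 = U_MSR_PMON_GLOBAL_CONFIG on Xeon E5 v3/v4. */
        int fd = open("/dev/cpu/0/msr", O_RDONLY);
        if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

        uint64_t val;
        if (pread(fd, &val, sizeof(val), 0x702) != (ssize_t)sizeof(val)) {
            perror("pread MSR 0x702");
            return 1;
        }
        close(fd);

        /* ASSUMPTION: the enabled-CBo count is in the low bits. */
        printf("U_MSR_PMON_GLOBAL_CONFIG = 0x%016llx (CBos: %llu)\n",
               (unsigned long long)val, (unsigned long long)(val & 0x3f));
        return 0;
    }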

anthony_b_
Beginner

Thank you so much.

I will move forward with these valuable inputs and come back after I have studied these topics a little more.