Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Number of L3 slices per CHA

ManiTofigh
Beginner
859 Views

Hello, 

I've been looking into the number of L3 slices a CHA handles, and I read in @McCalpinJohn's research paper that _typically_ each slice is handled by one CHA. 

However, there are cases like the "Knights Landing (KNL): 2nd Generation Intel® Xeon Phi™ Processor" where every 2 cores are connected to one CHA (I assume that in that case each CHA handles 2 slices?).

So is it safe to assume that, by reading the CAPID6 register's value and counting the number of active CHAs, we can derive the number of slices assigned to each CHA as `num_cores / num_CHAs`, so that `cha_num * slice_per_cha` gives the total number of slices for the L3/LLC?
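
To make the arithmetic concrete, the sketch below is roughly what I have in mind. How the CAPID6 value is actually obtained (PCI configuration space, with model-specific offsets) is left out, and the even division of slices among CHAs is my assumption:

```c
/* Sketch only: derive slice counts from a CAPID6-style bitmap value.
 * Reading the register itself (PCI config space, model-specific offsets)
 * is intentionally left out; the even split of slices per CHA is an
 * assumption, which is exactly what I am asking about. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <capid6_hex> <num_cores>\n", argv[0]);
        return 1;
    }
    unsigned long long capid6 = strtoull(argv[1], NULL, 16);
    unsigned long num_cores   = strtoul(argv[2], NULL, 0);

    int num_chas = __builtin_popcountll(capid6);   /* one set bit per active CHA */
    if (num_chas == 0 || num_cores == 0)
        return 1;

    unsigned long slices_per_cha = num_cores / num_chas;   /* assumed even split */
    printf("active CHAs: %d, assumed slices per CHA: %lu, total slices: %lu\n",
           num_chas, slices_per_cha, slices_per_cha * num_chas);
    return 0;
}
```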

Are there any cases that contradict this assumption?

Thank you for your time in advance.

- Mani

0 Kudos
5 Replies
ManiTofigh
Beginner
718 Views

Any take on this question would be appreciated @McCalpinJohn !

 

Mani

0 Kudos
McCalpinJohn
Honored Contributor III
545 Views

I think there is some confusion about how these units are distributed.

Each "tile" on the processor die has three parts:

  1. A block containing a CPU core, including its private L1 and L2 caches
  2. A block containing a "slice" of the distributed CHA+SF+LLC
    • CHA == Coherence and Home Agent
    • SF == Snoop Filter
    • LLC == Last Level Cache
  3. A block containing the "Common Mesh Stop" (CMS) hardware, connecting the CPU and CHA+SF+LLC blocks to the mesh interfaces in the UP/DOWN/LEFT/RIGHT directions.

At each "tile" the CMS block is always enabled and one or both of the other two blocks can be enabled, with the restriction that an enabled CPU core is always paired with an enabled CHA+SF+LLC block.  Other than this requirement, there is no other association between a CPU core and its co-located CHA+SF+LLC block.

Each "slice" of the CHA+SF+LLC is responsible for a fraction of the overall address space, with consecutive cache lines assigned to different CHA+SF+LLC blocks using an undocumented pseudo-random permutation hash function.

When a memory request misses in a core's L1 & L2 caches, a translation agent at the interface between the CPU block and the on-chip mesh interconnect uses the physical address of the transaction to determine which of the CHA+SF+LLC blocks is responsible for that address, then sets up a mesh transaction to route the message from the current mesh location to the mesh location where the responsible CHA+SF+LLC is located.  When the message arrives at that mesh location, it is sent to the CHA+SF+LLC block.  The CHA, SF, and LLC each perform a different operation on the request:

  • The CHA checks to see if the address corresponds to a transaction that is currently pending, and maintains responsibility for coherence and ordering for the transaction.
  • The SF keeps track of all of the lines held in the private L1 and L2 caches of the chip, so it looks up whether this new transaction is to an address that another core may have cached.
  • The LLC holds the portion of the shared L3 cache that includes the incoming address, so it looks up the address to see if the data is cached, and what the cache state is in the L3 cache.

Depending on the combination of results, the CHA will oversee the subsequent transactions, which may include making requests to DRAM, sending snoop requests to other sockets, etc.

 

So the number of "slices" of the CHA is the same as the number of "slices" of the SF is the same as the number of "slices" of the LLC, and that value is equal to the number of bits set in the CHA+SF+LLC CAPID register for that chip.   A nearby CAPID register holds the bit map of the enabled CPU cores of the chip.   On every system I have tested, a "set" value in the CPU core bit mask always has a matching "set" value in the CHA+SF+LLC bitmap, but the reverse is not required.
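
If it helps, the counting and the "core bit implies CHA bit" check can be written down in a few lines. This is a sketch only: how the two CAPID bitmap values are read (PCI configuration space, with model-specific bus/device/function/offset values) is left out, and a bit-for-bit positional correspondence between the two bitmaps is assumed here rather than guaranteed:

```c
/* Sketch: given the CHA+SF+LLC bitmap and the CPU-core bitmap (obtained
 * elsewhere from the model-specific CAPID registers), count the slices and
 * check that every enabled core position has an enabled CHA+SF+LLC position. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <cha_bitmap_hex> <core_bitmap_hex>\n", argv[0]);
        return 1;
    }
    unsigned long long cha_map  = strtoull(argv[1], NULL, 16);
    unsigned long long core_map = strtoull(argv[2], NULL, 16);

    printf("enabled CHA+SF+LLC slices: %d\n", __builtin_popcountll(cha_map));
    printf("enabled CPU cores:         %d\n", __builtin_popcountll(core_map));

    /* Every set core bit should have a matching set CHA bit; the reverse
     * need not hold (CHA+SF+LLC blocks can be enabled without a core). */
    if (core_map & ~cha_map)
        printf("unexpected: enabled core with no matching enabled CHA+SF+LLC\n");
    return 0;
}
```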

 

A different step in the mapping is related to the number of address bits that the Snoop Filter and LLC use for internal indexing.  For the SKX/CLX families, 11 bits are used to index into the 2048 "congruence classes" of each SF slice and each LLC slice.  Each of those "congruence classes" is set-associative, with SKX/CLX processors having 12-way-associative Snoop Filters and 11-way-associative L3 caches.

It is hard to develop intuition about what these associativities mean in terms of contiguous blocks of memory, because the address-to-slice hash means that consecutive addresses on different pages typically map to different CHA+SF+LLC slices.  Consider accessing the first cache line in a set of 4KiB pages.  The 5 bits of physical address above the 4KiB page boundary determine which of 32 ranges of 64 contiguous congruence classes will handle the page.  If you pick a bunch of 4KiB pages for which those 5 address bits are the same, the zero lines of the pages will all map to a single congruence class.  BUT, different 4KiB pages will typically map to different CHA+SF+LLC "slices".

In the best case you can spread the zero lines of the pages over 32 * NSLICES congruence classes, so those accesses will see an effective LLC associativity of 11 * 32 * NSLICES.  In the worst case all of the pages map to the same slice and to the same 1 of 32 ranges within the slice, giving an effective LLC associativity of 11-way.  That is a range of 896:1 for effective associativity on a 28-slice SKX or CLX processor.
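
Restating the last few sentences as explicit arithmetic (same numbers as above, nothing new):

```c
/* Worked numbers for the effective-associativity range described above:
 * 28-slice SKX/CLX, 11-way L3 slices, 2048 congruence classes per slice. */
#include <stdio.h>

int main(void)
{
    const int ways    = 11;                  /* L3 associativity per slice */
    const int nslices = 28;                  /* CHA+SF+LLC slices on the die */
    const int ranges  = 1 << 5;              /* 5 address bits above 4KiB -> 32 ranges */

    printf("congruence classes per range: %d\n", 2048 / ranges);                 /* 64   */
    printf("best-case effective associativity:  %d\n", ways * ranges * nslices); /* 9856 */
    printf("worst-case effective associativity: %d\n", ways);                    /* 11   */
    printf("ratio: %d:1\n", (ways * ranges * nslices) / ways);                   /* 896  */
    return 0;
}
```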

OK, I will shut up and go home now.....

0 Kudos
ManiTofigh
Beginner
514 Views

Thank you very much for your comprehensive response, Dr. @McCalpinJohn! This was immensely helpful and also brought some of my misunderstandings to my attention. In fact, the image of the SKX die tile(s) on this WikiChip page makes a lot more sense with this explanation.

As a follow-up question: you mention here that "shared L3 caches in Intel multicore processors are composed of “slices” (typically one “slice” per core)".

What does the die layout end up looking like in the non-typical case? You mentioned that each enabled core must be paired with a CHA+SF+LLC block, although the reverse is not required, so I'd imagine the non-typical case is one where there are more enabled CHA+SF+LLC blocks than cores, since the other way around is not possible. But what is an example of such a case? Does it arise from core failures, or from the intentional design of specific processors?

Thank you for your time in advance.

Mani

0 Kudos
McCalpinJohn
Honored Contributor III
485 Views

There are only a small number of die layouts -- products that are not fully-configured have disabled CPU Cores and/or disabled CHA+SF+LLC units at existing tile locations.   

 

The technical report available at  https://hdl.handle.net/2152/89580 shows the statistics of patterns of disabled tiles in a sample of 4200 Xeon Phi 7250 ("Knights Landing") processors.  My interpretation of the data is that 30%-35% of the disabled cores are disabled because they are defective, while 65%-70% of the disabled cores were disabled for other reasons.  Read the report for more details and discussion.  For these Xeon Phi 7250 processors, all 38 of the CHA+SF blocks are enabled, while both cores are disabled on each of 4 tiles -- providing 68 active cores on 34 tiles.  Unlike later processors, on KNL the x2APIC IDs of the cores are not renumbered to skip over disabled cores.  This makes analysis much easier -- just run the CPUID instruction on each core and look for x2APIC IDs that are missing from the resulting set.  I only had to determine the mapping from x2APIC IDs to locations on the die once and could then do a direct lookup of the location for each missing (or present) x2APIC ID number.
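
A minimal sketch of that kind of scan on Linux is shown below. It assumes CPUID leaf 0xB reports the logical processor's x2APIC ID in EDX and that the process can be pinned to each online logical processor in turn; the bound on the ID array is an arbitrary placeholder:

```c
/* Sketch: pin to each online logical processor, record its x2APIC ID
 * (CPUID leaf 0xB, subleaf 0, returns it in EDX), then report IDs missing
 * from the set.  On KNL the IDs are not renumbered, so gaps correspond to
 * disabled cores. */
#define _GNU_SOURCE
#include <cpuid.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    unsigned char seen[1024] = {0};          /* assumes x2APIC IDs < 1024 */
    unsigned max_id = 0;

    for (long cpu = 0; cpu < ncpus; cpu++) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0)
            continue;                        /* skip CPUs we cannot run on */

        unsigned eax, ebx, ecx, edx;
        __cpuid_count(0xB, 0, eax, ebx, ecx, edx);
        if (edx < sizeof(seen)) {
            seen[edx] = 1;
            if (edx > max_id) max_id = edx;
        }
    }

    for (unsigned id = 0; id <= max_id; id++)
        if (!seen[id])
            printf("x2APIC ID %u is missing (disabled core?)\n", id);
    return 0;
}
```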

 

For later processors based on the mesh architecture, the user-visible Logical Processor numbers (including the x2APIC IDs) are renumbered to skip over the disabled cores.  This means that I had to search for a set of rules that Intel uses to map the active x2APIC IDs to locations on the chip.  So far I have found these rules to be fixed for each generation of product, but they have sometimes changed between generations.  Fortunately I was able to derive such rules for our local systems after examining only 5-6 nodes in detail.  With subsequent systems it has been quicker -- the rules sometimes change, but the classes of rules are quite similar.

0 Kudos
ManiTofigh
Beginner
433 Views

Thank you very much Dr. McCalpin! This was very helpful and answered all my concerns.

 

Best,

Mani

0 Kudos