I'm currently investigating Intel's Cache Allocation Technology and was wondering if anyone could give me any insight into the cache way organization on my Skylake (Xeon Gold 6148) CPU. Particularly, I'm interested to know if all 11 cache ways in the L3 are present on every single core, or if each way is only distributed over a few cores (and what that distribution is). This information will help me ensure the locality of my computation.
CPU model-specific considerations may not be addressed well on this forum, which in the past seemed to be directed toward portable programming practices that scale up as processors support more parallelism. That said, this is only my personal opinion, as Intel appears to have lost interest in the forums as currently constituted.
You may be interested in the Line Fill Buffer (LFB) and superqueue implications, which do impose a limit on the number of parallel data streams; the LFB limit can be addressed at the application level. Any associated superqueue effects are probably model specific and hidden from public view to the extent that Intel is able to accomplish that. At the L2 and L3 level, these buffer limits normally overshadow any cache-way limitations, possibly as a matter of design to accommodate both inclusive and exclusive caches, since they come into play as soon as there are L1 evictions due to the number of ways, capacity, or whatever. In the past, the number of ways supported by the L1 was a significant limitation on hyperthreading performance; not so much since Nehalem introduced the LFB version of write combining.
It seems that Intel made a small adjustment to the LFB in Skylake, at least on the enterprise CPU models, such that each thread on each core should have 1 to 11 LFBs available, the number evidently depending on competition from other threads running on that core. This does not change the fact that applications which need more than half that number of parallel streams are poor candidates for hyperthreading. Where it is convenient to split inner loops and use one thread per core, limiting the number of separate data streams touched in one loop to 9 or 10 should avoid LFB saturation, and tuning done for previous CPU models should remain adequate. To some extent, the Intel compilers accomplish this splitting automatically at -O3; the compiler's opt-report will give some indication of whether it happened, short of serious run-time profiling with VTune or equivalent. In principle, then, the number of parallel data streams should not impose a limit on scaling to larger numbers of cores, but VTune can indicate whether this is realized in practice on a specific platform.
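A language-agnostic sketch of the loop-splitting idea: the array names, the stream counts, and the budget of ~10 streams per loop are made up for illustration here -- in practice this applies to the compiled C/Fortran inner loop, and the compiler's opt-report tells you whether the split happened.

```python
# Hypothetical illustration of loop fission ("loop distribution") to limit
# the number of concurrent data streams per loop. The budget of ~10 streams
# is illustrative, based on the LFB discussion above.

STREAM_BUDGET = 10  # illustrative: streams one core can sustain without LFB saturation

def fused_update(a, b, c, d, e, f, g, h, p, q, r, s):
    # One loop touching 12 distinct arrays => 12 data streams: over budget.
    for i in range(len(a)):
        a[i] = b[i] + c[i] * d[i] + e[i] * f[i]
        p[i] = q[i] + r[i] * s[i] + g[i] * h[i]

def fissioned_update(a, b, c, d, e, f, g, h, p, q, r, s):
    # Split into two loops, each touching only 6 arrays: within budget.
    for i in range(len(a)):
        a[i] = b[i] + c[i] * d[i] + e[i] * f[i]   # streams: a, b, c, d, e, f
    for i in range(len(p)):
        p[i] = q[i] + r[i] * s[i] + g[i] * h[i]   # streams: p, q, r, s, g, h
```

Both versions compute the same results; the fissioned version just never has more than six streams live in any one loop.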
This is as much detail as I am prepared to discuss, and I think possibly the limit to which such discussions are suitable on this forum. You will see interesting discussions in further detail on StackOverflow.
The L3 cache is distributed in a way that makes "locality" (effectively) impossible to achieve. I have done a couple of presentations on the address hashing used in the Xeon Scalable processors, including the Xeon Gold 6148.
On the Xeon Gold 6148, every aligned block of 256 cacheline addresses is distributed across the 20 CHA/L3 blocks in a pseudo-random pattern, with 13 cache lines assigned to each of CHA/L3 blocks 0-15 and 12 cache lines assigned to each of CHA/L3 blocks 16-19. There are 256 different 256-block assignment patterns, each being a "binary permutation" of each of the others, with the 8-bit "binary permutation number" selected by 8 XOR-reductions of different subsets of the higher-order address bits. There are more details in the presentation at https://www.ixpug.org/components/com_solutionlibrary/assets/documents/1538092216-IXPUG_Fall_Conf_201...
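The structure of that slice-select hash can be sketched as follows. The base pattern and the XOR-reduction masks below are made-up placeholders, not the real hash; only the structural facts are taken from the description above: 256-cacheline aligned blocks, 20 slices, 13 lines to each of slices 0-15 and 12 to each of slices 16-19, and 256 patterns related by "binary permutation" (XOR of the line index with an 8-bit permutation number).

```python
# Sketch of the *structure* of the Skylake-SP slice-select hash described
# above. BASE_PATTERN and PERM_MASKS are hypothetical placeholders.

N_SLICES = 20
LINE = 64          # bytes per cache line
BLOCK = 256        # cache lines per aligned block (16 KiB)

# Hypothetical base assignment: slices 0-15 get 13 lines, 16-19 get 12.
BASE_PATTERN = [i % N_SLICES for i in range(BLOCK)]

# Hypothetical masks for the 8 XOR-reductions over high-order address bits.
PERM_MASKS = [0x1F2E3D4C5 << k for k in range(8)]  # placeholders

def parity(x: int) -> int:
    return bin(x).count("1") & 1

def perm_number(addr: int) -> int:
    # 8-bit "binary permutation number": 8 XOR-reductions of subsets of
    # the address bits above the 16 KiB block offset.
    high = addr >> 14
    return sum(parity(high & m) << b for b, m in enumerate(PERM_MASKS))

def slice_of(addr: int) -> int:
    line_idx = (addr // LINE) % BLOCK
    # XORing the line index with the permutation number permutes the base
    # pattern, so every block keeps the same 13/12 slice distribution.
    return BASE_PATTERN[line_idx ^ perm_number(addr)]
```

Because XOR with a constant is a bijection on the 256 line indices, all 256 patterns are permutations of one another and preserve the 13/12 distribution, exactly as described above.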
So each L3 "slice" is 11-way set-associative, but different 4KiB pages will map their 64 cacheline addresses across the 20 L3 slices in one of 256 different permutations. My data on the Xeon Gold 6148 shows that there are only 256 unique patterns for any power-of-2 blocksize of 4KiB or larger.
Unfortunately, this "slice select hash" is only part of the story. The address hash also maps each address to one of the 2048 "sets" within the target L3 slice, and this "set select hash" is nearly impossible to derive analytically -- without any way to pin down the actual set numbers, there are factorial(2048) possible orderings to consider when matching a guessed set numbering against the hardware's actual one....
Even without knowledge of the "set select hash", the worst case behavior can be described -- since each 4KiB page assigns 4 cachelines to each of 6 of the CHAs, it is possible to overflow up to 6 L3 sets using as few as 3 4KiB pages.
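The worst-case arithmetic can be made explicit, with the constants taken from the text (11 ways per slice, up to 4 lines of a page landing in one slice):

```python
# Worst-case L3 set overflow arithmetic from the discussion above: if the
# 4 cache lines that each of several 4KiB pages sends to the same slice all
# land in the same set, the 11 ways of that set overflow quickly.

WAYS = 11                      # associativity of each L3 slice
LINES_PER_PAGE_PER_SLICE = 4   # worst case from the text

def pages_to_overflow(ways=WAYS, lines=LINES_PER_PAGE_PER_SLICE):
    # Smallest number of pages whose contributions exceed the ways of one set.
    pages = 1
    while pages * lines <= ways:
        pages += 1
    return pages
```

Three pages suffice: 3 pages x 4 lines = 12 lines competing for 11 ways.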
To make it worse, the L3 only receives L2 victims if the processor hardware decides that there is a good probability that subsequent accesses to the line will occur soon enough to find the data in the L3. There is no documentation on the nature of the heuristics used by this mechanism, but there are performance counters that give an indication of how many of the "L2 victims caused by L2 fills" are sent to the L3 (IDI_MISC.WB_UPGRADE) and how many of the "L2 victims caused by L2 fills" are either dropped (if clean) or sent straight back to memory (if dirty) (IDI_MISC.WB_DOWNGRADE). I say "an indication" because these events don't appear to count L2 victims caused by interventions, and they don't always add up to the total number of "L2 victims caused by L2 fills" expected (though it is usually close -- within 5% or so).
This was probably a completely useless answer for you, but I have been working on this for so long that it just comes out all at once.....
Distributing the addresses across the "slices" of the L3 (and CHA and Snoop Filter) is the only way to maintain scalability. With this pseudo-random distribution across "P" L3/CHA/SnoopFilter slices, each slice only has to process 1/P of the traffic, or an average of 1 core's traffic if there are "P" cores. Private L3's would give lower latency (for working sets small enough to fit), but then the Snoop Filter would need to be extended to track the data in the L3 caches as well as the L2/L1 caches. When "P" is not a power of 2, the options for hashing addresses to slices involve unpleasant tradeoffs between uniformity, latency, and power consumption.
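A toy model shows why this distribution balances load -- using a generic hash as a stand-in for Intel's actual (undocumented) slice-select hash:

```python
# Toy illustration of why hashing addresses across P slices balances load:
# with any reasonably uniform hash (here a generic cryptographic hash,
# NOT Intel's actual hash), each slice sees about 1/P of the traffic.

import hashlib
from collections import Counter

P = 20  # CHA/L3 slices on the Xeon Gold 6148

def toy_slice(addr: int) -> int:
    # Stand-in for the real slice-select hash.
    digest = hashlib.blake2b(addr.to_bytes(8, "little"), digest_size=2).digest()
    return int.from_bytes(digest, "little") % P

def slice_load(addresses):
    # Count how many references land on each slice.
    return Counter(toy_slice(a) for a in addresses)
```

For a stream of sequential cacheline addresses, every slice ends up handling roughly 1/P of the references, so no single slice becomes a hot spot as the core count grows.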
You can get better latency and lower power consumption if you split everything up as far as possible -- e.g., for a 24-core/6-DRAM-channel processor, a NUMA node could hypothetically consist of *one* DRAM channel and 4 cores (located close together on the die), with the addresses for that node hashed across only 4 L3/CHA/SnoopFilter slices for processing. In this case the next-level directories would need a lot more memory, but you would get lower local latency for L3 hits (on a much smaller L3) and lower latency on local DRAM accesses (for a much smaller amount of memory) in exchange for higher latency for everything else. Such specialization is a good idea for a fixed-function system, but the pain generally outweighs the benefit for general-purpose systems.