Modern server processors that precede SKX use a ring on-die interconnect that is 32 bytes wide in each direction. SKX and CSL processors use a mesh interconnect, but it's not clear to me whether the data network of the mesh was widened to 64 bytes per cycle in each direction or remains 32 bytes wide. Is there an Intel source that clarifies this?
In the SKX Uncore Performance Monitoring Reference manual (https://software.intel.com/en-us/download/intel-xeon-processor-scalable-memory-family-uncore-performance-monitoring-reference-manual), Section 18.104.22.168 notes that on the BL mesh, two transfers are required to move one cache line.
My experiments confirm this -- two uncore cycles and two increments of [HORZ,VERT]_RING_BL_IN_USE for each cache line moved on a mesh segment.
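The arithmetic behind that observation can be sketched in a few lines of Python. This is only an illustration, not a measurement tool: the counter deltas are made-up numbers, the function names are hypothetical, and it assumes the IN_USE delta was read on a single mesh segment that carried only the cache lines being counted.

```python
CACHE_LINE_BYTES = 64

def bl_cycles_per_line(in_use_delta, lines_moved):
    """Average BL in-use cycles per cache line on one mesh segment."""
    if lines_moved <= 0:
        raise ValueError("lines_moved must be positive")
    return in_use_delta / lines_moved

def bytes_per_in_use_cycle(in_use_delta, lines_moved):
    """Bytes moved per cycle in which the BL segment was marked in use."""
    return CACHE_LINE_BYTES / bl_cycles_per_line(in_use_delta, lines_moved)

# Hypothetical readings: 1,000,000 lines moved across the segment while
# the [HORZ,VERT]_RING_BL_IN_USE count rose by 2,000,000.
in_use_delta = 2_000_000
lines_moved = 1_000_000

print(bl_cycles_per_line(in_use_delta, lines_moved))      # 2.0
print(bytes_per_in_use_cycle(in_use_delta, lines_moved))  # 32.0
```

A result of 32 bytes per in-use cycle is what a 32-byte data path would produce, although the ratio alone describes occupancy rather than physical width.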
As far as I know, the mesh interconnect in KNL and KNM is 64 bytes wide in each direction. Is that correct? I have no access to these processors to run the experiments myself.
Also, the WikiChip article on SKX says that the L2-L3 bus is 64 bytes wide, but I couldn't find an Intel source that confirms this. The bus between the private L2 cache and the per-tile L3 slice is internal to the tile, so at least in theory it could have a different width than the horizontal and vertical rings. But having 32-byte rings and a 64-byte L2-L3 bus may not be very useful, so now I'm more suspicious about what the WikiChip article says. What do you think?
All my measurements show 2 cycles per cache line on the KNL BL mesh -- same as SKX.
For SKX, the optimization manual says that the maximum bandwidth between the L3 and L2 is 16 Bytes/cycle -- no change from Haswell/Broadwell. This does not mean that the interface is 128 bits per cycle! If you compare Table 2-16 with Table 2-6 in the Intel optimization manual, you will see that the earlier table (2-16) says that Haswell/Broadwell have a "peak bandwidth" of 64 Bytes/cycle for the L2, while the newer table (2-6) reports a "max bandwidth" of 32 Bytes per cycle for Haswell/Broadwell, and 64 Bytes/cycle for Skylake Server. It appears that the actual interface is 64 Bytes wide, but that on Haswell/Broadwell, the L1 cache can only accept a line from the L2 every other cycle -- so the peak throughput is 32 Bytes/cycle. This is improved in SKX, but the behavior is also slightly more confusing. It is certainly possible to sustain more than 32 Bytes/cycle from the L2 through the L1 to the core, but the best sustained numbers are typically in the low 40's of Bytes/cycle (Intel claims 52 bytes/cycle, but I have never reached quite that level).
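To make the relationship between these bandwidth figures and line delivery rates concrete, here is a small Python sketch that converts Bytes/cycle into cycles per 64-byte line. The 32, 52, and 64 Bytes/cycle values are the ones quoted above; the 42 Bytes/cycle entry is just an assumed stand-in for "low 40s", and the function name is hypothetical.

```python
CACHE_LINE_BYTES = 64

def cycles_per_line(bytes_per_cycle):
    """Average cycles between 64-byte lines at a given throughput."""
    return CACHE_LINE_BYTES / bytes_per_cycle

cases = [
    ("HSW/BDW L2 max (Table 2-6)", 32),
    ("SKX sustained, assumed 42 for 'low 40s'", 42),
    ("SKX sustained, Intel claim", 52),
    ("SKX L2 max (Table 2-6)", 64),
]

for label, bw in cases:
    print(f"{label}: {bw} B/cycle -> {cycles_per_line(bw):.2f} cycles/line")
```

At 32 Bytes/cycle this gives 2 cycles per line, which matches the "every other cycle" interpretation for Haswell/Broadwell, while 64 Bytes/cycle corresponds to one line per cycle.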
L3-contained BW is challenging to analyze because it is expected to be concurrency-limited, even assuming a narrow L2-L3 interface. Some rough numbers: Latency * BW = 64 cycles * 16 Bytes/cycle = 1024 Bytes = 16 cache lines. Since the L1 can only directly support ~10 misses, any additional concurrency must be provided by L2 HW prefetches, and these have to restart at the beginning of every 4KiB page. Of course the L3 latency is going to be different for every cache line due to address hashing for the distributed L3 slices.
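The rough numbers above are an application of Little's Law (concurrency = latency * bandwidth). A minimal Python sketch of the same arithmetic follows; the 64-cycle latency, 16 Bytes/cycle, and ~10 L1 misses are the figures quoted above, while the function names are hypothetical and a single uniform L3 latency is assumed for simplicity.

```python
import math

CACHE_LINE_BYTES = 64

def lines_in_flight_needed(latency_cycles, bytes_per_cycle):
    """Little's Law: cache lines that must be in flight to sustain the bandwidth."""
    return latency_cycles * bytes_per_cycle / CACHE_LINE_BYTES

def bandwidth_from_concurrency(lines_in_flight, latency_cycles):
    """Bytes/cycle sustainable with a fixed number of outstanding lines."""
    return lines_in_flight * CACHE_LINE_BYTES / latency_cycles

L3_LATENCY_CYCLES = 64   # rough figure from the text
L3_PEAK_BW = 16          # Bytes/cycle, from the optimization manual
L1_MISSES = 10           # approximate L1 miss concurrency from the text

needed = lines_in_flight_needed(L3_LATENCY_CYCLES, L3_PEAK_BW)
extra_from_prefetch = max(0, math.ceil(needed) - L1_MISSES)
l1_only_bw = bandwidth_from_concurrency(L1_MISSES, L3_LATENCY_CYCLES)

print(needed)               # 16.0 lines in flight
print(extra_from_prefetch)  # 6 lines must come from L2 HW prefetch
print(l1_only_bw)           # 10.0 Bytes/cycle with L1 misses alone
```

Under these assumptions, L1 misses alone would cap L3-contained bandwidth at about 10 Bytes/cycle, so the remaining concurrency has to come from the L2 hardware prefetchers.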
The nice folks at FAU work on analytical performance models of cache accesses in Intel processors, e.g., https://blogs.fau.de/hager/files/2018/10/Hager_BrownBag_2018.pdf.
Unfortunately, this information doesn't seem to conclusively answer the question. The [HORZ,VERT]_RING_BL_IN_USE events count cycles in use rather than packets, so it appears to me that we cannot really rule out the possibility that the mesh in SKX and KNL is 64 bytes wide but takes 2 cycles to transfer 64 bytes. An event that counts packets instead of cycles would let us differentiate between the two possible designs. Alternatively, partial uncacheable transactions might show whether a 32-byte transfer still takes 2 cycles or only 1.
Table 2-6 says that the max L3 bandwidth is 16 bytes per cycle on both Broadwell and Skylake server. This indicates that the L2-L3 link has a lower bandwidth than the ring/mesh and that the WikiChip article is probably wrong in that regard. This also suggests that the width of the L2-L3 link is 16 bytes, 32 bytes, or, perhaps less likely, 64 bytes. With all of that information, it's still hard to say for sure what the widths of these interconnects are. What we know for sure is that the width of the ring in KNF is 64 bytes in each direction according to Slide 11 of these Intel slides.
Slide 11 of that Intel presentation refers to Knights Ferry, which used a single (bidirectional) ring rather than a mesh. It makes sense that a ring would have to be significantly overprovisioned to provide adequate bandwidth per core.
The descriptions of the mesh on SKX and KNL appear almost identical. The topology and routing of the mesh lead to highly non-uniform utilization.