What is the data width of the mesh in SKX?

HadiBrais · ‎01-30-2020

Server processors that precede SKX use an on-die ring interconnect that is 32 bytes wide in each direction. SKX and CSL processors use a mesh interconnect, but it's not clear to me whether the data network of the mesh was expanded to 64 bytes per cycle in each direction or remains 32 bytes wide. Is there an Intel source that clarifies this?

McCalpinJohn · ‎01-30-2020

In the SKX Uncore Performance Monitoring Reference manual (https://software.intel.com/en-us/download/intel-xeon-processor-scalable-memory-family-uncore-performance-monitoring-reference-manual), Section 2.2.7.1 notes that on the BL mesh, two transfers are required to move one cache line.

My experiments confirm this -- two uncore cycles and two increments of [HORZ,VERT]_RING_BL_IN_USE for each cache line moved on a mesh segment.

HadiBrais · ‎01-30-2020

Nice.

Just to check with you: as far as I know, the mesh interconnect in KNL and KNM is 64 bytes wide in each direction. Is that correct? I have no access to these processors to do the experiments myself.

HadiBrais · ‎01-30-2020

Also, the WikiChip article on SKX says that the L2-L3 bus is 64 bytes wide, but I didn't find an Intel source that confirms this. The bus between the private L2 cache and the per-tile L3 slice is internal to the tile, so at least in theory, it could have a different width than the horizontal and vertical rings. But having 32-byte rings and a 64-byte L2-L3 bus may not be very useful, so now I'm more suspicious about what the WikiChip article says. What do you think?

McCalpinJohn · ‎01-30-2020

All my measurements show 2 cycles per cache line on the KNL BL mesh -- same as SKX.

For SKX, the optimization manual says that the maximum bandwidth between the L3 and L2 is 16 Bytes/cycle -- no change from Haswell/Broadwell. This does not mean that the interface is 128 bits per cycle! If you compare Table 2-16 with Table 2-6 in the Intel optimization manual, you will see that the earlier table (2-16) says that Haswell/Broadwell have a "peak bandwidth" of 64 Bytes/cycle for the L2, while the newer table (2-6) reports a "max bandwidth" of 32 Bytes per cycle for Haswell/Broadwell, and 64 Bytes/cycle for Skylake Server. It appears that the actual interface is 64 Bytes wide, but that on Haswell/Broadwell, the L1 cache can only accept a line from the L2 every other cycle -- so the peak throughput is 32 Bytes/cycle. This is improved in SKX, but the behavior is also slightly more confusing. It is certainly possible to sustain more than 32 Bytes/cycle from the L2 through the L1 to the core, but the best sustained numbers are typically in the low 40's of Bytes/cycle (Intel claims 52 bytes/cycle, but I have never reached quite that level).

L3-contained BW is challenging to analyze because it is expected to be concurrency-limited, even assuming a narrow L2-L3 interface. Some rough numbers: Latency * BW = 64 cycles * 16 Bytes/cycle = 1024 Bytes = 16 cache lines. Since the L1 can only directly support ~10 misses, any additional concurrency must be provided by L2 HW prefetches, and these have to restart at the beginning of every 4KiB page. Of course the L3 latency is going to be different for every cache line due to address hashing for the distributed L3 slices.
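
The rough numbers above can be reproduced with a few lines of Python. This is only a sketch of the Little's law arithmetic, using the approximate figures quoted in this post (64-cycle latency, 16 Bytes/cycle, ~10 L1 misses); the function and constant names are made up for illustration, and the "L1-only" bandwidth is what the same formula gives under those assumptions, not a measured value.

```python
# Little's law: data in flight = latency * bandwidth.
# All inputs are the rough figures quoted in the post, not measurements.
CACHE_LINE_BYTES = 64


def lines_in_flight(latency_cycles, bytes_per_cycle):
    """Cache lines that must be outstanding to sustain bytes_per_cycle."""
    return latency_cycles * bytes_per_cycle / CACHE_LINE_BYTES


def sustainable_bytes_per_cycle(latency_cycles, outstanding_lines):
    """Bandwidth reachable with a fixed number of outstanding lines."""
    return outstanding_lines * CACHE_LINE_BYTES / latency_cycles


L3_LATENCY_CYCLES = 64        # rough L3 latency used in the post
L3_PEAK_BYTES_PER_CYCLE = 16  # max L3 bandwidth from the optimization manual
L1_OUTSTANDING_MISSES = 10    # "~10 misses" supported directly by the L1

needed = lines_in_flight(L3_LATENCY_CYCLES, L3_PEAK_BYTES_PER_CYCLE)
l1_only = sustainable_bytes_per_cycle(L3_LATENCY_CYCLES, L1_OUTSTANDING_MISSES)
from_prefetch = needed - L1_OUTSTANDING_MISSES

print(f"lines in flight needed for peak: {needed:.0f}")
print(f"Bytes/cycle with L1 misses only: {l1_only:.1f}")
print(f"extra lines the L2 HW prefetchers must supply: {from_prefetch:.0f}")
```

Under these assumptions the L1 misses alone cover 10 of the 16 lines, which is why the remaining concurrency has to come from the L2 hardware prefetchers.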

The nice folks at FAU work on analytical performance models of cache accesses in Intel processors, e.g., https://blogs.fau.de/hager/files/2018/10/Hager_BrownBag_2018.pdf.

HadiBrais · ‎01-30-2020

Unfortunately, this information doesn't seem to conclusively answer the question. The [HORZ,VERT]_RING_BL_IN_USE events don't count packets, but cycles in use, so it appears to me that we cannot really rule out the possibility that the mesh in SKX and KNL is 64 bytes wide but takes 2 cycles to transfer 64 bytes. An event that counts packets instead of cycles would enable us to differentiate between the two possible designs. Alternatively, partial uncacheable transactions might show whether a 32-byte transfer still takes 2 cycles or only 1.

Table 2-6 says that the max L3 bandwidth is 16 bytes per cycle on both Broadwell and Skylake server. This indicates that the L2-L3 link has a lower bandwidth than the ring/mesh and that the WikiChip article is probably wrong in that regard. This also suggests that the width of the L2-L3 link is 16 bytes, 32 bytes, or, perhaps less likely, 64 bytes. With all of that information, it's still hard to say for sure what the widths of these interconnects are. What we know for sure is that the width of the ring in KNF is 64 bytes in each direction according to Slide 11 of these Intel slides.

McCalpinJohn · ‎01-30-2020

Slide 11 of that Intel presentation refers to Knights Ferry, which was a single (bidirectional) ring rather than a mesh. It makes sense that a ring would have to be significantly overprovisioned to provide adequate bandwidth per core.

The descriptions of the mesh on SKX and KNL appear almost identical. The topology and routing of the mesh lead to highly non-uniform utilization.

On a 28-core SKX/CLX system with all cores reading from DRAM, the most heavily utilized links are the "down" links from the two memory controllers. 18/28 of the cores get their data via those links. For DDR4/2933 DRAM, the peak BW is 70.4 GB/s per IMC, so up to 45.2 GB/s is needed on the "down" links of each IMC. At an uncore frequency of 2.6 GHz (the max value on my Xeon Platinum 8280 processors), the peak BW is 2.6*32 = 83.2 GB/s per direction per mesh link, so the two busiest mesh links would be just over 54% active at 100% DRAM read bandwidth.

The KNL configuration places the EDC controllers at the top and bottom, so the Y-X routing forces all MCDRAM loads from cores to use only one of the four possible mesh links that a tile can interface with. For the Xeon Phi 7250 processors at TACC, the maximum MCDRAM read BW I have measured is 368 GB/s, or 46 GB/s per EDC. At the measured uncore frequency of 1.7 GHz, each 32-Byte mesh link should have a peak BW of 54.4 GB/s. This puts the busiest mesh links at just under 85% utilization, which seems reasonable.
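
For anyone who wants to plug in their own frequencies, here is a small Python sketch of the utilization arithmetic above. It assumes a 32-Byte mesh link that moves 32 Bytes per uncore cycle per direction; the helper names are hypothetical, the inputs are the figures quoted in this post, and the count of 8 EDCs is simply what 368 GB/s at 46 GB/s per EDC implies.

```python
# Mesh-link utilization under the assumption of 32 Bytes per uncore cycle
# per direction. Inputs are the figures quoted in the post.
MESH_LINK_BYTES_PER_CYCLE = 32


def link_peak_gbs(uncore_ghz, width_bytes=MESH_LINK_BYTES_PER_CYCLE):
    """Peak bandwidth of one mesh link in one direction, in GB/s."""
    return uncore_ghz * width_bytes


def utilization(demand_gbs, uncore_ghz):
    """Fraction of link cycles busy for a given traffic demand."""
    return demand_gbs / link_peak_gbs(uncore_ghz)


# SKX/CLX: 18 of 28 cores pull data over each IMC's "down" link.
skx_imc_peak_gbs = 70.4                    # DDR4/2933 peak per IMC
skx_demand_gbs = skx_imc_peak_gbs * 18 / 28
skx_util = utilization(skx_demand_gbs, uncore_ghz=2.6)

# KNL: measured MCDRAM read bandwidth split evenly over the EDCs.
knl_edc_count = 8                          # implied by 368 GB/s / 46 GB/s
knl_demand_gbs = 368 / knl_edc_count
knl_util = utilization(knl_demand_gbs, uncore_ghz=1.7)

print(f"SKX: {skx_demand_gbs:.2f} GB/s demand, {skx_util:.1%} of link peak")
print(f"KNL: {knl_demand_gbs:.2f} GB/s demand, {knl_util:.1%} of link peak")
```

If the links were instead 64 Bytes wide per cycle, both utilization figures would halve, so these numbers are only as good as the 32-Byte assumption.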