And on page 2-5, it has a table talking about the speed/latency of Skylake's cache.
It has a column called Fastest Latency. Does it mean that's the smallest number of cycles required for writing a cache line (64 bytes)? Or reading? Or combined? Or?
And what's "Sustained Bandwidth"?
And in real life, how can one measure the latency incurred at each cache level? Say I have a C++ or C program.
Cache latency is always defined for loads, not stores. It is typically defined as the number of cycles between the execution of a load instruction and the first cycle in which a dependent instruction can execute using the data that was loaded.
Intel processors have a stupendously large number of different types of load instructions, with different widths, different alignments, different address computation modes, etc., and these can have different latencies.
The "fastest latency" is typically measured using 32-bit or 64-bit integer loads to general purpose registers. The latency is measured using a "pointer-chasing" loop, in which each load contains the address for the next load, so no out-of-order execution or speculative execution can "hide" (or overlap) the latency each load experiences waiting for the data to be returned from the cache. For data outside of the L1 Data Cache, it is still necessary to do something to prevent the hardware prefetchers from fetching the data in advance and biasing the results. By far the easiest approach is to disable the hardware prefetchers.
I don't think Intel has provided a precise definition of what they mean by "sustained bandwidth" in these tables, but it is clear that they are trying to help users with setting expectations. There are two difficulties that must be understood when trying to understand the maximum sustainable bandwidth of caches.
First, caches don't always have as many ports as one might expect. For example, Chapter 2 of the Optimization Reference Manual has different numbers for the peak L2 bandwidth of Haswell/Broadwell processors in different sections. In the (older) section on the Haswell microarchitecture, the table says that the peak L2 bandwidth is 64 Bytes/cycle. This statement is correct, but has confused many readers. The section on the Skylake Server microarchitecture refers to the maximum L2 bandwidth of Broadwell as 32 Bytes per cycle. This statement is also correct!

The difference can be explained by considering how the L1 Data Cache operates when loading data from the L2 cache. The interface from the L2 cache to L1 cache allows a 64-byte cache line to move from the L2 cache to the L1 data cache in one cycle. (This gives the "peak" value from the Broadwell table.) BUT, in any cycle, the L1 cache can either receive a 64-Byte cache line from the L2 cache OR service two 32 Byte (AVX 256-bit) loads from the core, so the maximum throughput from the L2 cache to the core is 64 Bytes every 2 cycles, or 32 Bytes per cycle. (This gives the "maximum" value reported for Broadwell in the Skylake Server table.)

The "sustained" value for Broadwell in the Skylake Server table is 25 Bytes/cycle. This represents certain conflicts that are not fully explained. One source of such conflicts is the L2 hardware prefetcher. Reading a stream of contiguous addresses from the L2 cache will cause L2 hardware prefetches to be generated (even if there are no L2 misses). The L2 prefetches compete with the loads missing the L1 for access to the L2 cache tags, and sometimes the accesses from the L1 are delayed a cycle to allow the L2 prefetch to go first. The specifics vary dynamically, but it is not uncommon to see 2 cache lines transferred every 5 cycles (instead of 4 cycles), leading to a sustained bandwidth of 2*64/5 = 25.6 Bytes/cycle, very close to the 25 Bytes per cycle reported in the table.
The second issue for "sustained bandwidth" is the complex interplay between loads and stores. In the Haswell/Broadwell microarchitecture, the L1 Data Cache can support very nearly the full bandwidth of two 32-Byte loads plus one 32-Byte store per cycle. The table in the Skylake Server section reports a sustained bandwidth of 93 Bytes/cycle vs a maximum bandwidth of 96 Bytes/cycle. This is extraordinarily good. The Skylake Server processor must support 64-Byte loads and stores instead of 32-Byte loads and stores. The "sustained" L1 Data Cache bandwidth reported is 133 Bytes/cycle out of a 192 Bytes/cycle peak. It is not clear whether this is due to the cache design being intrinsically more difficult, or to modifications in the design (to reduce power or latency) that reduce the ability to overlap loads and stores. My testing suggests that the Skylake Server processors can sustain very close to two 64-Byte loads per cycle (128 Bytes/cycle), but that introducing stores into the mix causes the reduction in efficiency.
Thanks for your detailed response.
I am looking at the table (on page 2-6) for Skylake.
It says, the fastest latency of shared L3 cache is 44 (cycles), and peak bandwidth is 32 bytes/cycle.
So according to your reply, fastest latency "is typically measured using 32-bit or 64-bit integer loads to general purpose registers", and peak bandwidth is measured by copying 1 cache line of data (64 bytes) from somewhere (I assume this can only be main memory??), and I assume this copy doesn't include the latency to copy the same cache line to L2 and then L1, right? If that's the case, it seems like loading 64 bytes of data from main memory to L1 needs 2 cycles (for L3) + 1 cycle (for L2) + 1 cycle (for L1) at peak? That sounds too good to be true. I must be misunderstanding something.
The bandwidth values are average throughput for sequences of load and/or store operations, where the sequence is long enough that the latency of the initial access is negligible compared to the transfer time.
The bandwidth values for various levels of cache assume that the data is already in the corresponding level of the cache when the measurement (or theoretical analysis) is started. The most common approach to measurement is to repeatedly access an array that fits in the target level of cache, but which is larger than the caches closer to the core. The number of repetitions is again chosen so that the time required to load the data from memory the first time is negligible compared to the time required for the transfers.
For testing load bandwidth from the L1 data cache, a typical array size is in the range of 24KiB to 32KiB. (It is usually a good idea to avoid trying to use 100% of the cache, because any unexpected accesses will cause eviction of data that you are expecting to be in the L1 data cache, leading to potentially large variability in the results.) A 24KiB array contains 384 cache lines. The Skylake Server processor can load two cache lines per cycle (using 512-bit loads from the AVX512 instruction set), so the lower bound on the execution time is 192 cycles per repetition. If the data starts in the L1 cache, then the initial latency of about 4 cycles is already negligible compared to the 192 cycle minimum after a single repetition of the loop. Two caveats: (1) The overhead of timing is *not* negligible compared to 192 cycles, so multiple repetitions of the loop will be required to minimize timing overhead. (2) The data will not automatically start in the L1 Data Cache -- you have to either execute code to put it there before starting the test, or you need to run the test enough times that the time required to load the data the first time is negligible.
With respect to timing overhead, the lowest overhead is obtained with inline assembly code containing RDTSC, RDTSCP, or RDPMC instructions. It is critical to remember that RDTSC and RDTSCP return values that increment at the *base* frequency of the processor, while the RDPMC instruction can read events that increment at the *actual* frequency of the processor. Using the "rdpmc_actual_cycles()" function from https://github.com/jdmccalpin/low-overhead-timers on a Xeon Platinum 8160 processor, I measure average overheads of about 18-22 core cycles (depending on the compiler used and whether the interfaces were compiled to be inlined or separately linked). These instructions were slower on earlier processor models (closer to 40 cycles), but it is still trivial to perform enough repetitions of the (minimum) 192-cycle inner loop to make the timer overhead negligible.
With respect to the initial location of the data, you can estimate the time required to load the data into the L1 cache using a "typical" single-core load bandwidth of ~12 GB/s. 24 KiB / 12 GB/s = 2048 ns. This is about 7600 cycles at a maximum single-core Turbo frequency of 3.7 GHz, so you would need to repeat the inner loop (loading 24 KiB) at least 4000 times to ensure that the initial loading of the data took less than 1% of the total time. This seems like a lot of iterations, but 4000 iterations at 192 cycles per iteration is only about 0.0002 seconds at 3.7 GHz, so it should not take long enough to cause boredom.
For L2 bandwidth on Skylake Server, arrays in the range of 64KiB to (less than) 1024KiB are typically used. For arrays larger than 64 KiB, it is a good idea to make sure that your data is on large pages (typically via the Linux Transparent Hugepage mechanism) to avoid L2 cache conflicts due to random page coloring. The analysis of timing and overhead follows the model used above, but with different coefficients for the size and bandwidth. The same ~12 GB/s bandwidth for the initial load from memory can be used.
For L3 bandwidth, arrays larger than the 1MiB L2 and smaller than the shared L3 are required, and it is more important to use large pages. Otherwise the analysis is similar.
McCalpin, John wrote:
The interface from the L2 cache to L1 cache allows a 64-byte cache line to move from the L2 cache to the L1 data cache in one cycle. (This gives the "peak" value from the Broadwell table.) BUT, in any cycle, the L1 cache can either receive a 64-Byte cache line from the L2 cache OR service two 32 Byte (AVX 256-bit) loads from the core, so the maximum throughput from the L2 cache to the core is 64 Bytes every 2 cycles, or 32 Bytes per cycle. (This gives the "maximum" value reported for Broadwell in the Skylake Server table.)
I think your model (and Intel's figures) for the L2 behavior is slightly too pessimistic. I have a slightly refined model that I think better reflects the current hardware:
As you mention, the L1 can either receive a line from the L2 or service (up to 2) requests from the core. However, when the L2 delivers its 64-byte payload to the L1, the core can also satisfy the load that triggered the line to be brought in from the L2, without accessing the L1. At the hardware level this can be understood as the core reading the bytes requested by the load off of the same bus the L2 is using to move the bytes to the L1, without actually accessing the L1. It could be similar to how operands are bypassed from the output of one ALU operation to the input of another, when possible, without going through the register file.
Only one load can be satisfied in this way: if you load the two 32-byte halves of a cache line, the first load can use this mechanism, but the second load has to get its value from the L1 in a later cycle.
This means that in principle, in three cycles you can use 2 cycles to load lines from L2 to L1 (and also satisfy two 32-byte loads, one from each of those lines), and use 1 cycle to read the other two 32-byte halves from the L1. Of course your loads have to be very carefully arranged to achieve this, but in this way you can get close to 1.5 cycles per line, or ~43 bytes per cycle (in reality I have not achieved this figure!).