Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

## Understand the sequential dot product on Skylake processor with DDR4-2666

Beginner

Dear Dr. Bandwidth,

I implemented a sequential vector dot product using naive AVX-512 SIMD intrinsics, with the loop unrolled four times. The dot product computes result = sum(x[i]*y[i]) for i = 1:n (data type: 64-bit double).

Now I find that the execution time for n=10^8 is ~0.10 seconds. I can understand this result from a memory-bandwidth point of view: we load 2 * 8 B * 10^8 = 1.6 GB in 0.1 seconds, giving an effective memory bandwidth of ~16 GB/s, while the theoretical peak is 2666 MT/s * 8 B ≈ 21.3 GB/s. This is within my estimate.

However, when approaching it from another angle, estimating the performance from the real L1/L2/L3/TLB/DRAM latencies, the result turns out to be hard for me to understand.

For example, even if all the data were loaded from the L1 cache, whose latency is 4 cycles, a 3 GHz CPU would already need 2*10^8 * 4 cycles / 3 GHz = 0.27 seconds, which is much longer than the measured execution time of 0.1 s. This estimate does not even account for L2/L3/DRAM/TLB latencies, and the data cannot all reside in the L1 cache when it is first loaded.

How should I understand this 0.1-second execution time in terms of the different components of the memory hierarchy, such as the L1/L2/L3 caches, the TLB, and DRAM? There must be something wrong with my 0.27 s estimate. Where is the flaw?

Thank you so much for your time!

1 Solution
Honored Contributor III

Sustained bandwidth is all about concurrency/parallelism/pipelining.

Recent systems require so much concurrency that they can be very hard to visualize, so I usually use this example from 2005.  In the attached figure, the processor has a peak memory bandwidth of 6.4 GB/s, so it requires 64B/6.4 GB/s = 10ns transfer time on the DRAM interface to move a cache line.  The latency for each transfer is 60ns.  If you want to be receiving data continuously, you need to always have one cache line transfer posted 60ns ago, one 50ns ago, one 40ns ago, one 30ns ago, one 20ns ago, and one 10ns ago.  So there must be six cache misses active and "in flight" at all times to "fill the pipeline" or "tolerate the latency".  This slide is from http://sites.utexas.edu/jdm4372/2016/11/22/sc16-invited-talk-memory-bandwidth-and-system-balance-in-hpc-systems/, where you can find a lot more discussion on latency, bandwidth, and concurrency.....

For your Skylake processor (and most recent processors), accesses to the L1 Data Cache can be fully pipelined (as long as they are independent).  The core+L1DCache can execute two 32-byte loads per cycle, so transferring N cache lines will take N+4 cycles.  For values of "N" that will fit in the cache (up to 512 for a 32KiB cache), these tests are very short, and it is difficult (but not impossible) to detect the extra four cycles required to get the pipeline filled.

Similar pipelining issues apply at the L2 and L3 levels, but with additional complications.  For multi-level transfers, you need to consider that each level of cache that receives a cache line from a higher-numbered level (or memory) also has to *write* the data into the cache.  Sometimes the data can be received by a cache and passed on to the next lower-numbered cache in a single transaction, but sometimes the cache requires two independent accesses --- one to write the data and later another transaction to read it.  (This is the common case with hardware prefetching.)  Vendors typically don't document the details of the cache implementation (number of read ports, write ports, banks, sub-banks, etc), so the added uncertainty about exactly how many transactions are taking place makes it even harder to understand what is going on....

For your Skylake system, the memory latency is probably in the range of 80ns.   With a peak bandwidth of 21.3 GB/s and a latency of 80 ns, the system requires 21.3 GB/s * 80 ns ≈ 1707 Bytes "in flight" to fully tolerate the memory latency, which is about 27 cache lines.  The Skylake core only supports 12 L1 Data Cache misses, but the L2 hardware prefetchers can generate more concurrent reads from memory, up to 20 or 24 cache lines.  Assuming 80ns latency, the 16 GB/s observed bandwidth can be used to infer the average number of cache line transfers in flight: 80ns*16GB/s = 1280 Bytes = 20 cache lines.  This is completely consistent with the (very limited) documentation on the number of buffers available for L2 cache misses.

3 Replies

Beginner

Dear Dr. Bandwidth,

Thank you for your insightful comments. I see your point: latency tolerance is tightly coupled with concurrency, and the L2 hardware prefetchers generate enough concurrent memory reads to hide the latency (Little's Law). You mentioned that the L2 cache supports ~20 outstanding misses according to the very limited documentation. May I know where these documents are, so that I can learn from them?

Thank you very much!!

Honored Contributor III

Information like this is scattered and sometimes only implied by other statements....

Places that I have found implementation information include:

1. "Intel 64 and IA-32 Architectures Optimization Reference Manual", Intel document 248966.   The most recent revision is -043, May 2020.
2. "Intel 64 and IA-32 Architectures Software Developer's Manual", particularly
• Volume 3, Intel document 325384, latest revision -072, May 2020.
• Volume 4, Intel document 335592, latest revision -072, May 2020.
3. The Intel "Uncore Performance Monitoring Reference Manual" for each processor family of interest.
4. Agner Fog's documents from https://www.agner.org/optimize/, particularly
• microarchitecture.pdf
• instruction_tables.pdf
5. Intel's presentations at the Hot Chips conferences (www.hotchips.org)
6. Intel's papers at the International Solid State Circuits Conferences: http://isscc.org/

All of the documents with Intel document numbers are available via links from https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html

Sometimes it takes a lot of cross-referencing the documents to realize that one of them has information that is not contained in the others.  For example, the ISSCC 2018 presentation on the Skylake Xeon is the only presentation I have found that clearly and explicitly describes the number of sets and associativity of the "Snoop Filter" implementation on that chip.

An important source of implied information is the description of performance counters for "occupancy". These counters increment each cycle by the number of entries currently in the queue or buffer being monitored.  If the documentation includes a statement of the maximum number of increments per cycle, that is implicitly the maximum number of entries that the queue or buffer can hold.

Experimentation is often required to make reasonable guesses about the sizes of various buffers.  This is a very advanced topic, requiring a lot of experience, a lot of patience, and a willingness to accept failure....