Understand the sequential vector dot product execution time on Intel Skylake with DDR4-2666 MHz

Peter_Johnson · ‎06-30-2020

Dear Dr. Bandwidth,

I am seeking for your comments again.

I implemented the sequential vector dot product using naive AVX-512 SIMD intrinsics by unrolling the loop by 4 times. Vector dot product is to calculate: result = sum(x[i]*y[i]), i=1:n. (Data type: 64-bit double)

Now I found the execution time when n=10^8 is ~0.10 second. I can understand this result from memory bandwidth point of view. We actually loaded 2*8B * 10*8 in 0.1 second, leading to an effective memory bandwidth at ~16GB/s while the theoretical bandwidth is 2666 MHz * 8B/s = 20.8 GB/s. This is within my estimation.

However, when thinking from another aspect: estimating the performance via the real L1/L2/L3/TLB/DRAM latency, the result turns to be hard to understand for me.

For example, even though all the data are loaded from the L1 cache whose latency is 4ns, assuming a 3GHz CPU frequency, it already costs 2*10^8 * 4cycles/3GHz=0.27second, which is much longer than the experimental execution time 0.1s. This estimation does not yet consider L2/L3/DRAM/TLB latency yet and it's impossible that all the data start to reside in L1 cache when being loaded.

May I know how to understand this 0.1second execution time with respect to different components on the memory hierarchy, such as L1/L2/L3 cache, TLB and DRAM? It must be something wrong in my 0.27s-estimation. May I know where the flaw is?

Thank you so much for your time!

@McCalpinJohn