Software Tuning, Performance Optimization & Platform Monitoring
Discussion around monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform monitoring

Understanding sequential vector dot product execution time on Intel Skylake with DDR4-2666


Dear Dr. Bandwidth,

I am seeking your comments again.

I implemented a sequential vector dot product using naive AVX-512 SIMD intrinsics, unrolling the loop 4 times. The vector dot product computes: result = sum(x[i]*y[i]), i = 1..n. (Data type: 64-bit double)

Now I found that the execution time for n = 10^8 is ~0.10 seconds. I can understand this result from a memory-bandwidth point of view: we load 2 * 8 B * 10^8 = 1.6 GB in 0.1 s, giving an effective memory bandwidth of ~16 GB/s, while the theoretical bandwidth is 2666 MT/s * 8 B = 21.3 GB/s. This is within my estimation.

However, when approaching it from another angle, estimating the performance from the real L1/L2/L3/TLB/DRAM latencies, the result becomes hard for me to understand.

For example, even if all the data were loaded from the L1 cache, whose load-to-use latency is 4 cycles, then assuming a 3 GHz CPU frequency the loads alone would already cost 2*10^8 * 4 cycles / 3 GHz = 0.27 seconds, which is much longer than the measured execution time of 0.1 s. This estimate does not even include L2/L3/DRAM/TLB latency, and it is impossible for all the data to already reside in the L1 cache when first loaded.

May I know how to understand this 0.1-second execution time with respect to the different components of the memory hierarchy, such as the L1/L2/L3 caches, the TLB, and DRAM? There must be something wrong in my 0.27 s estimate. May I know where the flaw is?

Thank you so much for your time!

