topic Understand the sequential vector dot product execution time on Intel Skylake with DDR4-2666 MHz in Software Tuning, Performance Optimization & Platform Monitoring
https://community.intel.com/t5/Software-Tuning-Performance/Understand-the-sequential-vector-dot-product-execution-time-on/m-p/1188289#M7543
<P>Dear Dr. Bandwidth,</P>
<P>I am seeking your comments again.</P>
<P>I implemented a sequential vector dot product using naive AVX-512 SIMD intrinsics, unrolling the loop 4 times. The dot product computes: result = sum(x[i]*y[i]), i = 1:n. (Data type: 64-bit double.)</P>
<P>I found that the execution time for n = 10^8 is ~0.10 second. I can understand this result from a memory-bandwidth point of view: we load 2 × 8 B × 10^8 = 1.6 GB in 0.1 s, giving an effective bandwidth of ~16 GB/s, while the theoretical peak for one DDR4-2666 channel is 2666 MT/s × 8 B = 21.3 GB/s. This is consistent with my expectation.</P>
<P>However, when approaching it from another angle, estimating the performance from the real L1/L2/L3/TLB/DRAM latencies, the result becomes hard for me to understand.</P>
<P>For example, even if all the data were loaded from the L1 cache, whose latency is 4 cycles, then assuming a 3 GHz CPU frequency the loads alone would cost 2 × 10^8 × 4 cycles / 3 GHz ≈ 0.27 second, which is much longer than the measured execution time of 0.1 s. This estimate does not even account for L2/L3/DRAM/TLB latency, and it is impossible for all the data to already reside in the L1 cache when first loaded.</P>
<P>May I know how to understand this 0.1 second execution time with respect to the different components of the memory hierarchy, such as the L1/L2/L3 caches, the TLB, and DRAM? There must be something wrong in my 0.27 s estimate. May I know where the flaw is?</P>
<P>Thank you so much for your time!</P>
<P><LI-USER uid="89357"></LI-USER></P>
<P>Peter_Johnson, Tue, 30 Jun 2020 08:36:08 GMT</P>