<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Understand the sequential vector dot product execution time on Intel Skylake with DDR4-2666 MHz in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Understand-the-sequential-vector-dot-product-execution-time-on/m-p/1188289#M7543</link>
    <description>&lt;P&gt;Dear Dr. Bandwidth,&lt;/P&gt;
&lt;P&gt;I am seeking your comments again.&lt;/P&gt;
&lt;P&gt;I implemented a sequential vector dot product using naive AVX-512 SIMD intrinsics, with the loop unrolled 4 times. The dot product computes result = sum(x[i]*y[i]), i=1:n (data type: 64-bit double).&lt;/P&gt;
&lt;P&gt;I found that the execution time for n=10^8 is ~0.10 seconds. I can understand this result from a memory bandwidth point of view: we load 2 * 8 B * 10^8 = 1.6 GB in 0.1 seconds, giving an effective memory bandwidth of ~16 GB/s, while the theoretical bandwidth is 2666 MT/s * 8 B = 21.3 GB/s. This is in line with my estimate.&lt;/P&gt;
&lt;P&gt;However, when I approach it from another angle, estimating the performance from the actual L1/L2/L3/TLB/DRAM latencies, the result becomes hard for me to understand.&lt;/P&gt;
&lt;P&gt;For example, even if all the data were loaded from the L1 cache, whose latency is 4 cycles, then at a 3 GHz CPU frequency the loads alone would cost 2*10^8 * 4 cycles / 3 GHz = 0.27 seconds, which is much longer than the measured execution time of 0.1 s. This estimate does not yet account for L2/L3/DRAM/TLB latency, and the data cannot all reside in the L1 cache to begin with.&lt;/P&gt;
&lt;P&gt;How should I understand this 0.1-second execution time in terms of the different components of the memory hierarchy, such as the L1/L2/L3 caches, TLB, and DRAM? There must be something wrong with my 0.27 s estimate. Where is the flaw?&lt;/P&gt;
&lt;P&gt;Thank you so much for your time!&lt;/P&gt;
&lt;P&gt;&lt;a href="https://community.intel.com/t5/user/viewprofilepage/user-id/89357"&gt;@McCalpinJohn&lt;/a&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Tue, 30 Jun 2020 08:36:08 GMT</pubDate>
    <dc:creator>Peter_Johnson</dc:creator>
    <dc:date>2020-06-30T08:36:08Z</dc:date>
    <item>
      <title>Understand the sequential vector dot product execution time on Intel Skylake with DDR4-2666 MHz</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Understand-the-sequential-vector-dot-product-execution-time-on/m-p/1188289#M7543</link>
      <description>&lt;P&gt;Dear Dr. Bandwidth,&lt;/P&gt;
&lt;P&gt;I am seeking your comments again.&lt;/P&gt;
&lt;P&gt;I implemented a sequential vector dot product using naive AVX-512 SIMD intrinsics, with the loop unrolled 4 times. The dot product computes result = sum(x[i]*y[i]), i=1:n (data type: 64-bit double).&lt;/P&gt;
&lt;P&gt;I found that the execution time for n=10^8 is ~0.10 seconds. I can understand this result from a memory bandwidth point of view: we load 2 * 8 B * 10^8 = 1.6 GB in 0.1 seconds, giving an effective memory bandwidth of ~16 GB/s, while the theoretical bandwidth is 2666 MT/s * 8 B = 21.3 GB/s. This is in line with my estimate.&lt;/P&gt;
&lt;P&gt;However, when I approach it from another angle, estimating the performance from the actual L1/L2/L3/TLB/DRAM latencies, the result becomes hard for me to understand.&lt;/P&gt;
&lt;P&gt;For example, even if all the data were loaded from the L1 cache, whose latency is 4 cycles, then at a 3 GHz CPU frequency the loads alone would cost 2*10^8 * 4 cycles / 3 GHz = 0.27 seconds, which is much longer than the measured execution time of 0.1 s. This estimate does not yet account for L2/L3/DRAM/TLB latency, and the data cannot all reside in the L1 cache to begin with.&lt;/P&gt;
&lt;P&gt;How should I understand this 0.1-second execution time in terms of the different components of the memory hierarchy, such as the L1/L2/L3 caches, TLB, and DRAM? There must be something wrong with my 0.27 s estimate. Where is the flaw?&lt;/P&gt;
&lt;P&gt;Thank you so much for your time!&lt;/P&gt;
&lt;P&gt;&lt;a href="https://community.intel.com/t5/user/viewprofilepage/user-id/89357"&gt;@McCalpinJohn&lt;/a&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 30 Jun 2020 08:36:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Understand-the-sequential-vector-dot-product-execution-time-on/m-p/1188289#M7543</guid>
      <dc:creator>Peter_Johnson</dc:creator>
      <dc:date>2020-06-30T08:36:08Z</dc:date>
    </item>
  </channel>
</rss>

