Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Intel Community
- Software Development Topics
- Software Tuning, Performance Optimization & Platform Monitoring
- Understand the sequential vector dot product execution time on Intel Skylake with DDR4-2666 MHz

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

Peter_Johnson

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

06-30-2020
01:25 AM

66 Views

Understand the sequential vector dot product execution time on Intel Skylake with DDR4-2666 MHz

Dear Dr. Bandwidth,

I am seeking for your comments again.

I implemented the sequential vector dot product using naive AVX-512 SIMD intrinsics by unrolling the loop by 4 times. Vector dot product is to calculate: result = sum(x[i]*y[i]), i=1:n. (Data type: 64-bit double)

Now I found the execution time when n=10^8 is ~0.10 second. I can understand this result from memory bandwidth point of view. We actually loaded 2*8B * 10*8 in 0.1 second, leading to an effective memory bandwidth at ~16GB/s while the theoretical bandwidth is 2666 MHz * 8B/s = 20.8 GB/s. This is within my estimation.

However, when thinking from another aspect: estimating the performance via the real L1/L2/L3/TLB/DRAM latency, the result turns to be hard to understand for me.

For example, even though all the data are loaded from the L1 cache whose latency is 4ns, assuming a 3GHz CPU frequency, it already costs 2*10^8 * 4cycles/3GHz=0.27second, which is much longer than the experimental execution time 0.1s. This estimation does not yet consider L2/L3/DRAM/TLB latency yet and it's impossible that all the data start to reside in L1 cache when being loaded.

May I know how to understand this 0.1second execution time with respect to different components on the memory hierarchy, such as L1/L2/L3 cache, TLB and DRAM? It must be something wrong in my 0.27s-estimation. May I know where the flaw is?

Thank you so much for your time!

For more complete information about compiler optimizations, see our Optimization Notice.