The application I want to speed up performs element-wise processing of a large array (about 1e8 elements).
The processing of each element is very simple, so I suspect that the bottleneck is DRAM bandwidth rather than the CPU.
I therefore decided to study a single-threaded version first.
I got the following result.
As far as I understand the Summary page, the situation is not very good.
The article https://software.intel.com/en-us/articles/finding-your-memory-access-performance-bottlenecks says that the reason is so-called false sharing. But I do not use multithreading; all processing is performed by a single thread.
On the other hand, according to the Platform page, DRAM bandwidth is not the bottleneck.
So my question is: what is the reason for the bad memory metric values?
Thank you
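For readers who want to reproduce this kind of measurement, here is a minimal sketch of a single-threaded element-wise kernel together with an effective-bandwidth estimate. The attached test program is not shown in the thread, so `process`, `run_kernel`, `effective_gb_per_s`, and the `double` element type are all hypothetical stand-ins, not the original code.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical per-element operation; the real one is not shown in the post.
static inline double process(double x) { return x * 1.0001 + 0.5; }

// Applies process() to every element in place and returns the elapsed seconds.
double run_kernel(std::vector<double>& data) {
    const auto start = std::chrono::steady_clock::now();
    for (double& x : data) {
        x = process(x);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Assumes each element is read once and written once, so the traffic is
// 2 * sizeof(double) bytes per element. Returns GB/s (1 GB = 1e9 bytes).
double effective_gb_per_s(std::size_t n, double seconds) {
    return (2.0 * sizeof(double) * static_cast<double>(n)) / seconds / 1e9;
}
```

Calling `run_kernel` on a `std::vector<double>` of about 1e8 elements and passing the elapsed time to `effective_gb_per_s` gives a figure that can be compared with the DRAM bandwidth shown on the Platform page. If the figure is far below what the memory system can deliver, bandwidth alone is unlikely to be the limit.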
Interpreting any performance result is much easier if we know something about the hardware.
The CPU time of 2.579 seconds is also very short; longer tests are typically more reliable.
The timeline across the bottom of the picture appears to have had its labels cut off. What value is being plotted here?
Thank you for the quick reply.
1. My system is: Windows 10 64-bit, 32 GB RAM, Intel Core i7-3770S (Ivy Bridge), 1.10 GHz, 4 cores, Hyper-Threading enabled.
2. I have increased the CPU time by running the main loop several times.
Concurrency analysis gives the following results:
Elapsed Time: 41.968s
CPU Time: 22.863s
Effective Time: 22.863s
Idle: 0.001s
Poor: 22.862s
Ok: 0s
Ideal: 0s
Over: 0s
Spin Time: 0s
Overhead Time: 0s
Wait Time: 0.000s
Total Thread Count: 2
Paused Time: 18.057s
Memory Access analysis reports a different CPU time in each of three consecutive runs on the same amount of data.
The actual execution time was about 23 seconds, as the Concurrency analysis reports.
Elapsed Time: 40.744s
CPU Time: 8.303s
Memory Bound: 34.4%
L1 Bound: 10.8%
L2 Bound: 0.0%
L3 Bound: 0.1%
DRAM Bound: 16.8%
Memory Bandwidth: 22.7%
Memory Latency: 72.4%
Loads: 17,385,660,000
Stores: 7,386,000,000
LLC Miss Count: 35,280,000
Average Latency (cycles): 21
Total Thread Count: 5
Paused Time: 17.292s
Elapsed Time: 40.094s
CPU Time: 4.617s
Memory Bound: 33.9%
L1 Bound: 9.7%
L2 Bound: 0.0%
L3 Bound: 0.1%
DRAM Bound: 16.4%
Memory Bandwidth: 72.9%
Memory Latency: 20.0%
Loads: 9,957,780,000
Stores: 4,231,200,000
LLC Miss Count: 19,740,000
Average Latency (cycles): 21
Total Thread Count: 5
Paused Time: 17.355s
Elapsed Time: 40.549s
CPU Time: 2.365s
Memory Bound: 33.0%
L1 Bound: 11.8%
L2 Bound: 0.0%
L3 Bound: 0.1%
DRAM Bound: 16.1%
Memory Bandwidth: 70.5%
Memory Latency: 27.1%
Loads: 4,605,780,000
Stores: 1,969,200,000
LLC Miss Count: 9,780,000
Average Latency (cycles): 22
Total Thread Count: 5
Paused Time: 17.433s
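One way to compare the three Memory Access runs is to normalise the raw counts. The sketch below uses only the figures quoted above; the `Run` struct, `kRuns`, and the helper names are hypothetical. If the ratios stay roughly constant while the totals shrink with the reported CPU time, that would suggest (as an assumption, not a confirmed diagnosis) that the runs differ in how much of the execution was captured rather than in how the workload behaved.

```cpp
#include <cstdio>

// One Memory Access analysis run, copied from the summaries above.
struct Run {
    double cpu_s;       // CPU Time in seconds
    double loads;       // Loads
    double stores;      // Stores
    double llc_misses;  // LLC Miss Count
};

const Run kRuns[3] = {
    {8.303, 17385660000.0, 7386000000.0, 35280000.0},
    {4.617, 9957780000.0, 4231200000.0, 19740000.0},
    {2.365, 4605780000.0, 1969200000.0, 9780000.0},
};

double loads_per_second(const Run& r) { return r.loads / r.cpu_s; }
double loads_per_store(const Run& r) { return r.loads / r.stores; }
double loads_per_llc_miss(const Run& r) { return r.loads / r.llc_misses; }

// Prints the three normalised ratios for each run, one run per line.
void print_ratios() {
    for (const Run& r : kRuns) {
        std::printf("loads/s = %.3e, loads/store = %.3f, loads/LLC miss = %.1f\n",
                    loads_per_second(r), loads_per_store(r), loads_per_llc_miss(r));
    }
}
```

Calling `print_ratios()` shows the ratios side by side, which makes it easier to tell whether the runs disagree about the workload itself or only about its measured duration.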
3. On the timeline chart:
The upper chart shows the thread running state (green) and CPU time (brown).
The lower chart shows DRAM bandwidth (brown: total, green: read, red: write).
Ayrat,
Could you please run a General Exploration analysis and check whether you are back-end bound? Also, if you have an IPS account, you may wish to send us your code snippets so that we can analyse them on our end.
Regards,
Shailen
Here is the Summary Page of General Exploration:
Elapsed Time: 36.789s
Clockticks: 50,970,200,000
Instructions Retired: 55,040,500,000
CPI Rate: 0.926
MUX Reliability: 0.937
Front-End Bound: 1.6%
Front-End Latency: 0.8%
ICache Misses: 0.1%
ITLB Overhead: 0.0%
Branch Resteers: 0.1%
DSB Switches: 0.0%
Length Changing Prefixes: 0.0%
MS Switches: 1.0%
Front-End Bandwidth: 0.8%
Front-End Bandwidth MITE: 1.6%
Front-End Bandwidth DSB: 0.7%
Front-End Bandwidth LSD: 0.0%
Bad Speculation: 0.2%
Branch Mispredict: 0.0%
Machine Clears: 0.2%
Back-End Bound: 73.1%
Memory Bound: 50.7%
L1 Bound: 6.5%
DTLB Overhead: 1.1%
Loads Blocked by Store Forwarding: 0.1%
Lock Latency: 0.0%
Split Loads: 0.0%
4K Aliasing: 4.0%
FB Full: 0.0%
L2 Bound: 0.0%
L3 Bound: 0.1%
Contested Accesses: 0.0%
Data Sharing: 0.0%
L3 Latency: 0.4%
SQ Full: 0.4%
DRAM Bound: 24.6%
Memory Bandwidth: 29.5%
Memory Latency: 55.0%
LLC Miss: 28.4%
Store Bound: 11.9%
Store Latency: 2.9%
False Sharing: 0.0%
Split Stores: 0.0%
DTLB Store Overhead: 0.4%
Core Bound: 22.4%
Divider: 46.8%
Port Utilization: 18.9%
Cycles of 0 Ports Utilized: 33.0%
Cycles of 1 Port Utilized: 16.7%
Cycles of 2 Ports Utilized: 18.9%
Cycles of 3+ Ports Utilized: 18.1%
Port 0: 27.4%
Port 1: 16.6%
Port 2: 29.5%
Port 3: 29.6%
Port 4: 27.3%
Port 5: 3.8%
Retiring: 25.0%
General Retirement: 24.4%
FP Arithmetic: 39.5%
FP x87: 0.0%
FP Scalar: 39.5%
FP Vector: 0.0%
Other: 60.5%
Microcode Sequencer: 0.7%
Assists: 0.0%
Total Thread Count: 2
Paused Time: 22.473s
I have attached my test program to the post. I am using Visual C++ (Visual Studio 2015).