
Results Understanding - Naive Question


The application I want to speed up performs element-wise processing of a large array (about 1e8 elements).
The processing of each element is very simple, so I suspect that the bottleneck is DRAM bandwidth rather than the CPU.
I therefore decided to study the single-threaded version first.
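
One way to test the DRAM-bandwidth suspicion directly is to time a single pass over the array and convert it to an effective bandwidth. The sketch below is only an illustration, not the poster's code: `process_element` is a hypothetical stand-in for the real per-element work, and the byte count assumes an in-place update of `double` elements (one read and one write per element).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Bytes moved per second, expressed in GB/s (1 GB = 1e9 bytes).
double effective_bandwidth_gbs(double bytes_moved, double seconds) {
    assert(seconds > 0.0);
    return bytes_moved / seconds / 1e9;
}

// Hypothetical stand-in for the per-element work; the real kernel is not shown in the post.
inline double process_element(double x) {
    return x * 1.0001 + 1.0;
}

// Runs one in-place pass over the array and returns the elapsed wall time in seconds.
double time_elementwise_pass(std::vector<double>& data) {
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < data.size(); ++i) {
        data[i] = process_element(data[i]);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Assumes one read and one write of every element: 2 * n * sizeof(double) bytes.
double bytes_moved_in_place(std::size_t n) {
    return 2.0 * static_cast<double>(n) * sizeof(double);
}
```

Calling `time_elementwise_pass` on a vector of about 1e8 elements and passing the result, together with `bytes_moved_in_place(data.size())`, to `effective_bandwidth_gbs` gives a figure that can be compared with the platform's peak DRAM bandwidth. A value far below the peak would suggest that something other than bandwidth limits the loop.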

I got the following result:

[Screenshots: p1.png, p2.png]

As far as I understand the Summary page, the situation is not very good.
The paper says that the cause is so-called false sharing, but I do not use multithreading; all processing is performed by a single thread.
On the other hand, according to the Platform page, DRAM bandwidth is not the bottleneck.

So my question is: what causes the bad memory metric values?

Thank you

4 Replies

Interpreting any performance result is much easier if we know something about the hardware....

The CPU time of 2.579 seconds is also very short -- longer tests are typically more reliable.

The timeline across the bottom of the picture appears to have had its labels cut off --- what value is being plotted here?


Thank you for the quick reply.

1. My system is: Windows 10 64-bit, 32 GB RAM, Intel Core i7-3770S (Ivy Bridge) 1.10 GHz, 4 cores, Hyper-Threading enabled.

2. I increased the CPU time by running the main loop several times.
Concurrency analysis gives the following results:

Elapsed Time: 41.968s
    CPU Time: 22.863s
        Effective Time: 22.863s
            Idle: 0.001s
            Poor: 22.862s
            Ok: 0s
            Ideal: 0s
            Over: 0s
        Spin Time: 0s
        Overhead Time: 0s
    Wait Time: 0.000s
    Total Thread Count: 2
    Paused Time: 18.057s


Memory Access analysis reports a different CPU time on each of three consecutive runs over the same amount of data.
The actual execution time was about 23 seconds, as the Concurrency analysis reports.

Elapsed Time: 40.744s
    CPU Time: 8.303s
    Memory Bound: 34.4%
        L1 Bound: 10.8%
        L2 Bound: 0.0%
        L3 Bound: 0.1%
        DRAM Bound: 16.8%
            Memory Bandwidth: 22.7%
            Memory Latency: 72.4%
    Loads: 17,385,660,000
    Stores: 7,386,000,000
    LLC Miss Count: 35,280,000
    Average Latency (cycles): 21
    Total Thread Count: 5
    Paused Time: 17.292s

Elapsed Time: 40.094s
    CPU Time: 4.617s
    Memory Bound: 33.9%
        L1 Bound: 9.7%
        L2 Bound: 0.0%
        L3 Bound: 0.1%
        DRAM Bound: 16.4%
            Memory Bandwidth: 72.9%
            Memory Latency: 20.0%
    Loads: 9,957,780,000
    Stores: 4,231,200,000
    LLC Miss Count: 19,740,000
    Average Latency (cycles): 21
    Total Thread Count: 5
    Paused Time: 17.355s

Elapsed Time: 40.549s
    CPU Time: 2.365s
    Memory Bound: 33.0%
        L1 Bound: 11.8%
        L2 Bound: 0.0%
        L3 Bound: 0.1%
        DRAM Bound: 16.1%
            Memory Bandwidth: 70.5%
            Memory Latency: 27.1%
    Loads: 4,605,780,000
    Stores: 1,969,200,000
    LLC Miss Count: 9,780,000
    Average Latency (cycles): 22
    Total Thread Count: 5
    Paused Time: 17.433s

3. On the timeline chart:
The upper chart shows the thread running state (green) and CPU time (brown).
The lower one shows DRAM bandwidth (brown: total, green: read, red: write).
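
The three Memory Access summaries in item 2 can be compared by normalizing the raw event counts instead of reading the CPU times directly. The sketch below is a rough cross-check under two assumptions that the post does not confirm: that the cache line is 64 bytes, and that LLC Miss Count covers demand load misses only, so hardware-prefetch and write-back traffic is not included.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 64-byte cache lines, as on Intel Core processors of this generation.
constexpr double kCacheLineBytes = 64.0;

// DRAM traffic implied by the reported LLC misses, in GB/s (1 GB = 1e9 bytes).
// Hardware-prefetch and write-back traffic is not included in this estimate.
double llc_miss_traffic_gbs(std::uint64_t llc_misses, double cpu_seconds) {
    assert(cpu_seconds > 0.0);
    return static_cast<double>(llc_misses) * kCacheLineBytes / cpu_seconds / 1e9;
}

// Ratio of two event counts, e.g. loads per store.
double event_ratio(std::uint64_t numerator, std::uint64_t denominator) {
    assert(denominator != 0);
    return static_cast<double>(numerator) / static_cast<double>(denominator);
}
```

With the first run's numbers, `llc_miss_traffic_gbs(35280000, 8.303)` comes out at roughly 0.27 GB/s, and `event_ratio(17385660000, 7386000000)` at roughly 2.35 loads per store. The other two runs give similar values, which suggests (but does not prove) that the event counts scale with the reported CPU time and that the three runs measured the same code.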



Could you please run a General Exploration analysis and check whether you are back-end bound? Also, if you have an IPS account, you may wish to send us your code snippets so that we can analyze them on our end.



Here is the Summary page of General Exploration:

Elapsed Time: 36.789s
    Clockticks: 50,970,200,000
    Instructions Retired: 55,040,500,000
    CPI Rate: 0.926
    MUX Reliability: 0.937
    Front-End Bound: 1.6%
        Front-End Latency: 0.8%
            ICache Misses: 0.1%
            ITLB Overhead: 0.0%
            Branch Resteers: 0.1%
            DSB Switches: 0.0%
            Length Changing Prefixes: 0.0%
            MS Switches: 1.0%
        Front-End Bandwidth: 0.8%
            Front-End Bandwidth MITE: 1.6%
            Front-End Bandwidth DSB: 0.7%
            Front-End Bandwidth LSD: 0.0%
    Bad Speculation: 0.2%
        Branch Mispredict: 0.0%
        Machine Clears: 0.2%
    Back-End Bound: 73.1%
        Memory Bound: 50.7%
            L1 Bound: 6.5%
                DTLB Overhead: 1.1%
                Loads Blocked by Store Forwarding: 0.1%
                Lock Latency: 0.0%
                Split Loads: 0.0%
                4K Aliasing: 4.0%
                FB Full: 0.0%
            L2 Bound: 0.0%
            L3 Bound: 0.1%
                Contested Accesses: 0.0%
                Data Sharing: 0.0%
                L3 Latency: 0.4%
                SQ Full: 0.4%
            DRAM Bound: 24.6%
                Memory Bandwidth: 29.5%
                Memory Latency: 55.0%
                    LLC Miss: 28.4%
            Store Bound: 11.9%
                Store Latency: 2.9%
                False Sharing: 0.0%
                Split Stores: 0.0%
                DTLB Store Overhead: 0.4%
        Core Bound: 22.4%
            Divider: 46.8%
            Port Utilization: 18.9%
                Cycles of 0 Ports Utilized: 33.0%
                Cycles of 1 Port Utilized: 16.7%
                Cycles of 2 Ports Utilized: 18.9%
                Cycles of 3+ Ports Utilized: 18.1%
                    Port 0: 27.4%
                    Port 1: 16.6%
                    Port 2: 29.5%
                    Port 3: 29.6%
                    Port 4: 27.3%
                    Port 5: 3.8%
    Retiring: 25.0%
        General Retirement: 24.4%
            FP Arithmetic: 39.5%
                FP x87: 0.0%
                FP Scalar: 39.5%
                FP Vector: 0.0%
            Other: 60.5%
        Microcode Sequencer: 0.7%
            Assists: 0.0%
    Total Thread Count: 2
    Paused Time: 22.473s
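
The headline numbers in this summary can be cross-checked by hand. The sketch below assumes that CPI Rate is simply Clockticks divided by Instructions Retired and that the four top-level categories are meant to sum to about 100%; the function names are illustrative and not part of any VTune API.

```cpp
#include <cassert>
#include <cstdint>

// Cycles per instruction: clockticks divided by retired instructions.
double cycles_per_instruction(std::uint64_t clockticks, std::uint64_t instructions) {
    assert(instructions != 0);
    return static_cast<double>(clockticks) / static_cast<double>(instructions);
}

// Sum of the four top-level categories, in percent.
double top_level_sum(double front_end, double bad_speculation,
                     double back_end, double retiring) {
    return front_end + bad_speculation + back_end + retiring;
}
```

Calling `cycles_per_instruction(50970200000, 55040500000)` gives about 0.926, matching the reported CPI Rate, and `top_level_sum(1.6, 0.2, 73.1, 25.0)` gives 99.9. Memory Bound (50.7%) plus Core Bound (22.4%) likewise add up to the 73.1% reported for Back-End Bound.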

I have attached my test program to the post. I am using Visual C++ (Visual Studio 2015).
