Results Understanding - Naive Question

Ayrat_S_ · ‎11-02-2016

The application I want to speed up performs element-wise processing of a large array (about 1e8 elements).
The processing of each element is very simple, so I suspect the bottleneck may be DRAM bandwidth rather than the CPU.
So I decided to study a single-threaded version first.
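
For reference, a single-threaded test of this kind might look like the minimal sketch below. The per-element operation `process` and all helper names are assumptions made for illustration, not the actual application code. The bandwidth estimate assumes one 8-byte read and one 8-byte write per element.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical per-element operation; the real one is application-specific.
inline double process(double x) { return 0.5 * x + 1.0; }

// Apply process() to every element of `in`, writing the results to `out`.
void transform_array(const std::vector<double>& in, std::vector<double>& out) {
    assert(in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = process(in[i]);
    }
}

// Estimated GB/s, assuming one read and one write of a double per element.
double effective_bandwidth_gbs(std::size_t n, double seconds) {
    const double bytes = 2.0 * sizeof(double) * static_cast<double>(n);
    return bytes / seconds / 1e9;
}

// Time one pass over the arrays and return the elapsed wall-clock seconds.
double time_one_pass(const std::vector<double>& in, std::vector<double>& out) {
    const auto t0 = std::chrono::steady_clock::now();
    transform_array(in, out);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Comparing `effective_bandwidth_gbs(n, time_one_pass(in, out))` for n around 1e8 with the DRAM bandwidth shown on the Platform page gives a rough idea of how close the loop is to the bandwidth limit. With ordinary (write-allocate) stores the destination lines are usually read as well, so the real DRAM traffic can be higher than this estimate.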

I got the following result.
As far as I understand the Summary page, the situation is not very good.
The article https://software.intel.com/en-us/articles/finding-your-memory-access-performance-bottlenecks says that the reason is so-called false sharing. But I do not use multithreading; all processing is performed by a single thread.
On the other hand, according to the Platform page, DRAM bandwidth is not the bottleneck.

So my question is: what is the reason for the bad memory metric values?

Thank you

McCalpinJohn · ‎11-02-2016

Interpreting any performance result is much easier if we know something about the hardware....

The CPU time of 2.579 seconds is also very short -- longer tests are typically more reliable.

The timeline across the bottom of the picture appears to have had its labels cut off --- what value is being plotted here?

Ayrat_S_ · ‎11-02-2016

Thank you for the quick reply.

1. My system: Windows 10 64-bit, 32 GB RAM, Intel Core i7-3770S (Ivy Bridge) 1.10 GHz, 4 cores, Hyper-Threading enabled

2. I have increased the CPU time by running the main loop several times.
Concurrency Analysis gives the following results:

Elapsed Time: 41.968s
    CPU Time: 22.863s
        Effective Time: 22.863s
            Idle: 0.001s
            Poor: 22.862s
            Ok: 0s
            Ideal: 0s
            Over: 0s
        Spin Time: 0s
        Overhead Time: 0s
    Wait Time: 0.000s
    Total Thread Count: 2
    Paused Time: 18.057s

Memory Access Analysis reports different CPU times. Below are three consecutive runs on the same amount of data.
The actual execution time was about 23 seconds, as Concurrency Analysis reports.

Elapsed Time: 40.744s
    CPU Time: 8.303s
    Memory Bound: 34.4%
        L1 Bound: 10.8%
        L2 Bound: 0.0%
        L3 Bound: 0.1%
        DRAM Bound: 16.8%
            Memory Bandwidth: 22.7%
            Memory Latency: 72.4%
    Loads: 17,385,660,000
    Stores: 7,386,000,000
    LLC Miss Count: 35,280,000
    Average Latency (cycles): 21
    Total Thread Count: 5
    Paused Time: 17.292s

Elapsed Time: 40.094s
    CPU Time: 4.617s
    Memory Bound: 33.9%
        L1 Bound: 9.7%
        L2 Bound: 0.0%
        L3 Bound: 0.1%
        DRAM Bound: 16.4%
            Memory Bandwidth: 72.9%
            Memory Latency: 20.0%
    Loads: 9,957,780,000
    Stores: 4,231,200,000
    LLC Miss Count: 19,740,000
    Average Latency (cycles): 21
    Total Thread Count: 5
    Paused Time: 17.355s

Elapsed Time: 40.549s
    CPU Time: 2.365s
    Memory Bound: 33.0%
        L1 Bound: 11.8%
        L2 Bound: 0.0%
        L3 Bound: 0.1%
        DRAM Bound: 16.1%
            Memory Bandwidth: 70.5%
            Memory Latency: 27.1%
    Loads: 4,605,780,000
    Stores: 1,969,200,000
    LLC Miss Count: 9,780,000
    Average Latency (cycles): 22
    Total Thread Count: 5
    Paused Time: 17.433s
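
One way to compare the three Memory Access runs is to normalize the counters, since the raw counts differ by almost 4x for the same workload. The sketch below uses only the numbers reported above; the struct and function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdio>

// Counters copied from one Memory Access summary above.
struct RunSummary {
    double cpu_time_s;  // "CPU Time" in seconds
    double loads;       // "Loads"
    double stores;      // "Stores"
    double llc_misses;  // "LLC Miss Count"
};

// The three consecutive runs, in the order reported.
const RunSummary kRuns[3] = {
    {8.303, 17385660000.0, 7386000000.0, 35280000.0},
    {4.617,  9957780000.0, 4231200000.0, 19740000.0},
    {2.365,  4605780000.0, 1969200000.0,  9780000.0},
};

// Loads recorded per second of reported CPU time.
double loads_per_cpu_second(const RunSummary& r) {
    assert(r.cpu_time_s > 0.0);
    return r.loads / r.cpu_time_s;
}

// Ratio of recorded loads to recorded stores.
double loads_per_store(const RunSummary& r) {
    assert(r.stores > 0.0);
    return r.loads / r.stores;
}

// LLC misses per 1000 recorded loads.
double llc_misses_per_1000_loads(const RunSummary& r) {
    assert(r.loads > 0.0);
    return 1000.0 * r.llc_misses / r.loads;
}

// Print the normalized values for each run, one line per run.
void print_normalized_runs() {
    for (const RunSummary& r : kRuns) {
        std::printf("%.3g loads/s, %.2f loads/store, %.2f LLC misses per 1000 loads\n",
                    loads_per_cpu_second(r), loads_per_store(r),
                    llc_misses_per_1000_loads(r));
    }
}
```

The normalized values are close to each other in all three runs (roughly 2e9 loads per CPU second, between 2.3 and 2.4 loads per store, about 2 LLC misses per 1000 loads). That would be consistent with the runs differing in how much of the execution was recorded rather than in memory behavior, although the summaries alone do not prove it.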

3. On the timeline chart:
The upper chart shows the thread running state (green) and CPU time (brown).
The lower chart shows DRAM bandwidth (brown: Total, green: Read, red: Write).

Shailen_Sobhee · ‎11-08-2016

Ayrat,

Could you please run a General Exploration analysis and see if you are back-end bound? Also, if you have an IPS account, you may wish to send us your code snippets so we can analyse them on our end.

Regards,

Shailen

Ayrat_S_ · ‎11-08-2016

Here is the Summary page of General Exploration:


Elapsed Time: 36.789s
    Clockticks: 50,970,200,000
    Instructions Retired: 55,040,500,000
    CPI Rate: 0.926
    MUX Reliability: 0.937
    Front-End Bound: 1.6%
        Front-End Latency: 0.8%
            ICache Misses: 0.1%
            ITLB Overhead: 0.0%
            Branch Resteers: 0.1%
            DSB Switches: 0.0%
            Length Changing Prefixes: 0.0%
            MS Switches: 1.0%
        Front-End Bandwidth: 0.8%
            Front-End Bandwidth MITE: 1.6%
            Front-End Bandwidth DSB: 0.7%
            Front-End Bandwidth LSD: 0.0%
    Bad Speculation: 0.2%
        Branch Mispredict: 0.0%
        Machine Clears: 0.2%
    Back-End Bound: 73.1%
        Memory Bound: 50.7%
            L1 Bound: 6.5%
                DTLB Overhead: 1.1%
                Loads Blocked by Store Forwarding: 0.1%
                Lock Latency: 0.0%
                Split Loads: 0.0%
                4K Aliasing: 4.0%
                FB Full: 0.0%
            L2 Bound: 0.0%
            L3 Bound: 0.1%
                Contested Accesses: 0.0%
                Data Sharing: 0.0%
                L3 Latency: 0.4%
                SQ Full: 0.4%
            DRAM Bound: 24.6%
                Memory Bandwidth: 29.5%
                Memory Latency: 55.0%
                    LLC Miss: 28.4%
            Store Bound: 11.9%
                Store Latency: 2.9%
                False Sharing: 0.0%
                Split Stores: 0.0%
                DTLB Store Overhead: 0.4%
        Core Bound: 22.4%
            Divider: 46.8%
            Port Utilization: 18.9%
                Cycles of 0 Ports Utilized: 33.0%
                Cycles of 1 Port Utilized: 16.7%
                Cycles of 2 Ports Utilized: 18.9%
                Cycles of 3+ Ports Utilized: 18.1%
                    Port 0: 27.4%
                    Port 1: 16.6%
                    Port 2: 29.5%
                    Port 3: 29.6%
                    Port 4: 27.3%
                    Port 5: 3.8%
    Retiring: 25.0%
        General Retirement: 24.4%
            FP Arithmetic: 39.5%
                FP x87: 0.0%
                FP Scalar: 39.5%
                FP Vector: 0.0%
            Other: 60.5%
        Microcode Sequencer: 0.7%
            Assists: 0.0%
    Total Thread Count: 2
    Paused Time: 22.473s

I have attached my test program to this post. I am using Visual C++ in Visual Studio 2015.