VTune Profiler 2020: memory access collector, LLC miss count is zero

Sarpangala_Venkatesh · ‎02-02-2020

Operating system and version

NAME=Fedora
VERSION="27 (Twenty Seven)"
ID=fedora
VERSION_ID=27
PRETTY_NAME="Fedora 27 (Twenty Seven)"
ANSI_COLOR="0;34"
CPE_NAME="cpe:/o:fedoraproject:fedora:27"
HOME_URL="https://fedoraproject.org/"
SUPPORT_URL="https://fedoraproject.org/wiki/Communicating_and_getting_help"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=27
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=27
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
Fedora release 27 (Twenty Seven)
cpe:/o:fedoraproject:fedora:27

Kernel version: 4.18.8-100.fc27.x86_64

Tool version

Intel(R) VTune(TM) Profiler 2020 (build 605129) Command Line Tool
Copyright (C) 2009-2019 Intel Corporation. All rights reserved.

Compiler version

gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-6)

Steps to reproduce the error

VTune with the memory-access collector always reports the LLC Miss Count as zero. The Persistent Memory Bound metric is also 0%, which is suspect considering the high number of parallel I/O threads. I have tried changing the kernel (5.0, and 5.3 on another system with similar CPUs), using driverless mode instead of the event-based sampling driver, rebuilding VTune with different options (per-user vs. system-wide), and allowing multiple runs to avoid multiplexing the performance counters. Nothing has worked so far, and the LLC miss count is always zero.

Below is the VTune log of a sample run demonstrating the problem. I would highly appreciate it if someone could shed some light on this issue. Thanks in advance.

    CPU Time: 75.415s
    Memory Bound: 48.7% of Pipeline Slots
     | The metric value is high. This may indicate that a significant fraction
     | of execution pipeline slots could be stalled due to demand memory load
     | and stores. Explore the metric breakdown by memory hierarchy, memory
     | bandwidth information, and correlation by memory objects.
     |
        L1 Bound: 17.8% of Clockticks
         | This metric shows how often machine was stalled without missing the
         | L1 data cache. The L1 cache typically has the shortest latency.
         | However, in certain cases like loads blocked on older stores, a load
         | might suffer a high latency even though it is being satisfied by the
         | L1.
         |
        L2 Bound: 0.2% of Clockticks
        L3 Bound: 0.3% of Clockticks
        DRAM Bound: 0.0% of Clockticks
            DRAM Bandwidth Bound: 0.0% of Elapsed Time
        Store Bound: 12.6% of Clockticks
        NUMA: % of Remote Accesses: 0.0%
        UPI Utilization Bound: 0.0% of Elapsed Time
        Persistent Memory Bound: 0.0% of Clockticks
            Persistent Memory Bandwidth Bound: 0.0% of Elapsed Time
    Loads: 87,025,610,690
    Stores: 36,653,099,560
    LLC Miss Count: 0
        Local DRAM Access Count: 0
        Remote DRAM Access Count: 0
        Local Persistent Memory Access Count: 0
        Remote Persistent Memory Access Count: 0
        Remote Cache Access Count: 0
    Average Latency (cycles): 48
    Total Thread Count: 30
    Paused Time: 0s

Bandwidth Utilization
Bandwidth Domain                          Platform Maximum  Observed Maximum  Average  % of Elapsed Time with High BW Utilization(%)
----------------------------------------  ----------------  ----------------  -------  ---------------------------------------------
DRAM, GB/sec                              220                         18.300    6.460                                           0.0%
DRAM Single-Package, GB/sec               110                         18.200    6.706                                           0.0%
UPI Utilization Single-link, (%)          100                         10.800    6.538                                           0.0%
Persistent Memory, GB/sec                 60                          10.600    5.603                                           0.0%
Persistent Memory Single-Package, GB/sec  30                          10.600    5.207                                           0.0%

Collection and Platform Info
    Application Command Line: mpirun "--cpu-set" "24-43" "-np" "20" "--wdir" "./writer" "--bind-to" "core" "--mca" "btl" "tcp,self" "../workflowwriters" "1" "67108864" "16"
    User Name: ranjan
    Operating System: 4.18.8-100.fc27.x86_64 NAME=Fedora VERSION="27 (Twenty Seven)" ID=fedora VERSION_ID=27 PRETTY_NAME="Fedora 27 (Twenty Seven)" ANSI_COLOR="0;34" CPE_NAME="cpe:/o:fedoraproject:fedora:27" HOME_URL="https://fedoraproject.org/" SUPPORT_URL="https://fedoraproject.org/wiki/Communicating_and_getting_help" BUG_REPORT_URL="https://bugzilla.redhat.com/" REDHAT_BUGZILLA_PRODUCT="Fedora" REDHAT_BUGZILLA_PRODUCT_VERSION=27 REDHAT_SUPPORT_PRODUCT="Fedora" REDHAT_SUPPORT_PRODUCT_VERSION=27 PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
    Computer Name: aep03
    Result Size: 55 MB
    Collection start time: 23:38:34 31/01/2020 UTC
    Collection stop time: 23:38:43 31/01/2020 UTC

Collector Type: Event-based sampling driver
    CPU
        Name: Intel(R) Xeon(R) Processor code named Cascadelake
        Frequency: 2.394 GHz
        Logical CPU Count: 96
        Max DRAM Single-Package Bandwidth: 110.000 GB/s
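
As a rough sanity check on why a zero count looks wrong, the average DRAM bandwidth in the report can be converted into an approximate number of cache-line transfers. This is only a back-of-the-envelope sketch: it assumes 64-byte cache lines, decimal GB, and an elapsed time of about 9 s taken from the collection start and stop times. DRAM traffic also includes prefetches and write-backs, so this is an upper-bound style estimate, not the value the LLC Miss Count metric should show. The function name is made up for illustration.

```python
# Back-of-the-envelope check using numbers copied from the report above.
CACHE_LINE_BYTES = 64      # assumption: 64-byte cache lines
BYTES_PER_GB = 1e9         # assumption: the report uses decimal GB/sec

avg_dram_gb_per_s = 6.460  # "DRAM, GB/sec", Average column
elapsed_s = 9              # collection window 23:38:34 -> 23:38:43
reported_llc_misses = 0    # "LLC Miss Count" from the summary


def estimated_line_transfers(gb_per_s, seconds):
    """Approximate count of cache-line-sized transfers to or from DRAM."""
    return int(gb_per_s * BYTES_PER_GB * seconds / CACHE_LINE_BYTES)


transfers = estimated_line_transfers(avg_dram_gb_per_s, elapsed_s)
print(f"approx. DRAM line transfers: {transfers:,}")
print(f"reported LLC misses:         {reported_llc_misses:,}")
```

Under these assumptions the estimate is on the order of several hundred million line transfers. Even if only a small fraction of that traffic came from demand loads, a reported count of exactly zero looks inconsistent with the observed bandwidth.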