Sakura
Beginner
551 Views

VTune counting cache hit/miss wrong?

Hi!

I am using VTune to measure hits and misses (loads) at the different cache levels. I assumed that L2_MISS = L3_HIT + L3_MISS (and similarly for L1 and L2), but the output below does not seem to satisfy this.

Config : Intel Core i3-5005U + Windows 10

CPU
    Name:    Intel(R) Core(TM) Processor code named Broadwell
    Frequency:    2.0 GHz
    Logical CPU Count:    4

Elapsed Time:    60.004s
    CPU Time:    25.576s
    CPI Rate:    1.641
    Total Thread Count:    4
    Paused Time:    0s

 

Hardware Events
    Hardware Event Type    Hardware Event Count    Hardware Event Sample Count    Events Per Sample
    BACLEARS.ANY    223,106,693    97    100003
    BR_MISP_RETIRED.ALL_BRANCHES_PS    64,401,449    7    400009
    CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE    1,497,344,919    651    100003
    CPU_CLK_UNHALTED.REF_TSC    51,034,000,000    25,517    2000000
    CPU_CLK_UNHALTED.REF_XCLK    2,645,079,350    1,150    100003
    CPU_CLK_UNHALTED.THREAD    51,314,000,000    25,657    2000000
    CPU_CLK_UNHALTED.THREAD_P    47,242,070,863    1,027    2000003
    CYCLE_ACTIVITY.STALLS_L1D_MISS    13,616,020,424    296    2000003
    CYCLE_ACTIVITY.STALLS_L2_MISS    10,350,015,525    225    2000003
    CYCLE_ACTIVITY.STALLS_MEM_ANY    20,332,030,498    442    2000003
    CYCLE_ACTIVITY.STALLS_TOTAL    29,992,044,988    652    2000003
    INST_RETIRED.ANY    31,262,000,000    15,631    2000000
    INST_RETIRED.PREC_DIST    30,130,045,195    655    2000003
    INST_RETIRED.X87    0    0    2000003
    INT_MISC.RECOVERY_CYCLES    276,000,414    6    2000003
    ITLB_MISSES.STLB_HIT    50,601,518    22    100003
    ITLB_MISSES.WALK_COMPLETED    85,102,553    37    100003
    ITLB_MISSES.WALK_DURATION    2,884,286,526    1,254    100003
    L1D.REPLACEMENT    1,518,002,277    33    2000003
    L1D_PEND_MISS.FB_FULL    46,000,069    1    2000003
    L1D_PEND_MISS.PENDING    33,810,050,715    735    2000003
    L2_RQSTS.RFO_HIT    55,200,828    12    200003
    LD_BLOCKS.NO_SR    0    0    100003
    LD_BLOCKS.STORE_FORWARD    39,101,173    17    100003
    LD_BLOCKS_PARTIAL.ADDRESS_ALIAS    71,302,139    31    100003
    LSD.CYCLES_4_UOPS    138,000,207    3    2000003
    LSD.CYCLES_ACTIVE    92,000,138    2    2000003
    LSD.UOPS    506,000,759    11    2000003
    MACHINE_CLEARS.COUNT    2,300,069    1    100003
    MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM_PS    27,154,927    59    20011
    MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT_PS    10,585,819    23    20011
    MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS_PS    5,523,036    12    20011
    MEM_LOAD_UOPS_RETIRED.HIT_LFB_PS    565,816,974    246    100003
    MEM_LOAD_UOPS_RETIRED.L1_HIT_PS    6,716,010,074    146    2000003
    MEM_LOAD_UOPS_RETIRED.L1_MISS_PS    761,322,839    331    100003
    MEM_LOAD_UOPS_RETIRED.L2_HIT_PS    434,713,041    189    100003
    MEM_LOAD_UOPS_RETIRED.L2_MISS_PS    332,489,587    289    50021
    MEM_LOAD_UOPS_RETIRED.L3_HIT_PS    287,620,750    250    50021
    MEM_LOAD_UOPS_RETIRED.L3_MISS    9,200,644    4    100007
    MEM_LOAD_UOPS_RETIRED.L3_MISS_PS    6,900,483    3    100007
    MEM_UOPS_RETIRED.ALL_STORES_PS    5,888,008,832    128    2000003
    MEM_UOPS_RETIRED.LOCK_LOADS_PS    262,218,354    114    100007
    MEM_UOPS_RETIRED.SPLIT_LOADS_PS    4,600,138    2    100003
    MEM_UOPS_RETIRED.SPLIT_STORES_PS    0    0    100003
    MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS    108,103,243    47    100003
    MEM_UOPS_RETIRED.STLB_MISS_STORES_PS    2,300,069    1    100003
 

Any help regarding this would be appreciated.

Thanks!

9 Replies
ArunJ_Intel
Moderator
551 Views

Hi Sakura,

 

Could you please share the VTune results with us so we can look into the issue with the cache hit and miss counts?

 

Arun Jose

Dmitry_R_Intel1
Employee
551 Views

I have a few suggestions:

1. Please use the 'Limit PMU collection to counting' option to improve the accuracy.

2. Please try to disable the hardware prefetchers, if possible (through BIOS or MSR, as described here: https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processo...). The MEM_LOAD_UOPS_RETIRED events account only for demand loads; if the data was brought in by a prefetcher, they won't increment.

McCalpinJohn
Black Belt
551 Views

It is important to understand that VTune uses a sampling methodology and that VTune is multiplexing the counters across the various events.

If you look at the third column ("Hardware Event Sample Count") for MEM_LOAD_UOPS_RETIRED.L3_MISS and MEM_LOAD_UOPS_RETIRED.L3_MISS_PS, you will see that those events were only counted 4 times and 3 times, respectively.  The "Hardware Event Count" in column 2 is not directly measured -- it is a scaled estimate based on the "Hardware Event Sample Count", the "Events Per Sample" value, and the fraction of the execution time during which each performance counter event was active.

 

Using the same L3_MISS events as an example: 

MEM_LOAD_UOPS_RETIRED.L3_MISS    9,200,644    4    100007
MEM_LOAD_UOPS_RETIRED.L3_MISS_PS    6,900,483    3    100007

Dividing the "Hardware Event Count" by the "Hardware Event Sample Count" and then dividing by the "Events Per Sample" value gives exactly 23 for both rows.  This suggests that VTune was multiplexing 23 different performance counter event sets, and that each set was only being measured (approximately) 1/23rd of the time.  Each of the "Hardware Event Counts" should be interpreted as having a relative uncertainty of (at least) 1/(Hardware Event Sample Count) -- i.e., 25% for MEM_LOAD_UOPS_RETIRED.L3_MISS and 33% for MEM_LOAD_UOPS_RETIRED.L3_MISS_PS.
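The arithmetic above can be sketched as a small helper. This is only an illustration: the struct and function names are made up for this example and are not part of VTune, and the "1/samples" bound is simply the rule of thumb stated above.

```cpp
#include <cassert>
#include <cstdint>

// One row of the VTune "Hardware Events" table in sampling mode.
// (Hypothetical type, named after the table columns.)
struct EventRow {
    std::uint64_t event_count;        // "Hardware Event Count" (scaled estimate)
    std::uint64_t sample_count;       // "Hardware Event Sample Count"
    std::uint64_t events_per_sample;  // "Events Per Sample"
};

// Extra scale factor applied on top of samples * events_per_sample.
// A value near N suggests the event was active roughly 1/N of the time.
double multiplex_factor(const EventRow& row) {
    assert(row.sample_count > 0 && row.events_per_sample > 0);
    return static_cast<double>(row.event_count) /
           (static_cast<double>(row.sample_count) *
            static_cast<double>(row.events_per_sample));
}

// Lower bound on the relative uncertainty, per the rule of thumb above.
double min_relative_uncertainty(const EventRow& row) {
    assert(row.sample_count > 0);
    return 1.0 / static_cast<double>(row.sample_count);
}
```

Feeding it the two L3_MISS rows from the first post (9,200,644 / 4 / 100007 and 6,900,483 / 3 / 100007) reproduces the factor of 23 and the 25% and 33% bounds.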

If you want more precise estimates, you should limit the sampling to a much smaller number of counters.  
The most precise numbers come from measuring a single set of events for the full duration of the program, rather than using a sampling methodology.

It is also true that the MEM_LOAD_UOPS_RETIRED events only count accesses due to demand loads, and not those due to activity of the L2 HW prefetchers.  When the prefetchers are working well the L2 and L3 cache miss counts can be reduced substantially.   This makes these events good for finding loads that don't get their data prefetched (and therefore have a much higher chance of causing stalls), but not good for estimating the total amount of traffic through the cache hierarchy.  The L2_RQSTS events and the OFFCORE_RESPONSE events are more useful for getting an idea of the total traffic for various transaction types at each level of the cache hierarchy.

Sakura
Beginner
551 Views

I tried disabling the H/W prefetchers using PCM:

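#include <iostream>
#include <bitset>
#include <memory>   // std::shared_ptr, std::make_shared
#include <vector>

#include "cpucounters.h"
#include "msr.h"

// Prefetcher control MSR: setting a bit in bits 0-3 disables one H/W prefetcher
constexpr uint64 MSR_NUM = 0x1A4U;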

int main(int argc, const char *argv[]) {
    PCM *m = PCM::getInstance();
    if (m->program() != PCM::Success) {
        std::cout << "Failed to init PCM" << std::endl;
        return 1;
    }

    std::vector<std::shared_ptr<SafeMsrHandle>> MSR;

    for (int i = 0; i < m->getNumCores(); ++i) {
        if (m->isCoreOnline(int32(i))) {
            MSR.push_back(std::make_shared<SafeMsrHandle>(i));
        } else {
            MSR.push_back(std::make_shared<SafeMsrHandle>());
        }
    }

    uint64 val = 0;

    // Read the current MSR value on every core
    for (auto &msr : MSR) {
        if (!msr->read(MSR_NUM, &val)) {
            std::cout << msr->getCoreId() << " error while reading" << std::endl;
        }
        std::cout << "Core : " << msr->getCoreId() << " \t "
                  << "MSR Value : " << std::bitset<64>(val) << std::endl;
    }

    // Write the MSR value: bits 0-3 set disables all H/W prefetchers
    val = 0x0FU;

    for (auto &msr : MSR) {
        if (!msr->write(MSR_NUM, val)) {
            std::cout << "error writing to MSR of core : " << msr->getCoreId() << std::endl;
        }
    }

    std::cout << std::endl;

    // Read the MSR value back to confirm the write took effect
    for (auto &msr : MSR) {
        if (!msr->read(MSR_NUM, &val)) {
            std::cout << msr->getCoreId() << " error while reading" << std::endl;
        }
        std::cout << "Core : " << msr->getCoreId() << " \t "
                  << "MSR Value : " << std::bitset<64>(val) << std::endl;
    }

    m->resetPMU();

    return 0;
}
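
To make the printed MSR bit patterns easier to read, the low four bits can be decoded as below. This is a sketch that assumes the bit layout from Intel's prefetcher-control disclosure for this processor generation (bit 0: L2 hardware prefetcher, bit 1: L2 adjacent cache line prefetcher, bit 2: DCU prefetcher, bit 3: DCU IP prefetcher, with a set bit meaning "disabled"). The type and function names are invented for the example, and the layout should be checked against the documentation for the actual CPU.

```cpp
#include <cstdint>

// Decoded view of MSR 0x1A4 (assumed layout; a set bit DISABLES the prefetcher).
struct PrefetcherState {
    bool l2_hw_enabled;             // bit 0: L2 hardware prefetcher
    bool l2_adjacent_line_enabled;  // bit 1: L2 adjacent cache line prefetcher
    bool dcu_enabled;               // bit 2: DCU (L1 data) prefetcher
    bool dcu_ip_enabled;            // bit 3: DCU IP prefetcher
};

// Turn the raw MSR value into per-prefetcher enabled flags.
PrefetcherState decode_prefetch_control(std::uint64_t msr_value) {
    return PrefetcherState{
        (msr_value & (1ULL << 0)) == 0,
        (msr_value & (1ULL << 1)) == 0,
        (msr_value & (1ULL << 2)) == 0,
        (msr_value & (1ULL << 3)) == 0,
    };
}

// True only if all four prefetchers are reported as disabled.
bool all_prefetchers_disabled(const PrefetcherState& s) {
    return !s.l2_hw_enabled && !s.l2_adjacent_line_enabled &&
           !s.dcu_enabled && !s.dcu_ip_enabled;
}
```

With the value 0x0F written by the program above, all four flags decode as disabled.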

 

Config #1 : Disabled H/W Prefetcher and Enabled 'Limit PMU collection to counting'

Hardware Events
    Hardware Event Type    Hardware Event Count    Hardware Event Sample Count    Events Per Sample    Precise
    BACLEARS.ANY    557,855,130    4    [Unknown]    
    BR_MISP_RETIRED.ALL_BRANCHES_PS    216,106,360    4    [Unknown]    
    CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE    7,621,940,670    4    [Unknown]    
    CPU_CLK_UNHALTED.REF_TSC    181,953,469,200    4    [Unknown]    
    CPU_CLK_UNHALTED.REF_XCLK    9,097,798,600    4    [Unknown]    
    CYCLE_ACTIVITY.STALLS_L1D_MISS    26,758,381,920    4    [Unknown]    
    CYCLE_ACTIVITY.STALLS_L2_MISS    24,291,134,900    4    [Unknown]    
    CYCLE_ACTIVITY.STALLS_MEM_ANY    57,794,419,100    4    [Unknown]    
    CYCLE_ACTIVITY.STALLS_TOTAL    87,167,607,370    4    [Unknown]    
    ICACHE.IFDATA_STALL    15,173,609,210    4    [Unknown]    
    INST_RETIRED.ANY    183,617,012,740    4    [Unknown]    
    L1D_PEND_MISS.FB_FULL    463,431,990    4    [Unknown]    
    L1D_PEND_MISS.PENDING    70,342,063,340    4    [Unknown]    
    L2_RQSTS.ALL_CODE_RD    5,923,272,380    4    [Unknown]    
    L2_RQSTS.ALL_DEMAND_DATA_RD    3,577,095,470    4    [Unknown]    
    L2_RQSTS.ALL_DEMAND_MISS    3,304,622,310    4    [Unknown]    
    L2_RQSTS.ALL_DEMAND_REFERENCES    10,677,066,390    4    [Unknown]    
    L2_RQSTS.ALL_PF    15,687,420    4    [Unknown]    
    L2_RQSTS.ALL_RFO    1,151,078,890    4    [Unknown]    
    L2_RQSTS.DEMAND_DATA_RD_HIT    2,447,918,870    4    [Unknown]    
    L2_RQSTS.L2_PF_HIT    0    4    [Unknown]    
    L2_RQSTS.L2_PF_MISS    0    4    [Unknown]    
    L2_RQSTS.MISS    3,365,589,580    4    [Unknown]    
    L2_RQSTS.RFO_HIT    533,051,550    4    [Unknown]    
    L2_RQSTS.RFO_MISS    629,434,250    4    [Unknown]    
    MACHINE_CLEARS.COUNT    16,960,030    4    [Unknown]    
    MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM_PS    126,786,610    4    [Unknown]    
    MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT_PS    10,112,060    4    [Unknown]    
    MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS_PS    4,269,390    4    [Unknown]    
    MEM_LOAD_UOPS_RETIRED.HIT_LFB_PS    2,061,084,670    4    [Unknown]    
    MEM_LOAD_UOPS_RETIRED.L1_HIT    38,064,387,780    4    [Unknown]    
    MEM_LOAD_UOPS_RETIRED.L1_MISS    2,392,083,810    4    [Unknown]    
    MEM_LOAD_UOPS_RETIRED.L2_HIT    1,752,622,830    4    [Unknown]    
    MEM_LOAD_UOPS_RETIRED.L2_MISS    639,460,980    4    [Unknown]    
    MEM_LOAD_UOPS_RETIRED.L3_HIT    431,623,490    4    [Unknown]    
    MEM_LOAD_UOPS_RETIRED.L3_MISS    51,383,760    4    [Unknown]    
    MEM_UOPS_RETIRED.ALL_STORES_PS    28,383,863,750    4    [Unknown]    
    MEM_UOPS_RETIRED.LOCK_LOADS_PS    825,732,500    4    [Unknown]    
    MEM_UOPS_RETIRED.SPLIT_LOADS_PS    52,038,990    4    [Unknown]    
    MEM_UOPS_RETIRED.SPLIT_STORES_PS    21,278,130    4    [Unknown]    
    MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS    373,317,710    4    [Unknown]    
    MEM_UOPS_RETIRED.STLB_MISS_STORES_PS    61,491,230    4    [Unknown]    
 

Config #2 : Disabled H/W Prefetcher and sampling mode

Hardware Events
    Hardware Event Type    Hardware Event Count    Hardware Event Sample Count    Events Per Sample    Precise
    BACLEARS.ANY    228,006,840    190    100003    False
    BR_MISP_RETIRED.ALL_BRANCHES_PS    62,401,404    13    400009    True
    CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE    1,671,650,148    1,393    100003    False
    CPU_CLK_UNHALTED.REF_TSC    50,250,000,000    25,125    2000000    False
    CPU_CLK_UNHALTED.REF_XCLK    2,530,875,924    2,109    100003    False
    CYCLE_ACTIVITY.STALLS_L1D_MISS    9,120,013,680    380    2000003    False
    CYCLE_ACTIVITY.STALLS_L2_MISS    8,304,012,456    346    2000003    False
    CYCLE_ACTIVITY.STALLS_MEM_ANY    18,360,027,540    765    2000003    False
    CYCLE_ACTIVITY.STALLS_TOTAL    25,632,038,448    1,068    2000003    False
    ICACHE.IFDATA_STALL    6,288,009,432    262    2000003    False
    INST_RETIRED.ANY    47,268,000,000    23,634    2000000    False
    L1D_PEND_MISS.FB_FULL    72,000,108    3    2000003    False
    L1D_PEND_MISS.PENDING    22,992,034,488    958    2000003    False
    L2_RQSTS.ALL_CODE_RD    2,172,032,580    905    200003    False
    L2_RQSTS.ALL_DEMAND_DATA_RD    1,288,819,332    537    200003    False
    L2_RQSTS.ALL_DEMAND_MISS    1,353,620,304    564    200003    False
    L2_RQSTS.ALL_DEMAND_REFERENCES    3,832,857,492    1,597    200003    False
    L2_RQSTS.ALL_PF    2,400,036    1    200003    False
    L2_RQSTS.ALL_RFO    388,805,832    162    200003    False
    L2_RQSTS.DEMAND_DATA_RD_HIT    852,012,780    355    200003    False
    L2_RQSTS.L2_PF_HIT    0    0    200003    False
    L2_RQSTS.L2_PF_MISS    0    0    200003    False
    L2_RQSTS.MISS    1,272,019,080    530    200003    False
    L2_RQSTS.RFO_HIT    172,802,592    72    200003    False
    L2_RQSTS.RFO_MISS    189,602,844    79    200003    False
    MACHINE_CLEARS.COUNT    2,400,072    2    100003    False
    MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM_PS    39,621,780    165    20011    True
    MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT_PS    3,361,848    14    20011    True
    MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS_PS    960,528    4    20011    True
    MEM_LOAD_UOPS_RETIRED.HIT_LFB_PS    840,025,200    700    100003    True
    MEM_LOAD_UOPS_RETIRED.L1_HIT    10,704,016,056    446    2000003    False
    MEM_LOAD_UOPS_RETIRED.L1_HIT_PS    10,656,015,984    444    2000003    True
    MEM_LOAD_UOPS_RETIRED.L1_MISS    852,025,560    710    100003    False
    MEM_LOAD_UOPS_RETIRED.L1_MISS_PS    852,025,560    710    100003    True
    MEM_LOAD_UOPS_RETIRED.L2_HIT    615,618,468    513    100003    False
    MEM_LOAD_UOPS_RETIRED.L2_HIT_PS    618,018,540    515    100003    True
    MEM_LOAD_UOPS_RETIRED.L2_MISS    217,291,224    362    50021    False
    MEM_LOAD_UOPS_RETIRED.L2_MISS_PS    217,291,224    362    50021    True
    MEM_LOAD_UOPS_RETIRED.L3_HIT    180,075,600    300    50021    False
    MEM_LOAD_UOPS_RETIRED.L3_HIT_PS    180,075,600    300    50021    True
    MEM_LOAD_UOPS_RETIRED.L3_MISS    6,000,420    5    100007    False
    MEM_LOAD_UOPS_RETIRED.L3_MISS_PS    6,000,420    5    100007    True
    MEM_UOPS_RETIRED.ALL_STORES_PS    7,296,010,944    304    2000003    True
    MEM_UOPS_RETIRED.LOCK_LOADS_PS    261,618,312    218    100007    True
    MEM_UOPS_RETIRED.SPLIT_LOADS_PS    10,800,324    9    100003    True
    MEM_UOPS_RETIRED.SPLIT_STORES_PS    0    0    100003    True
    MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS    127,203,816    106    100003    True
    MEM_UOPS_RETIRED.STLB_MISS_STORES_PS    18,000,540    15    100003    True
 

 

Config #3: Enabled H/W Prefetcher and 'Limit PMU collection to counting'

Hardware Events
    Hardware Event Type    Hardware Event Count    Hardware Event Sample Count    Events Per Sample    Precise
    BACLEARS.ANY    695,117,540    4    [Unknown]    
    BR_MISP_RETIRED.ALL_BRANCHES_PS    292,146,210    4    [Unknown]    
    CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE    7,303,358,670    4    [Unknown]    
    CPU_CLK_UNHALTED.REF_TSC    206,639,106,600    4    [Unknown]    
    CPU_CLK_UNHALTED.REF_XCLK    10,331,982,120    4    [Unknown]    
    CYCLE_ACTIVITY.STALLS_L1D_MISS    41,605,016,090    4    [Unknown]    
    CYCLE_ACTIVITY.STALLS_L2_MISS    37,148,777,030    4    [Unknown]    
    CYCLE_ACTIVITY.STALLS_MEM_ANY    68,923,672,000    4    [Unknown]    
    CYCLE_ACTIVITY.STALLS_TOTAL    105,928,596,800    4    [Unknown]    
    ICACHE.IFDATA_STALL    30,367,994,640    4    [Unknown]    
    INST_RETIRED.ANY    189,024,991,010    4    [Unknown]    
    L1D_PEND_MISS.FB_FULL    484,582,810    4    [Unknown]    
    L1D_PEND_MISS.PENDING    106,322,399,380    4    [Unknown]    
    L2_RQSTS.ALL_CODE_RD    6,014,740,260    4    [Unknown]    
    L2_RQSTS.ALL_DEMAND_DATA_RD    3,186,052,550    4    [Unknown]    
    L2_RQSTS.ALL_DEMAND_MISS    4,528,008,390    4    [Unknown]    
    L2_RQSTS.ALL_DEMAND_REFERENCES    10,255,841,370    4    [Unknown]    
    L2_RQSTS.ALL_PF    11,375,132,880    4    [Unknown]    
    L2_RQSTS.ALL_RFO    1,048,756,160    4    [Unknown]    
    L2_RQSTS.DEMAND_DATA_RD_HIT    1,575,681,790    4    [Unknown]    
    L2_RQSTS.L2_PF_HIT    3,468,292,610    4    [Unknown]    
    L2_RQSTS.L2_PF_MISS    7,503,845,180    4    [Unknown]    
    L2_RQSTS.MISS    12,229,547,290    4    [Unknown]    
    L2_RQSTS.RFO_HIT    565,553,480    4    [Unknown]    
    L2_RQSTS.RFO_MISS    481,125,090    4    [Unknown]    
    MACHINE_CLEARS.COUNT    18,138,280    4    [Unknown]    
    MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM_PS    90,361,080    4    [Unknown]    
    MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT_PS    35,474,800    4    [Unknown]    
    MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS_PS    16,199,850    4    [Unknown]    
    MEM_LOAD_UOPS_RETIRED.HIT_LFB_PS    1,885,967,180    4    [Unknown]    
    MEM_LOAD_UOPS_RETIRED.L1_HIT    40,356,067,870    4    [Unknown]    
    MEM_LOAD_UOPS_RETIRED.L1_MISS    2,157,009,930    4    [Unknown]    
    MEM_LOAD_UOPS_RETIRED.L2_HIT    1,142,602,610    4    [Unknown]    
    MEM_LOAD_UOPS_RETIRED.L2_MISS    1,014,407,320    4    [Unknown]    
    MEM_LOAD_UOPS_RETIRED.L3_HIT    790,380,810    4    [Unknown]    
    MEM_LOAD_UOPS_RETIRED.L3_MISS    79,202,220    4    [Unknown]    
    MEM_UOPS_RETIRED.ALL_STORES_PS    29,512,105,850    4    [Unknown]    
    MEM_UOPS_RETIRED.LOCK_LOADS_PS    800,299,390    4    [Unknown]    
    MEM_UOPS_RETIRED.SPLIT_LOADS_PS    63,558,820    4    [Unknown]    
    MEM_UOPS_RETIRED.SPLIT_STORES_PS    30,784,420    4    [Unknown]    
    MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS    259,247,280    4    [Unknown]    
    MEM_UOPS_RETIRED.STLB_MISS_STORES_PS    55,236,070    4    [Unknown]    
 

Even after disabling the H/W prefetcher,

(MEM_LOAD_UOPS_RETIRED.L2_MISS != MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_RETIRED.L3_MISS).

In counting mode, L1_MISS equals L2_HIT + L2_MISS exactly, and in sampling mode they are roughly the same, but the L2 and L3 hit/miss counts never satisfy the equation above. (MEM_LOAD_UOPS_RETIRED.L3_MISS is far off.)

A similar thing happens with 'Analysis in system wide mode'.

 

MEM_LOAD_UOPS_RETIRED.L3_MISS has a very low 'Hardware Event Sample Count', but even with that uncertainty, the count is off. 
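
For reference, the consistency check being applied here can be written as a single ratio. This is a minimal sketch with a made-up function name; the numbers in the usage note are the Config #1 (counting mode, prefetchers disabled) values from the tables above.

```cpp
#include <cassert>
#include <cstdint>

// Ratio of (hits + misses seen at the next cache level) to the misses
// reported at this level. A value of 1.0 means the counts are consistent.
double next_level_coverage(std::uint64_t misses_here,
                           std::uint64_t hits_next,
                           std::uint64_t misses_next) {
    assert(misses_here > 0);
    return static_cast<double>(hits_next + misses_next) /
           static_cast<double>(misses_here);
}
```

For Config #1, `next_level_coverage(2392083810, 1752622830, 639460980)` (L1_MISS vs. L2_HIT + L2_MISS) is exactly 1.0, while `next_level_coverage(639460980, 431623490, 51383760)` (L2_MISS vs. L3_HIT + L3_MISS) is only about 0.755.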

McCalpinJohn
Black Belt
552 Views

The estimates are still only good to about 25% with 4 "hardware event sample counts".

Multiplexing the counters over this many different counter sets adds a level of uncertainty that cannot easily be quantified.

If you restrict the counters to a single set that captures the values that you are trying to compare, the results should be reliable enough for you to decide whether the counts are consistent.  You only need three counters for this test:

  • MEM_LOAD_UOPS_RETIRED.L2_MISS
  • MEM_LOAD_UOPS_RETIRED.L3_HIT
  • MEM_LOAD_UOPS_RETIRED.L3_MISS

 


Sakura
Beginner
551 Views

Limiting the collection to only 3 counters does indeed improve the consistency. The counts still do not add up exactly, but they are much closer than before. I guess I will resort to multiple runs for various events.

 

Config #1: Disable HW Prefetcher and Enable counting mode

Hardware Events
    Hardware Event Type    Hardware Event Count    Hardware Event Sample Count    Events Per Sample    Precise
    MEM_LOAD_UOPS_RETIRED.L2_MISS    967,014,729    4    [Unknown]    
    MEM_LOAD_UOPS_RETIRED.L3_HIT    650,476,577    4    [Unknown]    
    MEM_LOAD_UOPS_RETIRED.L3_MISS    191,093,546    4    [Unknown]    
 

Config #2 : Disable HW Prefetcher and Enable counting mode [PS Events]

Hardware Events
    Hardware Event Type    Hardware Event Count    Hardware Event Sample Count    Events Per Sample    Precise
    MEM_LOAD_UOPS_RETIRED.L2_MISS_PS    773,652,897    4    [Unknown]    
    MEM_LOAD_UOPS_RETIRED.L3_HIT_PS    521,977,568    4    [Unknown]    
    MEM_LOAD_UOPS_RETIRED.L3_MISS_PS    111,549,760    4    [Unknown]    
 

McCalpinJohn
Black Belt
551 Views

I don't know exactly what tools are available in Windows, but if you want to do arithmetic on counts, it is best to use a tool that is designed to count and not to sample.

On Linux it is easy to use "perf stat" for whole-program counting.  

I just ran a few checks to see whether these counters agree for a simple benchmark.   I set up the STREAM benchmark with 200M double-precision elements per array and 10 iterations.   There are 6 variables loaded in each iteration, plus 1 read in the setup code and 3 reads in the validation code.  So for 10 iterations, I expect 64 loads of variables of type "double".  200M elements * 8 Bytes/element * 64 reads / 64 Bytes/cacheline = 1,600,000,000 cache line reads expected.   Since the arrays are big (1.5 GiB each), I expect essentially all these loads to miss in the L2 and in the L3.
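
The expected-miss estimate can be reproduced with a few lines of arithmetic. This sketch assumes 64-byte cache lines and uses the read counts quoted above (6 array reads per iteration, plus 1 in setup and 3 in validation); the function names are hypothetical.

```cpp
#include <cstdint>

// Assumption: 64-byte cache lines.
constexpr std::uint64_t kCacheLineBytes = 64;

// Full-array reads in one STREAM run: 6 per iteration (Copy 1, Scale 1,
// Add 2, Triad 2), plus 1 in the setup code and 3 in the validation code.
std::uint64_t stream_array_reads(std::uint64_t iterations) {
    return 6 * iterations + 1 + 3;
}

// Cache lines that must be loaded if every array read streams from memory.
std::uint64_t expected_line_reads(std::uint64_t elements,
                                  std::uint64_t bytes_per_element,
                                  std::uint64_t array_reads) {
    return elements * bytes_per_element * array_reads / kCacheLineBytes;
}
```

With 200,000,000 elements of 8 bytes and 10 iterations, `stream_array_reads(10)` is 64 and `expected_line_reads(200000000, 8, 64)` is 1,600,000,000.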

With HW prefetch disabled (and running on one core), I get

$ perf stat -e mem_load_retired.l2_miss -e mem_load_retired.l3_hit -e mem_load_retired.l3_miss ./stream.runtime.COMMON-AVX512.alloc.10x

     1,601,323,903      mem_load_retired.l2_miss                                    
         8,342,415      mem_load_retired.l3_hit                                     
     1,592,966,125      mem_load_retired.l3_miss     

The sum of L3 hit and L3 miss divided by L2 misses is .9999904 -- good to 5 digits.
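
If the check is repeated often, the counts can be pulled out of the "perf stat" text instead of being copied by hand. The parser below is a rough sketch, not a robust tool: it assumes each counter line starts with a comma-grouped integer followed by the event name, as in the output above, and silently skips anything else. The function name is made up for the example.

```cpp
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Parse lines such as "  1,601,323,903      mem_load_retired.l2_miss"
// into a map from event name to count. Lines that do not start with a
// comma-grouped integer (command lines, "<not counted>", etc.) are skipped.
std::map<std::string, std::uint64_t> parse_perf_counts(const std::string& text) {
    std::map<std::string, std::uint64_t> counts;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string number, event;
        if (!(fields >> number >> event)) continue;
        std::uint64_t value = 0;
        bool ok = true;
        bool saw_digit = false;
        for (char c : number) {
            if (c == ',') continue;                       // thousands separator
            if (c < '0' || c > '9') { ok = false; break; }
            value = value * 10 + static_cast<std::uint64_t>(c - '0');
            saw_digit = true;
        }
        if (ok && saw_digit) counts[event] = value;
    }
    return counts;
}
```

Dividing the parsed l3_hit + l3_miss sum by the parsed l2_miss count for the run above gives the same 0.99999 ratio.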

With HW prefetch re-enabled (still on one core), the number of hits and misses decreases by about a factor of 2:

$ perf stat -e mem_load_retired.l2_miss -e mem_load_retired.l3_hit -e mem_load_retired.l3_miss ./stream.runtime.COMMON-AVX512.alloc.10x
       779,846,893      mem_load_retired.l2_miss                                    
         4,895,328      mem_load_retired.l3_hit                                     
       774,940,945      mem_load_retired.l3_miss

Again, the sum of L3 hit and L3 miss is a very close match to L2 miss (.9999863819).   Almost exactly 1/2 of the L2 misses and almost exactly 1/2 of the L3 misses "disappear" because the hardware prefetcher is able to fetch the cache lines into the corresponding level of the cache before the load gets there.  In this case the HW prefetcher can't do much better because it restarts at the beginning of every 4KiB page, so it can't stay far enough "ahead" of the load stream(s).
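
The "almost exactly 1/2" figure is just the fractional drop in demand misses between the two runs. A small sketch of that calculation, with a hypothetical function name:

```cpp
#include <cassert>
#include <cstdint>

// Fraction of demand misses that disappear when the HW prefetchers are
// enabled: 1 - (misses with prefetch on) / (misses with prefetch off).
double demand_miss_reduction(std::uint64_t misses_prefetch_off,
                             std::uint64_t misses_prefetch_on) {
    assert(misses_prefetch_off > 0);
    return 1.0 - static_cast<double>(misses_prefetch_on) /
                 static_cast<double>(misses_prefetch_off);
}
```

Using the single-core counts above, `demand_miss_reduction(1601323903, 779846893)` for L2 misses and `demand_miss_reduction(1592966125, 774940945)` for L3 misses both come out at roughly 0.51.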

If I run on all cores, the memory system gets busier (which increases latency, so the prefetchers are less effective at getting the data into the cache before the load arrives), and the number of L2 and L3 cache misses each increase slightly (about 3.4%):

       806,611,516      mem_load_retired.l2_miss                                    
         4,928,217      mem_load_retired.l3_hit                                     
       801,367,558      mem_load_retired.l3_miss           

With HW prefetch disabled and using all cores, the miss counts are a little bit (~3%) smaller than the expected values, and the sum of L3 hit and miss still matches the L2 misses to better than 4 digits.  (The 3% discrepancy may be in part due to the "next page prefetcher" which can't be disabled.  It would take some careful testing to try to understand the details.)

     1,552,949,519      mem_load_retired.l2_miss                                    
         7,549,747      mem_load_retired.l3_hit                                     
     1,545,250,092      mem_load_retired.l3_miss

Using smaller array sizes (e.g., STREAM_ARRAY_SIZE = 3,145,728 gives exactly 24.0 MiB/array), we get a much higher rate of L3 hits.  In the cases I tested, the L3 hit+miss count was about 3% lower than the L2 miss count, but it would take more detailed work to understand if that is significant.   

ArunJ_Intel
Moderator
551 Views

Hey Sakura,

 

Hope your issue is resolved. Could you please confirm whether the solutions provided here help, or whether there is anything else you need help with?

 

Thanks

Arun Jose

 

ArunJ_Intel
Moderator
551 Views

Hey Sakura,

We are closing this case assuming the solution provided helps. Please feel free to raise a new thread in case of further issues.

 

Thanks

Arun Jose
