Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Get the last level cache miss

GHui
Novice

I don't clearly understand how the following events behave on the CPU.

If I want to get the last-level cache miss percentage, which of these events should I use?

LONGEST_LAT_CACHE.REFERENCE
LONGEST_LAT_CACHE.MISS

MEM_LOAD_UOPS_RETIRED.LLC_HIT
MEM_LOAD_UOPS_RETIRED.LLC_MISS
MEM_LOAD_UOPS_RETIRED.HIT_LFB

MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_NONE

MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS

 

 

Thomas_G_4
New Contributor II

Hi,

The proper selection of events depends on your microarchitecture.

The two LONGEST_LAT_CACHE events have been available since Core2 (I think it was Core2 that introduced the architectural events), but as far as my tests have shown, they only count reads to L3 and not write accesses.

The MEM_LOAD_UOPS_RETIRED events are listed in the specification updates of some architectures as undercounting. As far as I remember, the affected architectures were Sandy Bridge, Ivy Bridge and Haswell. It should be fixed on Broadwell, but I never tested it. As the name suggests, they count only load operations.

The MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_* events are also listed in the specification updates of some architectures as undercounting. I think the same architectures are affected as with MEM_LOAD_UOPS_RETIRED. Moreover, I'm not sure whether these events are what you are looking for. They count LLC hits where the cross-core (other core's L2) snoop was a hit (clean CL), hitm (dirty CL), miss, or none (no snoop needed). Similar to the above events, they count only load operations.

MEM_LOAD_UOPS_MISC_RETIRED counts load micro-ops where the data source is not addressable. This might be anything that is attached to the cache hierarchy. As the name suggests, it counts only load operations.

I would use the OFFCORE_RESPONSE events; they have shown the highest correlation with an LRU model in my tests. For Sandy Bridge the OFFCORE_RESPONSE filter configs should be something like 0x10081 for all L3 accesses and 0x3fffc00081 for all L3 misses. See the SDM or https://download.01.org/perfmon/ for the bit definitions. According to my tests, they count both load and store operations.
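If you want to try this on Linux without programming the MSRs by hand, a rough perf_event_open sketch could look like the following. This is only an illustration, not something I have run on your machine: it assumes the raw encoding event=0xB7, umask=0x01 with the filter value passed in config1 (which the kernel then writes into the MSR_OFFCORE_RSP register), and the store loop and helper name are made up for the example.

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

#define N 1000000
static double a[N];

/* Open one OFFCORE_RESPONSE event; the kernel assigns it to one of the two
   OFFCORE_RSP MSRs and programs the filter value given in config1. */
static int open_offcore(uint64_t offcore_rsp)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config  = 0x01B7;       /* OFFCORE_RESPONSE: event 0xB7, umask 0x01 */
    attr.config1 = offcore_rsp;  /* filter value for MSR_OFFCORE_RSP_x */
    attr.disabled = 1;
    attr.exclude_kernel = 1;     /* user-space only */
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0); /* this thread, any CPU */
}

int main(void)
{
    int fd_ref  = open_offcore(0x10081ULL);       /* "all L3 accesses" from above */
    int fd_miss = open_offcore(0x3fffc00081ULL);  /* "all L3 misses" from above   */
    if (fd_ref < 0 || fd_miss < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd_ref,  PERF_EVENT_IOC_RESET,  0);
    ioctl(fd_miss, PERF_EVENT_IOC_RESET,  0);
    ioctl(fd_ref,  PERF_EVENT_IOC_ENABLE, 0);
    ioctl(fd_miss, PERF_EVENT_IOC_ENABLE, 0);

    for (long i = 0; i < N; i++)                  /* store-only test kernel */
        a[i] = 98.3;

    ioctl(fd_ref,  PERF_EVENT_IOC_DISABLE, 0);
    ioctl(fd_miss, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t ref = 0, miss = 0;
    read(fd_ref,  &ref,  sizeof(ref));
    read(fd_miss, &miss, sizeof(miss));
    printf("L3 accesses: %llu  L3 misses: %llu  (%.2f %%)\n",
           (unsigned long long)ref, (unsigned long long)miss,
           ref ? 100.0 * (double)miss / (double)ref : 0.0);
    return 0;
}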

 

GHui
Novice

There are OFFCORE_RESPONSE.* events, but their event codes and umasks are all the same.

https://download.01.org/perfmon/BDW/Broadwell_core_V17.json

https://download.01.org/perfmon/HSW/Haswell_core_V24.json

GHui
Novice

 

    >>as far as my tests have shown

I am interested in the test method. Thanks for sharing.

Thomas_G_4
New Contributor II

The event codes and umasks for the two OFFCORE_RESPONSE.* events are the same on both architectures, that's true, but you have to specify what exactly should be counted in an extra register (MSR_OFFCORE_RSP_0 and MSR_OFFCORE_RSP_1). The meaning of the bits can be found in https://download.01.org/perfmon/BDW/Broadwell_matrix_bit_definitions_V17.json and https://download.01.org/perfmon/HSW/Haswell_matrix_bit_definitions_V24.json.

I'm using a memory copy benchmark (written in assembly) and measure it with the LONGEST_LAT_CACHE.* events. At first glance they seem accurate, as the data volume (event_result*linesize) is completely in line with the data volume calculated by the benchmark. But for a memory copy like 'for i in 0..n do a[i] = b[i] done', where 'a' was not used before, there is a write-allocate/read-for-ownership for each write operation. Consequently, for each written CL you also see a loaded CL, and the measured and derived result should be 1.5 times the value the benchmark calculates from its input sizes (Read+Write+RFO instead of Read+Write).
You can also see this with a store-only benchmark, which should measure twice the data volume of the input vector (Write+RFO instead of only Write).
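To put numbers on it, here is a trivial back-of-the-envelope check of the factors I mean (no measurement involved; the line count is arbitrary and just for illustration):

#include <stdio.h>

int main(void)
{
    const double n = 1.0e6;                 /* cache lines touched per array       */
    double copy_expected  = 2.0 * n;        /* benchmark counts Read + Write       */
    double copy_seen      = 3.0 * n;        /* hardware sees Read + RFO + Write    */
    double store_expected = 1.0 * n;        /* benchmark counts Write only         */
    double store_seen     = 2.0 * n;        /* hardware sees RFO + Write           */
    printf("copy : %.1fx\n", copy_seen  / copy_expected);   /* 1.5x */
    printf("store: %.1fx\n", store_seen / store_expected);  /* 2.0x */
    return 0;
}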

GHui
Novice

How could I understand read, write, load, store, and RFO?

GHui
Novice
 

I have tested the code on an "Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz", whose cache size is 15360 KB.

I use the OFFCORE_RESPONSE events to calculate the last-level cache miss rate.
I think that with an array size of 256000 there should be no cache misses, but there are.
Is there any mistake I have made?

 

#include <stdio.h>
#include <string.h>

#define MAX_SIZE 256000

double data[MAX_SIZE];
//double from[MAX_SIZE];

//gcc a.c -g -fopenmp
int main(int argc, char **argv)
{
    memset(data, 0, MAX_SIZE*sizeof(double));
    //memset(from, 0, MAX_SIZE*sizeof(double));
    int i;
    while (1)
    {
#pragma omp parallel for
        for (i = 0; i < MAX_SIZE; i++)
        {
            //data[i] = from[i];
            data[i] = 98.3;
        }
    }
}

 

The top info is as follows:

top - 18:23:29 up  6:30,  4 users,  load average: 9.95, 9.23, 6.74
Tasks: 325 total,   2 running, 323 sleeping,   0 stopped,   0 zombie
%Cpu(s): 99.9 us,  0.1 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 57517004 total, 56440840 free,   700916 used,   375248 buff/cache
KiB Swap: 28901372 total, 28901372 free,        0 used. 56333992 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 9177 root      20   0  199160   2860    540 R  2400  0.0  11:16.76 a.out

 

And the last-level cache miss output is as follows:

LLCM: 3045585.000000 / 12458452.000000 = 24.445934 %
LLCM: 3040797.000000 / 12449185.000000 = 24.425671 %
LLCM: 3046373.000000 / 12474775.000000 = 24.420264 %
LLCM: 3043498.000000 / 12467793.000000 = 24.410880 %
LLCM: 3037478.000000 / 12465265.000000 = 24.367537 %
LLCM: 3052085.000000 / 12479500.000000 = 24.456789 %
LLCM: 3046219.000000 / 12549413.000000 = 24.273797 %
LLCM: 3047199.000000 / 12548870.000000 = 24.282657 %
LLCM: 3042909.000000 / 12538601.000000 = 24.268329 %
LLCM: 3034829.000000 / 12514288.000000 = 24.250912 %
LLCM: 3045371.000000 / 12499838.000000 = 24.363284 %
LLCM: 3045459.000000 / 12524622.000000 = 24.315776 %
LLCM: 3044522.000000 / 12518422.000000 = 24.320334 %
LLCM: 3065239.000000 / 12461541.000000 = 24.597592 %
LLCM: 3527701.000000 / 18078440.000000 = 19.513304 %
LLCM: 2928598.000000 / 12288175.000000 = 23.832652 %
LLCM: 2920311.000000 / 12285334.000000 = 23.770709 %

 

McCalpinJohn
Honored Contributor III

It is difficult to know what to expect here without knowing exactly which bits you programmed into the auxiliary MSRs for the OFFCORE_RESPONSE events....

In general, Intel performance counter events separate cache line traffic due to "demand" accesses from the traffic due to hardware prefetch accesses.  If one of the L2 Hardware Prefetchers brings the data into the L3 before the "demand" load (or store miss) access, then the "demand" load (or store miss) will count as an L3 hit -- even though the data actually came from memory.  (I.e., it was not in the L3 cache because of prior use, it was in the L3 cache because of the hardware prefetcher.)

The "Xeon E5-2620 (2.0 GHz)" processor is a Sandy Bridge EP, so the bit definitions for the OFFCORE_RESPONSE events are at https://download.01.org/perfmon/SNB/SandyBridge_matrix_bit_definitions_V15.json  ; This reference does not describe how to use the events -- this information is in Section 18.9.5 of Volume 3 of the Intel Architectures Software Developer's Manual (document 325384, revision 060, September 2016).  Not all of the bit combinations make sense, and not all of the bit combinations that do make sense actually count correctly.  Some examples of events specifically intended for the Sandy Bridge EP processors are in Table 19-15 near the end of Section 19.6 of Volume 3 of the Intel Architectures Software Developer's Manual.

Looking at these events makes it clear that the definition of "last-level cache miss" is not trivial.  Last-level cache accesses include many different kinds of operations: DEMAND_DATA_RD, DEMAND_RFO, DEMAND_CODE_RD, PF_L2_DATA_RD, PF_L2_RFO, PF_L2_CODE_RD, PF_LLC_DATA_RD, PF_LLC_RFO, PF_LLC_CODE_RD, and possibly "OTHER".  Many of these operations are not documented in detail, and require significant research to understand....  Even with significant research, it is not clear that it is possible to understand all of the details -- especially given the rather large number of bugs and idiosyncrasies in the counters.  

For this specific piece of code, all of the memory accesses in the main loop are stores, so if the compiler generates "normal" store instructions, the transactions generated should be overwhelmingly dominated by:

  • DEMAND_RFO: "Read For Ownership", i.e., a store that misses in the L2 cache,
  • PF_L2_RFO: an L2 hardware prefetch that brings a copy of the cache line into the L2 cache with write permission,
    • Note that because the L3 cache on this processor is inclusive, this prefetch will also leave a copy of the line in the L3.
  • PF_LLC_RFO: an L2 hardware prefetch that brings a copy of the cache line into the L3 cache with write permission.

Each of these can be measured independently by setting the appropriate bits in the auxiliary MSR(s) for the OFFCORE_RESPONSE event(s).
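For illustration, here is one way those request-type bits could be combined into MSR_OFFCORE_RSP values. The bit positions are my reading of the Sandy Bridge EP bit-definition file linked above, and the 0x3FFFC00000 response mask is simply the "all L3 miss" pattern quoted earlier in this thread -- verify both against the SDM for your stepping before relying on them.

#include <stdio.h>
#include <stdint.h>

/* Request-type bits (low 16 bits of MSR_OFFCORE_RSP_x), Sandy Bridge EP: */
#define REQ_DMND_DATA_RD  (1ULL << 0)
#define REQ_DMND_RFO      (1ULL << 1)   /* demand store miss (Read For Ownership) */
#define REQ_PF_L2_RFO     (1ULL << 5)   /* L2 HW prefetch, RFO */
#define REQ_PF_LLC_RFO    (1ULL << 8)   /* LLC HW prefetch, RFO */

/* Response bits: */
#define RSP_ANY           (1ULL << 16)      /* any response */
#define RSP_ALL_L3_MISS   0x3FFFC00000ULL   /* "all L3 miss" pattern from earlier in the thread */

int main(void)
{
    printf("DEMAND_RFO, any response : 0x%llx\n", (unsigned long long)(REQ_DMND_RFO   | RSP_ANY));
    printf("DEMAND_RFO, L3 misses    : 0x%llx\n", (unsigned long long)(REQ_DMND_RFO   | RSP_ALL_L3_MISS));
    printf("PF_L2_RFO,  L3 misses    : 0x%llx\n", (unsigned long long)(REQ_PF_L2_RFO  | RSP_ALL_L3_MISS));
    printf("PF_LLC_RFO, L3 misses    : 0x%llx\n", (unsigned long long)(REQ_PF_LLC_RFO | RSP_ALL_L3_MISS));
    return 0;
}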

A useful tool in understanding the hardware performance counters is disabling the hardware prefetchers.  The procedure is described at https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors  ; Remember to re-enable the hardware prefetchers before you hand the system over to someone else, or they will see unexpected performance reductions on most codes.....

For this code, the Intel C compiler may generate "normal" store instructions or "streaming" store instructions.  If you need explicit control over this, you can use the "-qopt-streaming-stores never" or "-qopt-streaming-stores always" compiler flags, or you can use the "#pragma vector temporal" or "#pragma vector nontemporal" pragmas immediately before the main loop to control streaming stores for this loop alone.  

If the code is compiled with streaming stores there should not be any prefetches.
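For reference, here is a minimal sketch (my own illustration, not necessarily what the compiler generates) of the store loop written with explicit streaming-store intrinsics, assuming SSE2 and a suitably aligned array:

#include <emmintrin.h>

#define MAX_SIZE 256000
static double data[MAX_SIZE] __attribute__((aligned(64)));

void fill_streaming(void)
{
    const __m128d v = _mm_set1_pd(98.3);
    int i;
    for (i = 0; i < MAX_SIZE; i += 2)
        _mm_stream_pd(&data[i], v);   /* non-temporal store: no RFO, bypasses the cache hierarchy */
    _mm_sfence();                     /* make the streaming stores globally visible */
}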

GHui
Novice

I set the bits as follows.

First,
    addr=0x38f; val=0x70000000f;   // IA32_PERF_GLOBAL_CTRL: enable PMC0-3 and the fixed counters
    pwrite(llcmiss->msr_fd,(void*)&val,sizeof(uint64),addr);
    addr=0x1A6; val=0x3fffc00081;  // MSR_OFFCORE_RSP_0: all L3 misses
    pwrite(llcmiss->msr_fd,(void*)&val,sizeof(uint64),addr);
    addr=0x1A7; val=0x10081;       // MSR_OFFCORE_RSP_1: all L3 accesses
    pwrite(llcmiss->msr_fd,(void*)&val,sizeof(uint64),addr);
Second,
    addr=0x186; val=0x6301b7;      // IA32_PERFEVTSEL0: event 0xB7, umask 0x01, USR+OS+ANY+EN
    pwrite(llcmiss->msr_fd,(void*)&val,sizeof(uint64),addr);
    addr=0x187; val=0x6301bb;      // IA32_PERFEVTSEL1: event 0xBB, umask 0x01, USR+OS+ANY+EN
    pwrite(llcmiss->msr_fd,(void*)&val,sizeof(uint64),addr);
Last,
    addr=0xc1;                     // IA32_PMC0
    pread(llcmiss->msr_fd,(void*)&val,sizeof(uint64),addr);
    addr=0xc2;                     // IA32_PMC1
    pread(llcmiss->msr_fd,(void*)&val,sizeof(uint64),addr);

Thomas_G_4
New Contributor II

Why do you start the counters before programming them?

addr=0x38f;val=0x70000000f; // Global Ctrl
pwrite(llcmiss->msr_fd,(void*)&val,sizeof(uint64),addr);

I would do that after all counters are programmed. Moreover, I would clear the counter registers 0xC1 and 0xC2 before any measurements.

Also, your event configuration might count cache misses that you are not expecting. The configuration 0x6301b7 translates into 'count in user-space', 'count in kernel-space', and 'count the whole core, not just the current thread'. Since you run on every thread anyway, you can leave out the 'ANY' bit. Furthermore, you want reliable measurements from user-space, so don't set the 'OS' bit. This results in the configuration 0x4101B7.
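Something along these lines, reusing the pwrite()-on-msr-device style from your snippet (the wrapper function is only for illustration; the register addresses and values are the ones already discussed in this thread, with the OFFCORE_RESPONSE_1 event select adjusted the same way to 0x4101BB):

#include <stdint.h>
#include <unistd.h>

/* msr_fd is an open file descriptor of the per-CPU msr device, as in your code. */
static void wrmsr_fd(int msr_fd, off_t addr, uint64_t val)
{
    pwrite(msr_fd, &val, sizeof(val), addr);
}

static void program_llc_counters(int msr_fd)
{
    wrmsr_fd(msr_fd, 0xC1, 0);                 /* 1. clear IA32_PMC0 */
    wrmsr_fd(msr_fd, 0xC2, 0);                 /*    clear IA32_PMC1 */

    wrmsr_fd(msr_fd, 0x1A6, 0x3fffc00081ULL);  /* 2. MSR_OFFCORE_RSP_0: all L3 misses   */
    wrmsr_fd(msr_fd, 0x1A7, 0x10081ULL);       /*    MSR_OFFCORE_RSP_1: all L3 accesses */

    wrmsr_fd(msr_fd, 0x186, 0x4101B7);         /* 3. PERFEVTSEL0: event 0xB7, umask 0x01, USR+EN */
    wrmsr_fd(msr_fd, 0x187, 0x4101BB);         /*    PERFEVTSEL1: event 0xBB, umask 0x01, USR+EN */

    wrmsr_fd(msr_fd, 0x38F, 0x70000000fULL);   /* 4. only now enable via IA32_PERF_GLOBAL_CTRL */
}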

GHui
Novice

I changed the code to start the counters after programming them, cleared 0xC1 and 0xC2, and set the event configuration to 0x4101B7.
I also changed the platform, because I no longer have a Sandy Bridge machine available.
I use the "Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz", CPUID signature 06_3FH, whose cache size is 35840 KB.

I set 0x1A6/0x1A7 to 0x3FB80083B3/0x183B3.

And testing the following code, I think I get the wrong last-level cache miss numbers.

#include <stdio.h>
#include <string.h>

#define MAX_SIZE 256000
//#define MAX_SIZE 128000000

double data[MAX_SIZE];
//double from[MAX_SIZE];

//gcc a.c -g -fopenmp
int main(int argc, char **argv)
{
    memset(data, 0, MAX_SIZE*sizeof(double));
    //memset(from, 0, MAX_SIZE*sizeof(double));
    int i;
    while (1)
    {
//#pragma vector nontemporal
//#pragma vector temporal
#pragma omp parallel for
        for (i = 0; i < MAX_SIZE; i++)
        {
            //data[i] = from[i];
            data[i] = 98.3;
        }
    }
}

 

When I set MAX_SIZE to 256000,

LLCM: 3738864.000000 / 9641428.000000 = 38.779152 %
LLCM: 15992645.000000 / 36279441.000000 = 44.081840 %
LLCM: 16836404.000000 / 38185340.000000 = 44.091277 %
LLCM: 16862128.000000 / 38506794.000000 = 43.790008 %
LLCM: 16223989.000000 / 37761357.000000 = 42.964529 %
LLCM: 16770734.000000 / 37366655.000000 = 44.881550 %
LLCM: 16823011.000000 / 38083658.000000 = 44.173832 %
LLCM: 16894080.000000 / 38342697.000000 = 44.060750 %

 

When I set MAX_SIZE to 1000000,

LLCM: 2321497.000000 / 249327163.000000 = 0.931105 %
LLCM: 6367890.000000 / 710201213.000000 = 0.896632 %
LLCM: 6898418.000000 / 784127893.000000 = 0.879757 %
LLCM: 6117948.000000 / 759010308.000000 = 0.806043 %
LLCM: 5638300.000000 / 785839100.000000 = 0.717488 %
LLCM: 5633062.000000 / 783191002.000000 = 0.719245 %
LLCM: 5613904.000000 / 782288572.000000 = 0.717626 %

 

When I set MAX_SIZE to 128000000,

LLCM: 46675655.000000 / 135001381.000000 = 34.574206 %
LLCM: 51380415.000000 / 197569224.000000 = 26.006285 %
LLCM: 33915341.000000 / 238431325.000000 = 14.224365 %
LLCM: 20534692.000000 / 339244397.000000 = 6.053067 %
LLCM: 1896042.000000 / 366751607.000000 = 0.516983 %
LLCM: 1815386.000000 / 367106360.000000 = 0.494512 %
LLCM: 6534438.000000 / 352507724.000000 = 1.853701 %
LLCM: 15052420.000000 / 338103992.000000 = 4.452009 %
LLCM: 19739952.000000 / 356325224.000000 = 5.539869 %
LLCM: 19054206.000000 / 344769655.000000 = 5.526648 %
LLCM: 34206840.000000 / 342930400.000000 = 9.974864 %
LLCM: 30453021.000000 / 355745125.000000 = 8.560348 %

 

McCalpinJohn
Honored Contributor III

I don't think that the auxiliary MSR 0x1a6 is configured to count what you think it is supposed to count....

From the tables at https://download.01.org/perfmon/HSW/Haswell_matrix_bit_definitions_V24.json, the value in MSR 0x1A6 looks like:

  • Bits 0:15 -- 0x0081 -->
    • bit 7 is set: PF_L3_DATA_RD -- HW prefetches that bring data into the L3,
    • bit 0 is set: DMND_DATA_RD
    • This does not include any RFO transactions, which should be the transactions dominating this loop.
  • Bits 30:16 -- 0xFFC0 --> bits 30:22 set.
    • This includes many reserved bits according to Table 18-50 of V3 of the SWDM and the web reference above. 
    • On the other hand, this does match the usage by Intel's VTune for L3 misses, so it may be OK.
  • Bits 37:31 -- 0x7F: ANY_SNOOP -- OK

It looks like the tool is running in the background at some fixed interval.  When trying to understand the counters I recommend that you read them before and after the test loop so that you can compare the measured counts to the expected counts.  With the hardware prefetchers disabled, I would expect:

  • MAX_SIZE=256000
    • 256000*8/64 = 32000 L3 accesses
    • 0 L3 misses
  • MAX_SIZE=1000000
    • 1000000*8/64 = 125000 L3 accesses
    • 0 L3 misses
  • MAX_SIZE=128,000,000
    • 128,000,000*8/64 = 16,000,000 L3 accesses
    • 16,000,000 L3 misses

Deviations from these naive expected values are where things start to get interesting....
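For example, here is a rough sketch of what I mean by reading the counters immediately around the test loop, reusing the pread()-on-msr-device approach from the earlier posts (the function and variable names are only illustrative, and the counters are assumed to be programmed as discussed above):

#include <stdint.h>
#include <unistd.h>

static uint64_t read_pmc(int msr_fd, off_t addr)
{
    uint64_t val = 0;
    pread(msr_fd, &val, sizeof(val), addr);
    return val;
}

/* One pass over the store loop, with the counter deltas returned to the caller. */
void measure_one_pass(int msr_fd, double *data, long n,
                      uint64_t *l3_misses, uint64_t *l3_accesses)
{
    uint64_t m0 = read_pmc(msr_fd, 0xC1);   /* IA32_PMC0: programmed for L3 misses   */
    uint64_t a0 = read_pmc(msr_fd, 0xC2);   /* IA32_PMC1: programmed for L3 accesses */

    for (long i = 0; i < n; i++)
        data[i] = 98.3;

    *l3_misses   = read_pmc(msr_fd, 0xC1) - m0;
    *l3_accesses = read_pmc(msr_fd, 0xC2) - a0;
}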
