NStoj
Beginner

Performance counters for measuring L2 and L3 Hit ratios


Hello all,

I would like some verification/clarification that the functions I'm using to measure L2 and L3 cache hit ratios are indeed set up properly. The system on which we are measuring the PMCs is based on the Intel Xeon D-1500 series.

Unfortunately, perf can't be used on our system, as we don't have a Linux OS running on the CPU.

void setup_pmc(void)
{
    /* Disable all counters while programming them. */
    write_msr(IA32_PERF_GLOBAL_CTRL, 0x0);

    /* Enable the three fixed-function counters. */
    write_msr(IA32_FIXED_CTR_CTRL, 0x333);

    /* Event 2EH, Umask 41H: accesses to the LLC in which the data is not present. */
    write_msr(IA32_PMC0, 0x0);
    write_msr(IA32_PERFEVTSEL0, 0x43412e);

    /* Event D2H, Umask 08H: retired load instructions whose data sources were hits
       in L3 without snoops required. */
    write_msr(IA32_PMC1, 0x0);
    write_msr(IA32_PERFEVTSEL1, 0x4308d2);

    /* Event D2H, Umask 04H: retired load instructions whose data sources were HitM
       responses from a shared L3. */
    write_msr(IA32_PMC2, 0x0);
    write_msr(IA32_PERFEVTSEL2, 0x4304d2);

    /* Event D1H, Umask 02H: retired load instructions with L2 cache hits as data
       sources. */
    write_msr(IA32_PMC3, 0x0);
    write_msr(IA32_PERFEVTSEL3, 0x4302d1);

    /* Re-enable: fixed counters 0-2 and programmable counters 0-3. */
    write_msr(IA32_PERF_GLOBAL_CTRL, 0x70000000f);
}

These are the counter assignments and the formulas used to compute hits and ratios:

PMC0: L3Miss
PMC1: L3UnsharedHit
PMC2: LLCHitM
PMC3: L2Hit

Calculation formulas/functions:
        L2 cache hit ratio:
            uint64 all = L2Hit + LLCHitM + L3UnsharedHit + L3Miss;
            if (all) return (double)(L2Hit) / (double)(all);

        L3 cache hit ratio:
            uint64 hits = L3UnsharedHit + LLCHitM;
            uint64 all = LLCHitM + L3UnsharedHit + L3Miss;
            if (all) return (double)(hits) / (double)(all);

        getL3CacheMisses:
            return L3Miss;

        getL2CacheMisses:
            return LLCHitM + L3UnsharedHit + L3Miss;

        getL3CacheHitsNoSnoop:
            return L3UnsharedHit;

        getL3CacheHitsSnoop:
            return LLCHitM;

        getL3CacheHits:
            return LLCHitM + L3UnsharedHit;
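For reference, the ratio formulas above can be written out as plain C, assuming the four counter values have already been read out of PMC0-PMC3 (the function and variable names here are hypothetical, mirroring the pseudocode):

```c
#include <stdint.h>

/* L2 hit ratio: L2 hits divided by all retired-load data sources at or
 * beyond the L2 (hypothetical helper, following the pseudocode above). */
static double l2_hit_ratio(uint64_t l2_hit, uint64_t llc_hitm,
                           uint64_t l3_unshared_hit, uint64_t l3_miss)
{
    uint64_t all = l2_hit + llc_hitm + l3_unshared_hit + l3_miss;
    return all ? (double)l2_hit / (double)all : 0.0;
}

/* L3 hit ratio: both snooped (HitM) and unsnooped L3 hits count as hits. */
static double l3_hit_ratio(uint64_t llc_hitm, uint64_t l3_unshared_hit,
                           uint64_t l3_miss)
{
    uint64_t hits = l3_unshared_hit + llc_hitm;
    uint64_t all  = hits + l3_miss;
    return all ? (double)hits / (double)all : 0.0;
}
```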

Thank you!

Br,

Nikola



3 Replies
McCalpinJohn
Black Belt

Many topics here.....

  • Generally speaking, you are better off if you can program the performance counters yourself, since that is the only way you can actually control what is happening or find any documentation.
    • BUT, since there is no "virtualization" of the counters, you absolutely must pin any threads reading the performance counters.  Otherwise it makes no sense to compute differences between "before" and "after" counts.
  • The 0xD1 MEM_LOAD_UOPS_RETIRED and 0xD2 MEM_LOAD_UOPS_L3_HIT_RETIRED events only increment for *loads* (not stores or hardware prefetches) that hit in selected levels of the cache.
    • These two events have bugs in many implementations -- including Xeon D 1500.  
    • These are documented in the "Intel® Xeon® Processor D-1500, Intel® Xeon® Processor D-1500 NS, and Intel® Xeon® Processor D-1600 NS Product Families Specification Update", document 332054-019, August 2019.
    • Errata BDE103 says that events 0xD1, 0xD2, 0xD3, and 0xCD may undercount by as much as 20%.
  • The 0x2E LONGEST_LATENCY_CACHE events are "architectural" -- meaning that there is an attempt to make the meaning "consistent" across generations.  Because the underlying implementation can change, the meaning of the event on a specific model is not always clear.
    • Search for "2EH" in Chapter 19 of Volume 3 of the Intel Architectures SW Developer's Manual (document 325384) and you will see how the definition of the event varies across processor models.  
  • Generally this event counts core-originated accesses that miss in the L2, including instruction fetches, data loads, data stores, and L1 HW prefetches.  On most platforms it does not count L2 HW prefetches (because they are not "core-originated"), but L2 HW prefetches are explicitly included in the documentation for this event on SKX/CLX, and I can't figure out how both statements can be true.  There are other possible access mechanisms, such as HW TLB walks, that are not mentioned in the description of this event on any platform.
  • As you might guess, I don't recommend using either of the performance counter event families listed above.
  • The events that you want will depend on what it is you want to measure.
    • Measuring *traffic* is different than measuring misses, because traffic includes data motion due to instruction fetches, loads, stores, cache writebacks, TLB walks, hardware prefetches, and (on some platforms) IO.
    • Cache hit/miss rates for load operations are usually (but not always) more important than cache hit/miss rates for store operations.  
  • For the L2, I recommend:
    • Event 0x24 L2_RQSTS has UMask values that allow you to count L2 hits/misses/accesses for code reads, demand loads, demand stores, and L2 HW prefetch operations.  There are 8 UMask values documented in section 19.6 of Volume 3 of the Intel Architectures SW Developer's Manual, and more variations are documented at https://download.01.org/perfmon/BDW/broadwell_core_v25.json
  • The L3 is trickier because it is cut up into "slices", with addresses distributed around the slices using an undocumented hash function.
    • The L3 has its own "uncore" performance monitoring infrastructure, documented in https://www.intel.com/content/www/us/en/products/docs/processors/xeon/xeon-d-1500-uncore-performance-monitoring.html
    • There is a lot of complicated stuff in there, but the events LLC_LOOKUP and LLC_VICTIMS are probably enough to answer your hit rate questions.   Note that accesses from each core are distributed across all of the LLC slices, so extra effort is required to attribute counts to specific cores.  Using the "Cn_MSR_PMON_BOX_FILTER0" register you can select the cache state(s) that you want to count, and can also limit counting to events relating to a specific core (when that makes sense).
  • I almost always track traffic at the memory controller as well -- described in Section 2.5 of the Xeon D uncore performance monitoring guide.  
    • The IMC performance counters tend to be very reliable and unambiguous, so if there is a difference between the IMC CAS_COUNT.RD count and the L3 miss count, I tend to believe the IMC event.
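As a sketch of how the recommended L2_RQSTS events would be programmed in the same `write_msr` style as the original post: the helper below builds an IA32_PERFEVTSELx value from the event/umask/cmask fields using the standard architectural bit layout. The encoder name is hypothetical, and the 0xFF (REFERENCES) and 0x3F (MISS) umasks for Event 0x24 should be double-checked against the Broadwell event list linked above.

```c
#include <stdint.h>

/* Hypothetical encoder for an IA32_PERFEVTSELx value (architectural layout):
 * bits 0-7 event select, 8-15 unit mask, 16 USR, 17 OS, 22 EN, 24-31 CMASK. */
static uint64_t encode_perfevtsel(uint8_t event, uint8_t umask, uint8_t cmask)
{
    return (uint64_t)event
         | ((uint64_t)umask << 8)
         | (1ULL << 16)              /* USR: count in user mode   */
         | (1ULL << 17)              /* OS:  count in kernel mode */
         | (1ULL << 22)              /* EN:  enable the counter   */
         | ((uint64_t)cmask << 24);  /* CMASK (0 = count all)     */
}
```

With this encoder, `encode_perfevtsel(0x24, 0xFF, 0)` yields 0x43FF24 for L2_RQSTS.REFERENCES and `encode_perfevtsel(0x24, 0x3F, 0)` yields 0x433F24 for L2_RQSTS.MISS; the miss/reference quotient is then the L2 miss ratio. Note that the 0x43412E value in the original setup code encodes the same way (event 0x2E, umask 0x41, USR+OS+EN).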
NStoj
Beginner

Thanks for the detailed reply, John!

Yep, we have the application pinned to the CPU core where we are measuring the counters.

The reason I'm using PMCs to look at the cache is to see whether the application is using the cache in the best way. Could you help identify which 0x24 L2_RQSTS events would be best in this scenario? I was thinking of using L2_RQSTS.REFERENCES and L2_RQSTS.MISS; dividing those should give me the ratio, right?

Thanks for pointing out the L3 reference document. For the L3 usage ratio, I guess umask ANY would be OK for LLC_LOOKUP (Event 0x34), but for LLC_VICTIMS (Event 0x37) I'm not sure which umask to pick?

Thanks!

 

McCalpinJohn
Black Belt
Accepted Solution

The events that are most likely to cause the core to stall are demand loads that miss in the first levels of the cache.  HW prefetches are expected to miss, so combining the two types can be confusing if you are looking for causes of stalls.  (Combining all the types is required if you are trying to measure bulk traffic, which is the relevant measure if you think that the code is bandwidth-limited at some level of the memory hierarchy.)

There have been a number of forum discussions on using performance counters to help understand memory-hierarchy-related "stalls".   Some discussion and links are at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/514733 

The event CYCLE_ACTIVITY.STALLS_TOTAL (Event 0xA3, Umask 0x04) gives the total number of cycles in which no Uop(s) are dispatched to any functional unit.  This count should be compared to the total cycle count of the job, and if it is small, then you can declare victory and move on.  If it is not small, then some variants can help understand the role of various levels of the cache in these stalls:

  • CYCLE_ACTIVITY.STALLS_L1D_PENDING (Event 0xA3, Umask 0x08) gives the total number of cycles in which no Uop(s) are dispatched to any functional unit *and* there is at least one demand load that is waiting on an L1 cache miss.
  • CYCLE_ACTIVITY.STALLS_L2_PENDING (Event 0xA3, Umask 0x05) gives the total number of cycles in which no Uop(s) are dispatched to any functional unit *and* there is at least one demand load that is waiting on an L2 cache miss.   
  • CYCLE_ACTIVITY.STALLS_LDM_PENDING (Event 0xA3, Umask 0x06) gives the total number of cycles in which no Uop(s) are dispatched to any functional unit *and* there is at least one demand load that has not completed.

These events measure correlation, not causation, but the longer the load latency, the bigger the probability that any coincident stalls were actually caused by the latency of the memory access.  The "STALLS_L1D_PENDING" event is usually only a little bit larger than the "STALLS_L2_PENDING" event -- the core can usually find work to do during an L1Miss/L2Hit event.  These events have changed over processor generations, so the details may require some testing to confirm.  NOTE that these events require setting the CMASK field in the PerfEvtSel register to the same value as the Umask -- this is a special "trick" that they use to enable the counter to operate with logical AND functionality (rather than the logical OR used by the standard Umask mechanism).
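Concretely, the CMASK-equals-Umask trick means the control-register value for these events carries the umask in bits 24-31 as well as in bits 8-15. A hedged sketch, using a hypothetical encoder for the standard IA32_PERFEVTSELx layout (double-check the event/umask values against your model's event list):

```c
#include <stdint.h>

/* Hypothetical IA32_PERFEVTSELx encoder with USR+OS+EN set:
 * bits 0-7 event, 8-15 umask, 16 USR, 17 OS, 22 EN, 24-31 CMASK. */
static uint64_t encode_perfevtsel(uint8_t event, uint8_t umask, uint8_t cmask)
{
    return (uint64_t)event | ((uint64_t)umask << 8)
         | (1ULL << 16) | (1ULL << 17) | (1ULL << 22)
         | ((uint64_t)cmask << 24);
}

/* CYCLE_ACTIVITY.STALLS_L2_PENDING: Event 0xA3, Umask 0x05, and CMASK must
 * also be 0x05 to get the logical-AND counting described above. */
static uint64_t stalls_l2_pending_evtsel(void)
{
    return encode_perfevtsel(0xA3, 0x05, 0x05);
}
```

The resulting value, 0x054305A3, would then be written to an IA32_PERFEVTSELx MSR in the same way as the values in the original setup code.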
