Hello all,
I need some verification/clarification of whether the functions I'm using to measure L2 and L3 cache hit ratios are set up properly. The system on which we are measuring PMCs is based on the Intel Xeon D-1500 series.
Perf can't be used on our system, unfortunately, as we don't have a Linux OS running on the CPU.
void setup_pmc(void)
{
    write_msr(IA32_PERF_GLOBAL_CTRL, 0x0);
    write_msr(IA32_FIXED_CTR_CTRL, 0x333);

    write_msr(IA32_PMC0, 0x0);
    /* Event 2EH, Umask 41H: accesses to the LLC in which the data is not present */
    write_msr(IA32_PERFEVTSEL0, 0x43412e);

    write_msr(IA32_PMC1, 0x0);
    /* Event D2H, Umask 08H: retired load instructions whose data sources were hits in L3 without snoops required */
    write_msr(IA32_PERFEVTSEL1, 0x4308d2);

    write_msr(IA32_PMC2, 0x0);
    /* Event D2H, Umask 04H: retired load instructions whose data sources were HitM responses from shared L3 */
    write_msr(IA32_PERFEVTSEL2, 0x4304d2);

    write_msr(IA32_PMC3, 0x0);
    /* Event D1H, Umask 02H: retired load instructions with L2 cache hits as data sources */
    write_msr(IA32_PERFEVTSEL3, 0x4302d1);

    write_msr(IA32_PERF_GLOBAL_CTRL, 0x70000000f);
}
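As a sanity check on the IA32_PERFEVTSELx values above, here is a small sketch of a helper that builds the register value from the event code and umask, assuming the architectural bit layout (USR bit 16, OS bit 17, EN bit 22) from the Intel SDM; the `perfevtsel` name is just illustrative, and `write_msr` remains your platform-specific routine:

```c
#include <stdint.h>

#define PERFEVTSEL_USR (1ULL << 16) /* count in user mode */
#define PERFEVTSEL_OS  (1ULL << 17) /* count in kernel mode */
#define PERFEVTSEL_EN  (1ULL << 22) /* enable the counter */

/* Build an IA32_PERFEVTSELx value for a given event code and umask,
 * counting in both user and OS mode, with the counter enabled. */
static uint64_t perfevtsel(uint8_t event, uint8_t umask)
{
    return PERFEVTSEL_EN | PERFEVTSEL_OS | PERFEVTSEL_USR
         | ((uint64_t)umask << 8) | (uint64_t)event;
}
```

For example, `perfevtsel(0x2e, 0x41)` reproduces the 0x43412e value written to IA32_PERFEVTSEL0 above, which makes the encodings easier to audit than hard-coded hex constants.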
This is how the counters map to the events used in the formulas below:
PMC0: L3Miss
PMC1: L3UnsharedHit
PMC2: LLCHitM
PMC3: L2Hit
Calculation formulas/functions:
L2 cache hit ratio:
uint64 all = L2Hit + LLCHitM + L3UnsharedHit + L3Miss;
if (all) return (double)(L2Hit) / (double)(all);
L3 cache hit ratio:
uint64 hits = L3UnsharedHit + LLCHitM;
uint64 all = LLCHitM + L3UnsharedHit + L3Miss;
if (all) return (double)(hits) / (double)(all);
getL3CacheMisses:
return L3Miss;
getL2CacheMisses:
return LLCHitM + L3UnsharedHit + L3Miss;
getL3CacheHitsNoSnoop:
return L3UnsharedHit;
getL3CacheHitsSnoop:
return LLCHitM;
getL3CacheHits:
return LLCHitM + L3UnsharedHit;
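For reference, the two ratio formulas above can be sketched as plain C; the arguments are assumed to be the raw counts read back from IA32_PMC0..IA32_PMC3, and the function names are illustrative:

```c
#include <stdint.h>

/* L2 hit ratio: L2 hits over all loads that went past L1
 * (L2 hits plus everything that had to go to L3 or beyond). */
static double l2_hit_ratio(uint64_t l2_hit, uint64_t llc_hitm,
                           uint64_t l3_unshared_hit, uint64_t l3_miss)
{
    uint64_t all = l2_hit + llc_hitm + l3_unshared_hit + l3_miss;
    return all ? (double)l2_hit / (double)all : 0.0;
}

/* L3 hit ratio: L3 hits (snooped + unsnooped) over all L2 misses. */
static double l3_hit_ratio(uint64_t llc_hitm, uint64_t l3_unshared_hit,
                           uint64_t l3_miss)
{
    uint64_t hits = l3_unshared_hit + llc_hitm;
    uint64_t all  = hits + l3_miss;
    return all ? (double)hits / (double)all : 0.0;
}
```

Guarding on `all` avoids a division by zero when no qualifying loads were counted during the sampling window.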
Thank you!
Br,
Nikola
The events that are most likely to cause the core to stall are demand loads that miss in the first levels of the cache. HW prefetches are expected to miss, so combining the two types can be confusing if you are looking for causes of stalls. (Combining all the types is required if you are trying to measure bulk traffic, which is the relevant measure if you think that the code is bandwidth-limited at some level of the memory hierarchy.)
There have been a number of forum discussions on using performance counters to help understand memory-hierarchy-related "stalls". Some discussion and links are at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/514733
The event CYCLE_ACTIVITY.STALLS_TOTAL (Event 0xA3, Umask 0x04) gives the total number of cycles in which no uops are dispatched to any functional unit. This count should be compared to the total cycle count of the job, and if it is small, then you can declare victory and move on. If it is not small, then some variants can help you understand the role of the various levels of the cache in these stalls:
These events measure correlation, not causation, but the longer the load latency, the bigger the probability that any coincident stalls were actually caused by the latency of the memory access. The "STALLS_L1D_PENDING" event is usually only a little bit larger than the "STALLS_L2_PENDING" event -- the core can usually find work to do during an L1Miss/L2Hit event. These events have changed over processor generations, so the details may require some testing to confirm. NOTE that these events require setting the CMASK field in the PerfEvtSel register to the same value as the Umask -- this is a special "trick" that they use to enable the counter to operate with logical AND functionality (rather than the logical OR used by the standard Umask mechanism).
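To illustrate that CMASK "trick", here is a sketch of the encoding for CYCLE_ACTIVITY.STALLS_TOTAL (Event 0xA3, Umask 0x04, CMASK 4), assuming the IA32_PERFEVTSELx bit layout from the Intel SDM (CMASK in bits 24-31); the exact event and umask values should still be confirmed against your processor generation's event list:

```c
#include <stdint.h>

#define PES_USR (1ULL << 16) /* count in user mode */
#define PES_OS  (1ULL << 17) /* count in kernel mode */
#define PES_EN  (1ULL << 22) /* enable the counter */

/* Build an IA32_PERFEVTSELx value with a counter mask. With CMASK != 0,
 * the counter increments only on cycles where at least `cmask` qualifying
 * events occur, which is what turns the umask bits into an AND condition
 * for the CYCLE_ACTIVITY events. */
static uint64_t perfevtsel_cmask(uint8_t event, uint8_t umask, uint8_t cmask)
{
    return ((uint64_t)cmask << 24)
         | PES_EN | PES_OS | PES_USR
         | ((uint64_t)umask << 8) | (uint64_t)event;
}
```

So for STALLS_TOTAL you would program `perfevtsel_cmask(0xa3, 0x04, 0x04)`, with CMASK set equal to the Umask as described above.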
Many topics here.....
Thanks for the detailed reply, John!
Yep, we have the application pinned to the CPU core where we are measuring the counters.
The reason I'm using PMCs to look at the cache is to see whether the application is using the cache in the best way. Could you help identify which 0x24 L2_RQSTS event would be best in this scenario? I was thinking of using L2_RQSTS.REFERENCES and L2_RQSTS.MISS; dividing those should give me the ratio, right?
Thanks for pointing out the L3 reference document. For the L3 usage ratio, I guess LLC_LOOKUP (Event 0x34) with umask ANY would be OK, but for LLC_VICTIMS (Event 0x37) I'm not sure which umask to pick?
Thanks!