
Skylake 4 cores L3 miss rate



I'm trying to measure the L3 miss rate, defined as L3_miss_rate = L3_miss / L2_miss, where L2_miss = L3_HIT + L3_MISS.

The L2_miss count can be obtained from the L2_LINES_IN event (data plus instruction misses), but there is no equivalent event for L3_miss. The workload that I study has very few instructions, so the instruction misses can be ignored.

To calculate the L3_miss count I need to take MEM_LOAD_RETIRED.L3_MISS and add the L3 misses caused by store operations. For L3 store misses I tried this event: OFFCORE_RESPONSE:request=DEMAND_RFO:response=L3_MISS_LOCAL_DRAM.ANY_SNOOP. The problem is that most of the time this counter has a higher value than L2_LINES_IN, which contradicts L2_miss = L3_HIT + L3_MISS.
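The arithmetic described above can be written down as a small check. This is only a sketch: the function name and its three parameters are hypothetical, and it assumes the raw counts have already been read from the three events named in this post.

```python
def l3_miss_rate(l2_lines_in, l3_load_misses, l3_rfo_misses):
    """Compute L3_miss / L2_miss from raw counter values.

    l2_lines_in    -- count of L2_LINES_IN (used as L2_miss)
    l3_load_misses -- count of MEM_LOAD_RETIRED.L3_MISS
    l3_rfo_misses  -- count of the OFFCORE_RESPONSE DEMAND_RFO L3-miss event
    """
    if l2_lines_in <= 0:
        raise ValueError("L2_LINES_IN must be positive")

    l3_miss = l3_load_misses + l3_rfo_misses

    # L2_miss = L3_HIT + L3_MISS implies L3_miss can never exceed L2_miss.
    # If it does, the events are not counting comparable things.
    if l3_miss > l2_lines_in:
        raise ValueError(
            f"inconsistent counts: L3_miss={l3_miss} > L2_miss={l2_lines_in}"
        )

    return l3_miss / l2_lines_in
```

With the counts I am seeing, the check raises, because the RFO counter alone is already larger than L2_LINES_IN.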

Is there any other way to calculate L3 misses? (I have all prefetchers disabled.)


1 Reply

The "architectural" LLC counters (Event 0x2E, Umasks 0x41 and 0x4F) look like they count both demand load misses and demand store misses, but not misses caused by hardware prefetches. Since you have the HW prefetchers disabled, this should be enough?
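As a rough sketch of how those two encodings could be used: on Intel x86 the raw event code that Linux perf accepts is the umask in bits 8-15 and the event select in bits 0-7. The helper names below are hypothetical, and whether the ratio matches the L3_miss/L2_miss definition in the question depends on the assumption that umask 0x4F (LLC references) counts the same requests as L2 misses on this part.

```python
def raw_event(event, umask):
    """Build a perf-style raw event string from event select and umask.

    Intel x86 layout: bits 0-7 = event select, bits 8-15 = umask.
    """
    return f"r{(umask << 8) | event:x}"


# Architectural LLC events (Event 0x2E).
LLC_MISSES = raw_event(0x2E, 0x41)       # "r412e"
LLC_REFERENCES = raw_event(0x2E, 0x4F)   # "r4f2e"


def llc_miss_rate(misses, references):
    """Return LLC misses divided by LLC references."""
    if references <= 0:
        raise ValueError("reference count must be positive")
    return misses / references
```

The two strings can then be passed to something like `perf stat -e r412e,r4f2e ./workload` (where `./workload` is a placeholder), and the resulting counts fed to `llc_miss_rate`.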

BTW, at least on Haswell it looks like disabling the four documented hardware prefetchers is not enough to disable the "next page prefetcher".  This is not a significant performance issue -- with the four documented hardware prefetchers disabled I see approximately one increment of the LOAD_HIT_PRE.HW_PF (Event 0x4C, Umask 0x02) event for each 4KiB page accessed.  For the large arrays I am working with, this is about the same as the DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK event, which is one of the reasons I suspect it is the next page prefetcher at work.

Hmmm... I see that in Table 19-3 (Skylake performance counter events) of revision 057 of Volume 3 of the SWDM, the event 0x4C, Umask 0x01 is named "LOAD_HIT_PRE.HW_PF", but the words and the encoding suggest that it should be "LOAD_HIT_PRE.SW_PF". It is curious that Sandy Bridge, Ivy Bridge, and Haswell define both LOAD_HIT_PRE.SW_PF and LOAD_HIT_PRE.HW_PF, while the table for Broadwell lists only LOAD_HIT_PRE.HW_PF (umask 0x02), and the table for Skylake lists only LOAD_HIT_PRE.HW_PF, but with umask 0x01.
