I think a fairly common question among VTune users is: what is the difference between "MEM_LOAD_RETIRED.L2_LINE_MISS" and "L2_LINES_IN.SELF.ANY", and what conclusions can we draw from each of these metrics?
Let me explain in more detail. I have two different systems, say A and B, and I run the same workload on the same hardware for both (HW prefetching and the adjacent-cacheline prefetcher were both disabled). My goal is to find out which of them has the more optimized memory access behavior, so I measured a couple of memory-related metrics. The results are shown below:
System A
* MEM_LOAD_RETIRED.L2_LINE_MISS events = 12,226,200
* L2_LINES_IN.SELF.ANY events = 667,740,416
* RS_UOPS_DISPATCHED events = 83,848,935,312
* RS_UOPS_DISPATCHED.CYCLES_NONE events = 63,349,782,300
* L2 Cache Miss Rate = 0.004

From which we obtain:
* Total stall time = 43.04%
* Stall time due to L2 misses = 2.49% (assuming a 300-cycle miss penalty)
System B
* MEM_LOAD_RETIRED.L2_LINE_MISS events = 78,123,771
* L2_LINES_IN.SELF.ANY events = 521,005,131
* RS_UOPS_DISPATCHED events = 94,765,782,112
* RS_UOPS_DISPATCHED.CYCLES_NONE events = 51,910,907,586
* L2 Cache Miss Rate = 0.003

From which we obtain:
* Total stall time = 35.39%
* Stall time due to L2 misses = 15.98% (assuming a 300-cycle miss penalty)
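As a sanity check, the arithmetic behind the derived stall percentages can be reproduced from the raw counters. The sketch below assumes the usual formulas (total stall = RS_UOPS_DISPATCHED.CYCLES_NONE / total cycles; L2 miss stall = misses × penalty / total cycles). Note that the total unhalted cycle count (CPU_CLK_UNHALTED.CORE) was not listed above, so it is back-solved here from the reported total stall percentages; that back-solving is my assumption, not something from the original measurement:

```python
MISS_PENALTY = 300  # assumed L2 miss penalty in cycles, as stated above

def stall_percentages(cycles_none, l2_line_miss, total_cycles):
    """Return (total stall %, L2-miss stall %) for a run."""
    total_stall = 100.0 * cycles_none / total_cycles
    l2_stall = 100.0 * l2_line_miss * MISS_PENALTY / total_cycles
    return total_stall, l2_stall

# System A: total cycles back-solved from the reported 43.04% total stall
a_total, a_l2 = stall_percentages(
    63_349_782_300, 12_226_200, 63_349_782_300 / 0.4304)

# System B: total cycles back-solved from the reported 35.39% total stall
b_total, b_l2 = stall_percentages(
    51_910_907_586, 78_123_771, 51_910_907_586 / 0.3539)

print(round(a_l2, 2))  # ≈ 2.49  (system A L2-miss stall %)
print(round(b_l2, 2))  # ≈ 15.98 (system B L2-miss stall %)
```

With these inputs the computed L2-miss stall percentages match the figures above (2.49% for A, 15.98% for B), so the two derived metrics really do come straight from MEM_LOAD_RETIRED.L2_LINE_MISS.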
So, if we look at "L2_LINES_IN" and "L2 Cache Miss Rate", system B seems to be less affected by L2 cache misses than system A. However, if we look at the L2-miss stall time (derived from MEM_LOAD_RETIRED.L2_LINE_MISS), the conclusion is exactly the opposite: a 2.49% L2-miss impact for A versus 15.98% for B. So, my question is: what can I conclude from these numbers that VTune reported?