Hi,
I think a common question among VTune users is: what is the difference between "MEM_LOAD_RETIRED.L2_LINE_MISS" and "L2_LINES_IN.SELF.ANY", and what conclusions can we draw from each of these metrics?
Let me explain in more detail. I have two different systems, say A and B, and I ran the same workload on both, on the same hardware (HW prefetching and adjacent cache line prefetching were both disabled). My goal is to find out which of them has the better-optimized memory access, so I measured several memory-related metrics. The results are shown below:
System A
MEM_LOAD_RETIRED.L2_LINE_MISS events = 12,226,200
L2_LINES_IN.SELF.ANY events = 667,740,416
RS_UOPS_DISPATCHED events = 83,848,935,312
RS_UOPS_DISPATCHED.CYCLES_NONE events = 63,349,782,300
L2 Cache Miss Rate = 0.004
-> from which we obtain:
* Total stall time = 43.04%
* Stall time due to L2 misses = 2.49% (assuming a 300-cycle miss penalty)
System B
MEM_LOAD_RETIRED.L2_LINE_MISS events = 78,123,771
L2_LINES_IN.SELF.ANY events = 521,005,131
RS_UOPS_DISPATCHED events = 94,765,782,112
RS_UOPS_DISPATCHED.CYCLES_NONE events = 51,910,907,586
L2 Cache Miss Rate = 0.003
-> from which we obtain:
* Total stall time = 35.39%
* Stall time due to L2 misses = 15.98% (assuming a 300-cycle miss penalty)
So, if we look at "L2_LINES_IN.SELF.ANY" and "L2 Cache Miss Rate", system B seems to be less affected by L2 cache misses than system A. However, if we look at the L2 miss stall time (derived from MEM_LOAD_RETIRED.L2_LINE_MISS), the conclusion is exactly the opposite: a 2.49% L2 miss impact for A against 15.98% for B. So my question is: what can I conclude from these numbers that VTune reported?