I am trying to understand the performance counters related to L2 misses on the Haswell microarchitecture. Can someone tell me why the L2_RQSTS:MISS counter value is greater than OFFCORE_RESPONSE_1:ANY_REQUEST:ANY_RESPONSE? Sometimes the two values are very close, but for some benchmarks L2_RQSTS:MISS is around 20-30% higher than OFFCORE_RESPONSE_1:ANY_REQUEST:ANY_RESPONSE. Is that because L2 misses to a cache line that is already being serviced do not generate offcore responses? Or is there some other reason? Thanks in advance.
Thanks. So here's what I have observed till now:
Ivy Bridge (Core i7 - 3770): I can get the L2_RQSTS and OFFCORE_RESPONSE counters to match after I turn off the L2 prefetcher:
task -i -e L2_RQSTS:DEMAND_DATA_RD_HIT,L2_RQSTS:ALL_DEMAND_DATA_RD,OFFCORE_RESPONSE_0:DMND_DATA_RD:ANY_RESPONSE,L2_RQSTS:RFO_MISS,OFFCORE_RESPONSE_1:DMND_RFO:ANY_RESPONSE ./cmd
  282424771 L2_RQSTS:DEMAND_DATA_RD_HIT
  453431389 L2_RQSTS:ALL_DEMAND_DATA_RD
  171006619 OFFCORE_RESPONSE_0:DMND_DATA_RD:ANY_RESPONSE
   93759998 L2_RQSTS:RFO_MISS
   93759998 OFFCORE_RESPONSE_1:DMND_RFO:ANY_RESPONSE
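The Ivy Bridge run reports demand-read hits and total demand reads rather than misses, so the match is easiest to see after subtracting one from the other. A minimal sketch of that arithmetic, using only the counts quoted above (the variable names are just labels chosen for this example):

```python
# Counts copied from the Ivy Bridge run above (L2 prefetcher disabled).
dd_hit = 282_424_771      # L2_RQSTS:DEMAND_DATA_RD_HIT
dd_all = 453_431_389      # L2_RQSTS:ALL_DEMAND_DATA_RD
dd_offcore = 171_006_619  # OFFCORE_RESPONSE_0:DMND_DATA_RD:ANY_RESPONSE
rfo_miss = 93_759_998     # L2_RQSTS:RFO_MISS
rfo_offcore = 93_759_998  # OFFCORE_RESPONSE_1:DMND_RFO:ANY_RESPONSE

# Demand-read misses are inferred as (all demand reads) - (demand-read hits).
dd_miss = dd_all - dd_hit

print("inferred demand-read L2 misses:", dd_miss)
print("offcore minus inferred misses :", dd_offcore - dd_miss)
print("offcore RFO minus RFO misses  :", rfo_offcore - rfo_miss)
```

The inferred demand-read miss count differs from the offcore response count by a single event, and the two RFO counts are identical.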
Haswell (Core i7 - 4770): With just the L2 prefetcher turned off, I cannot get L2_RQSTS and OFFCORE_RESPONSE to match:
task -i -e L2_RQSTS:DEMAND_DATA_RD_MISS,OFFCORE_RESPONSE_0:DMND_DATA_RD:ANY_RESPONSE,L2_RQSTS:DEMAND_RFO_MISS,OFFCORE_RESPONSE_1:DMND_RFO:ANY_RESPONSE ./cmd
  147112571 L2_RQSTS:DEMAND_DATA_RD_MISS
  174721797 OFFCORE_RESPONSE_0:DMND_DATA_RD:ANY_RESPONSE
   92979516 L2_RQSTS:DEMAND_RFO_MISS
   47244887 OFFCORE_RESPONSE_1:DMND_RFO:ANY_RESPONSE
When I turn off both the L2 prefetcher and the L1 prefetcher (by writing 0xF to MSR 0x1A4), I get the demand data reads to match, but not the RFOs:
task -i -e L2_RQSTS:DEMAND_DATA_RD_MISS,OFFCORE_RESPONSE_0:DMND_DATA_RD:ANY_RESPONSE,L2_RQSTS:DEMAND_RFO_MISS,OFFCORE_RESPONSE_1:DMND_RFO:ANY_RESPONSE ./cmd
  158255799 L2_RQSTS:DEMAND_DATA_RD_MISS
  158291310 OFFCORE_RESPONSE_0:DMND_DATA_RD:ANY_RESPONSE
   94228412 L2_RQSTS:DEMAND_RFO_MISS
   48380242 OFFCORE_RESPONSE_1:DMND_RFO:ANY_RESPONSE
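For readers wondering what the 0xF written to MSR 0x1A4 selects, here is a small decoding sketch. The bit assignments are an assumption based on Intel's public description of the prefetcher control MSR for cores of this era (a set bit disables the corresponding prefetcher); check the documentation for your specific part before relying on them. The function name is made up for this example.

```python
# Assumed layout of MSR 0x1A4: a set bit DISABLES the named prefetcher.
PREFETCH_DISABLE_BITS = {
    0: "L2 hardware prefetcher",
    1: "L2 adjacent cache line prefetcher",
    2: "L1 DCU prefetcher",
    3: "L1 DCU IP prefetcher",
}


def disabled_prefetchers(msr_value):
    """Return the names of the prefetchers disabled by msr_value."""
    return [
        name
        for bit, name in PREFETCH_DISABLE_BITS.items()
        if (msr_value >> bit) & 1
    ]


for value in (0x0, 0x3, 0xF):
    print(hex(value), disabled_prefetchers(value))
```

Under this assumed layout, 0xF disables all four prefetchers, while a value such as 0x3 would disable only the two L2 prefetchers.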
I have verified Haswell results on two different machines.
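To make the size of the Haswell mismatch easier to read, the sketch below computes the ratio of offcore responses to L2 misses for both configurations. It uses only the counts quoted above; the dictionary keys and the helper name are labels invented for this example.

```python
# Haswell counts copied from the two runs above.
# Each entry is (L2_RQSTS miss count, OFFCORE_RESPONSE count).
haswell_runs = {
    "L2 prefetcher off": {
        "demand_data_rd": (147_112_571, 174_721_797),
        "demand_rfo": (92_979_516, 47_244_887),
    },
    "L1 and L2 prefetchers off": {
        "demand_data_rd": (158_255_799, 158_291_310),
        "demand_rfo": (94_228_412, 48_380_242),
    },
}


def offcore_to_miss_ratio(l2_miss, offcore):
    """Ratio of offcore responses to L2 misses; 1.0 means the counters agree."""
    return offcore / l2_miss


ratios = {
    config: {kind: offcore_to_miss_ratio(*pair) for kind, pair in events.items()}
    for config, events in haswell_runs.items()
}

for config, by_kind in ratios.items():
    for kind, r in by_kind.items():
        print(f"{config:28s} {kind:15s} offcore/miss = {r:.4f}")
```

With the L1 prefetcher still enabled, the offcore demand-read count is roughly 19% higher than the L2 demand-read miss count; with everything disabled the two agree to within a few hundredths of a percent. The RFO ratio stays near 0.51 in both configurations.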
In my tests on SNB-EP systems I did not see any changes when I disabled the L1 HW prefetchers, but the test code that I used only has contiguous accesses that either all miss or all hit at any level of the cache. There is very little in the documentation that helps to understand how the counters treat L1 HW prefetches, so I think quite a broad suite of tests would be required to understand what is going on.
The results so far show that there is something funny going on with L2 HW prefetches -- in my testing I see 20%-30% *fewer* demand read accesses to the L2 (hits + misses) when L2 HW prefetching is enabled. That suggests that the data fetched by the L2 HW prefetchers is getting "picked up" by demand read accesses via a different path -- one that does not increment this performance counter. (Can anyone think of another scenario that would fit these results?)
We know that there are significant undercounts for the LLC events due to some kind of bypassing, since the "partial workarounds" involve (mostly undocumented) features with names like "disable bypass", so it is certainly not implausible that bypass mechanisms exist at the L2 cache level as well. The Haswell results above suggest that L1 HW prefetches are putting data in a bypass path that can be picked up by L1 Data Cache miss demand reads -- something that did not happen on SNB and IVB, but which is not a surprising evolution.
The low counts for OFFCORE_RESPONSE:DMND_RFO:ANY_RESPONSE on the Haswell system might point to a simple bug in the OFFCORE_RESPONSE counter event, or may point to a bypass path that is available to RFOs on Haswell, but not on earlier systems, and which can be activated by demand RFOs (i.e., does not require HW prefetches to activate).
Another possibility is that Haswell handles some interactions between demand accesses and prefetches differently. For example, if an L2 miss buffer is allocated by a HW prefetch and a demand access reaches the buffer before the data is returned, the request type associated with the buffer might be changed and the information about the *original* request type could be lost.
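The request-type hypothesis above can be made concrete with a toy model. This is purely illustrative: the class, the field names, and the promotion rule are invented for this sketch and do not describe documented Haswell behavior. It only shows how counting by the final request type of a miss buffer could differ from counting by the type that originally allocated it.

```python
from collections import Counter


class MissBuffer:
    """Hypothetical L2 miss buffer that remembers who allocated it."""

    def __init__(self, req_type):
        self.original_type = req_type  # type that allocated the buffer
        self.current_type = req_type   # type reported when the fill returns


def simulate(events):
    """events: (cache_line, req_type) pairs, all arriving before any fill returns.

    Returns two Counters: buffers classified by original type and by final type.
    """
    buffers = {}
    for line, req_type in events:
        buf = buffers.get(line)
        if buf is None:
            buffers[line] = MissBuffer(req_type)
        elif req_type == "demand" and buf.current_type == "prefetch":
            # Assumed rule: a demand access hitting an in-flight prefetch
            # relabels the buffer, and the original type is no longer reported.
            buf.current_type = "demand"
    by_original = Counter(b.original_type for b in buffers.values())
    by_final = Counter(b.current_type for b in buffers.values())
    return by_original, by_final


events = [(0, "prefetch"), (0, "demand"), (1, "demand"), (2, "prefetch")]
by_original, by_final = simulate(events)
print("counted by original type:", dict(by_original))
print("counted by final type   :", dict(by_final))
```

In this toy trace one prefetch-allocated buffer ends up reported as a demand request, so a counter keyed on the final type sees more demand traffic than one keyed on the original type. Whether the real hardware behaves this way, and in which direction any relabeling goes, is exactly what the measurements cannot yet settle.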
Another possibility is that Haswell has different policies for HW prefetching. Intel's documentation has always been a bit sparse about RFO prefetches (for both L1 and L2 HW prefetchers), and there is considerable fuzziness about the algorithm used to decide whether the L2 HW prefetches will bring the data into the LLC or into the LLC and the L2. One could also imagine L1 prefetches bringing data into the LLC and L1, but not allocating it in the L2. Unfortunately it is tricky to study any of these topics when you have to deal with uncertainties in both the underlying hardware behavior and the accuracy of the performance counters.
It might be interesting to see if the bypass-disable workarounds for the LLC undercounting on SNB have any impact on these L2 counts.
Unfortunately, the first-order takeaway is that these L2 access counters do not appear to give reliable results under normal operation (i.e., with the L2 HW prefetchers active). The offcore response counters look reliable on SNB and IVB, but (at least) the RFO sub-event might be unreliable on Haswell. This deserves more investigation -- perhaps those events are picked up in another category, or perhaps we will need to go to the CBo counters to capture the information we need. (I have had trouble wrapping my head around the CBo counter definitions, so I have not included them in most of my analyses so far -- now that the Haswell EP Uncore Performance Monitoring Guide is available, I guess it is time to get to work on adding them.)
I have a lot of sympathy for the folks that have to implement the core performance counters -- it is extremely difficult to architect a monitoring facility when the thing being monitored is being designed by a different team and when that thing becomes more complex in unexpected ways from one generation to the next. It is even worse when what you are trying to monitor involves the interaction of two or more subsystems being designed by two or more teams -- their primary focus has to be to (1) get the interaction right, and (2) make its performance better than the previous version. Making sure that all paths between the units report information to the performance monitoring units in the way that the performance monitors expect does not have the same kind of career-determining implications as (1) and (2). It gets worse when there is a fundamental inconsistency between what the performance monitoring unit wants and what the innovative new hardware design actually does. This is one of the main reasons why it has been so hard for any vendor to have a stable set of performance event definitions.