The Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3 (3A, 3B & 3C): System Programming Guide,
Table 19-2. Non-Architectural Performance Events In the Processor Core of Third Generation Intel Core i7, i5, i3 Processors
L2_RQSTS.PF_HIT: Counts all L2 HW prefetcher requests that hit L2.
L2_RQSTS.PF_MISS: Counts all L2 HW prefetcher requests that missed L2.
However, I am unsure what that means.
1) Does "hit" mean that the L2 prefetcher decided to fetch some data but found that it was already in L2? If the requested data was not present or was invalid in L2, does that count as a miss? Or is this a prefetcher miss, i.e. the prefetched data was never used before it was evicted?
2) When there is a miss, does that mean the prefetcher always copies the data into L2, and when there is a hit it does not? Or what does this entail?
Thanks for your help,
I have not tested these sub-events extensively, but I believe that your first guess is correct: L2_RQSTS.PF_HIT is incremented when an L2 hardware prefetch is issued and the corresponding line is found in the L2 cache. This could be considered a "wasted" prefetch, but since it is only accessing the tag array and not the data array it is unlikely to be a performance problem. I don't know if the L2 hardware prefetchers modify their behavior when these hits occur.
Conversely, L2_RQSTS.PF_MISS is incremented when an L2 hardware prefetch is issued and the corresponding line is not found in the L2 cache. This is the preferred case, and should be the most common.
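As a rough illustration of how the two counts relate, the fraction of L2 prefetch requests that were "wasted" (line already present) can be computed from the two sub-events. This is only a sketch: the function name and the counter values are hypothetical, and it assumes PF_HIT and PF_MISS together account for all L2 hardware prefetch requests, which I have not verified.

```python
def l2_prefetch_hit_fraction(pf_hit: int, pf_miss: int) -> float:
    """Fraction of L2 HW prefetch requests that found the line already in L2.

    pf_hit  -- count read from L2_RQSTS.PF_HIT
    pf_miss -- count read from L2_RQSTS.PF_MISS
    Assumes (unverified) that the two sub-events are disjoint and cover
    every L2 hardware prefetch request.
    """
    total = pf_hit + pf_miss
    if total == 0:
        return 0.0  # no prefetch requests were counted in this interval
    return pf_hit / total


# Hypothetical counter readings, for illustration only (not measured data).
hit_fraction = l2_prefetch_hit_fraction(pf_hit=2_000, pf_miss=18_000)
print(f"{hit_fraction:.1%} of L2 prefetch requests hit in the L2")
```

If the interpretation above is right, a small fraction here would be the expected, healthy case.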
An L2 prefetch that misses in the L2 is not guaranteed to bring the data into the L2 cache. As described in section 2.2.5 of the Intel Optimization Reference manual, L2 hardware prefetches always bring the data into the LLC, but do not always bring the data into the L2 cache. The text says "Typically data is brought also to the L2 unless the L2 cache is heavily loaded with missing demand requests." However, a few sentences down an additional case is noted: "When cache lines are far ahead, it prefetches to the last level cache only and not to the L2."
On my Xeon E5-2680 (Sandy Bridge EP) systems running the STREAM benchmark with a single thread (note 1), the load source counters (event D1, masks 01/02/04) typically tell me that 10% of the loads hit in the L1, almost 40% in the L2, and about 10% in the LLC, leaving about 40% from memory. In reality all of the data is coming from DRAM, but the prefetchers are able to get 60% of the loaded data into some level of the cache before the load that requests that data actually arrives. I have seen some cases in which most of the prefetches go into the LLC instead of the L2, but I can't find an example of that right now....
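The arithmetic behind those percentages can be sketched as follows. The function name, the `total_loads` parameter, and the sample counts are hypothetical; the sketch assumes that any load not counted as an L1, L2, or LLC hit was served from memory, which matches the "leaving about 40%" reasoning above but ignores any other load sources the hardware may report.

```python
def load_source_breakdown(l1_hit: int, l2_hit: int, llc_hit: int,
                          total_loads: int) -> dict:
    """Share of loads served by each level, given Event D1 hit counts.

    l1_hit, l2_hit, llc_hit -- counts from Event D1 with masks 01, 02, 04
    total_loads             -- total retired loads (hypothetical input; how
                               it is obtained is outside this sketch)
    Assumption: loads not counted as a cache hit came from memory.
    """
    cached = l1_hit + l2_hit + llc_hit
    if total_loads <= 0 or cached > total_loads:
        raise ValueError("inconsistent counter values")
    counts = {
        "L1": l1_hit,
        "L2": l2_hit,
        "LLC": llc_hit,
        "memory": total_loads - cached,
    }
    return {level: count / total_loads for level, count in counts.items()}


# Made-up counts chosen to mirror the rough 10% / 40% / 10% / 40% split.
shares = load_source_breakdown(l1_hit=100, l2_hit=400, llc_hit=100,
                               total_loads=1_000)
covered_by_prefetch = 1.0 - shares["memory"]
for level, share in shares.items():
    print(f"{level:>6}: {share:.0%}")
print(f"found in some cache level: {covered_by_prefetch:.0%}")
```

For a streaming workload where all data originates in DRAM, the "found in some cache level" figure is effectively a measure of how much of the data the prefetchers delivered ahead of the demand loads.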
(Note 1: Be aware that the Event D1 counts do not give useful answers for AVX loads -- I wish that had been documented in Volume 3 of the SW Developer's Guide or in the Processor Specification Updates instead of buried in Appendix B.3.4.1 of the Optimization Reference Manual!)