I can see a lot of contradictory information on the internet about several of the hardware events, and I'm hoping someone can clear this one up for me (or at least point me to the 'correct' manual that I can treat as the definitive source of event descriptions). My CPU is an Intel® Xeon® Processor E5-2670 (Sandy Bridge).
The first source is Chapter 19, "Performance-Monitoring Events", of the system programming guide, specifically Section 19.6 on the Sandy Bridge events (page 19-44 of Vol. 3B, or page 340 of the PDF):
Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3B: System Programming Guide, Part 2: https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.pdf (PDF)
This states that L2_RQSTS.ALL_DEMAND_DATA_RD "Counts any demand and L1 HW prefetch data load requests to L2." This seems to contradict the 'demand' nomenclature, but that's OK if the event really does include L1 HW prefetching.
However, I have found contradictory information in the Intel® 64 and IA32 Architectures Performance Monitoring Events manual, at https://software.intel.com/sites/default/files/managed/8b/6e/335279_performance_monitoring_events_guide.pdf?ref=hvper.com (PDF), in the section "Performance Monitoring Events based on Sandy Bridge", on page 142 (143 of the PDF).
This states that L2_RQSTS.ALL_DEMAND_DATA_RD counts "Demand Data Read requests." Some other architectures in this manual describe the same event as including L1 HW prefetcher requests, so this seems like an intentional omission from this specific architecture.
Which is correct? Is one of these more 'definitive' than the other? From some quick tests, I think it does include prefetching, but I can't be sure. Some online advice uses the event as if it does include prefetching, but some other locations definitely describe it as 'only demand data reads' which presumably excludes prefetching.
Thanks for any help!
Intel's documentation has always been ambiguous on this issue -- and on myriad similar distinctions....
I have not bothered to try to disambiguate this one because the L1 HW prefetchers don't generate very many prefetches, and when they do generate prefetches, they are to lines that are extremely likely to be used. If you think you have a workload that might have significant "bad" L1 HW prefetches, you can always turn off the L1 HW prefetchers to see if the counts on this event change.
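As a rough sketch of that experiment: on many Intel cores of this generation the hardware prefetchers are controlled through MSR 0x1A4, where a set bit disables the corresponding prefetcher. The bit layout used below (bit 2 for the DCU streamer prefetcher, bit 3 for the DCU IP prefetcher) is an assumption based on Intel's public disclosure of the prefetcher control bits and should be verified for your specific processor. The helper names are hypothetical, and the code only computes the register value; actually writing it requires root privileges and a tool such as wrmsr from msr-tools.

```python
# Assumed layout of the prefetcher control MSR (0x1A4).
# A SET bit DISABLES the prefetcher. Verify against your CPU before use.
PREFETCH_CONTROL_MSR = 0x1A4
L2_HW_PREFETCHER = 1 << 0   # L2 streamer
L2_ADJACENT_LINE = 1 << 1   # L2 adjacent cache line prefetcher
DCU_STREAMER = 1 << 2       # L1 data cache (DCU) next-line prefetcher
DCU_IP = 1 << 3             # L1 data cache (DCU) IP-based stride prefetcher

L1_PREFETCHERS = DCU_STREAMER | DCU_IP


def disable_l1_prefetchers(current: int) -> int:
    """Return the MSR value with both L1 (DCU) prefetchers disabled."""
    return current | L1_PREFETCHERS


def enable_l1_prefetchers(current: int) -> int:
    """Return the MSR value with both L1 (DCU) prefetchers re-enabled."""
    return current & ~L1_PREFETCHERS


if __name__ == "__main__":
    before = 0x0  # hypothetical current value, e.g. as read with rdmsr
    after = disable_l1_prefetchers(before)
    print(f"write {after:#x} to MSR {PREFETCH_CONTROL_MSR:#x} on every core")
```

Running the same workload with the L1 bits set and cleared, and comparing the L2_RQSTS.ALL_DEMAND_DATA_RD counts from the two runs, would indicate whether the event includes L1 HW prefetches on a given machine. The L2 bits are left untouched so that only the L1 prefetchers change between runs.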
I have not seen any evidence that L2_RQSTS.ALL_DEMAND_DATA_RD is biased upward by counting both an L1 HW prefetch and a subsequent demand load to the same address, but there are (as always) lots of special cases to consider.... From my studies on Haswell a few years ago, it looks like L2_RQSTS.ALL_DEMAND_DATA_RD does not increment when transactions are rejected/retried, while L2_TRANS.DEMAND_DATA_RD does increment on retries.
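For anyone who wants to compare the two events directly, here is a minimal sketch of building raw event codes in the rUUEE form accepted by Linux perf (umask in the high byte, event select in the low byte). The Sandy Bridge encodings shown (event 0x24 with umask 0x03, and event 0xF0 with umask 0x01) are my reading of the event tables and should be checked against the manual for your processor; the umasks differ on Haswell. The function names are hypothetical.

```python
def perf_raw_event(event: int, umask: int) -> str:
    """Build a perf raw event string: 'r' + umask byte + event-select byte."""
    if not (0 <= event <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event and umask must each fit in one byte")
    return f"r{umask:02x}{event:02x}"


# Assumed Sandy Bridge encodings; verify against the event tables.
L2_RQSTS_ALL_DEMAND_DATA_RD = perf_raw_event(0x24, 0x03)
L2_TRANS_DEMAND_DATA_RD = perf_raw_event(0xF0, 0x01)


def estimated_retries(l2_trans_count: int, l2_rqsts_count: int) -> int:
    """Rough retry estimate, assuming L2_TRANS counts retries and L2_RQSTS
    does not. Clamped at zero because the two counters are not guaranteed
    to be otherwise identical."""
    return max(0, l2_trans_count - l2_rqsts_count)


if __name__ == "__main__":
    print(L2_RQSTS_ALL_DEMAND_DATA_RD, L2_TRANS_DEMAND_DATA_RD)
```

The two strings can be passed to perf stat as a comma-separated event list so that both counters are collected in the same run. The retry estimate is only as good as the assumption that retries are the sole difference between the two events.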
Erratum HSD78 for the Haswell desktop processors says that event 0x24 (L2_RQSTS) can incorrectly count fetches from the Next Page Prefetcher when attempting to count Demand Data Reads. Interestingly, on this platform neither the L2 demand read miss nor the L2 demand read hit sub-event mentions L1 HW prefetches, but the combined "all demand data read" sub-event does say that it includes L1 HW prefetches. I have not tried comparing the three events from a single run to see if the combined event is greater than the sum of the hits and misses. The event on Haswell defines the Umask bits quite differently than on Sandy Bridge, so there are probably important implementation differences. (Not that Haswell or other processors have more clarity in the definitions -- they just have different ambiguities....)
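The three-event comparison mentioned above amounts to simple arithmetic on counts collected in a single run. A minimal sketch follows; the function names and the sample counts are hypothetical and not measured data.

```python
def combined_excess(all_reads: int, hits: int, misses: int) -> int:
    """Count by which the combined event exceeds hits + misses.

    A clearly positive value would suggest the combined event counts
    something (such as L1 HW prefetches) that the sub-events do not.
    """
    return all_reads - (hits + misses)


def relative_excess(all_reads: int, hits: int, misses: int) -> float:
    """Excess as a fraction of the combined count (0.0 if the count is zero)."""
    if all_reads == 0:
        return 0.0
    return combined_excess(all_reads, hits, misses) / all_reads


if __name__ == "__main__":
    # Hypothetical counts from one run, for illustration only.
    all_reads, hits, misses = 1_000_000, 700_000, 250_000
    print(combined_excess(all_reads, hits, misses))
    print(f"{relative_excess(all_reads, hits, misses):.1%}")
```

A small nonzero difference is not conclusive on its own, since counters read at slightly different times or multiplexed by the profiling tool can disagree by a small amount; all three events should be scheduled on the hardware counters simultaneously.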
The best description I have seen of the operation of the L1 HW prefetcher is in the first response at https://stackoverflow.com/questions/53517653/in-which-condition-dcu-prefetcher-start-prefetching
Hi Dr. McCalpin,
Sorry for the late reply - my question was asked in the midst of some investigative work, for which your reply was very helpful. Although I haven't (at least yet) run a quantitative experiment on the impact of the L1 prefetchers, I just wanted to be a bit more confident in the event definition - thanks!
At least on Skylake, the L2_RQSTS counter can distinguish between L1 prefetch requests (SW and HW are lumped together, though) and L1 demand requests, and the event definition for "demand read" doesn't include L1 prefetches. More details at: