Section 126.96.36.199 of the Intel optimization manual (248966-040, April 2018) includes a list of conditions that must be met in order for prefetching into the L1 to be triggered. One of these conditions is: "No fence is in progress in the pipeline."
What does this condition mean exactly? I've tried to run some experiments to see if a "fence" instruction that is "in progress" affects the behavior of the DCU prefetcher, but I couldn't observe any change in its behavior.
Section 188.8.131.52 might be specific to SnB (Sandy Bridge), but I've only run experiments on HSW (Haswell). I'm not sure to what extent the conditions listed in that section apply to later microarchitectures. Also, the way the list is written suggests that it applies to both L1 prefetchers, but I'm wondering whether the condition "No fence is in progress in the pipeline." actually applies only to the IP prefetcher and not to the DCU prefetcher. Either way, what does it mean? I'm particularly interested in observing this condition empirically.
I've checked older versions of the optimization manual. The phrase "No fence is in progress in the pipeline" first appeared with the introduction of Sandy Bridge, so it could be specific to that microarchitecture, on which I have not run any prefetching experiments.
I have never had much luck trying to understand the L1 HW prefetchers. They are fairly conservative, so they don't provide much performance boost or performance penalty in any codes that I have reviewed. To make it more frustrating, the performance counter documentation has always been particularly weak at documenting any distinction between counts for L1 HW prefetches and those for demand accesses.
The phrase "no fence is in progress in the pipeline" is irritatingly vague. LFENCE prevents later instructions (including demand loads) from executing until it completes, so it is possible that they mean this restriction also carries over to the HW prefetchers. MFENCE prevents later memory references from executing, so the same interpretation may apply. I can't think of a reason why SFENCE would block L1 HW prefetches -- Section 184.108.40.206 says that L1 HW prefetches are generated by sequences of "load operations", but this is also ambiguous -- RFOs are often considered "load operations" once you get outside of the L2 cache.
McCalpin, John (Blackbelt) wrote:
To make it more frustrating, the performance counter documentation has always been particularly weak at documenting any distinction between counts for L1 HW prefetches and those for demand accesses.
At least on Skylake (and probably others) you can get some insight into L1 hardware prefetches using an undocumented feature of the L2_RQSTS counter: the 0x80 bit in the umask for that event isolates L2 accesses that are caused by L1D prefetches only, both software and hardware. So if your code contains no software prefetches, you can use this bit to count L1D hardware prefetches alone (this possibly includes accesses by the next-page prefetcher, I'm not sure).
Some more details in the "origin" table here.
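For anyone who wants to try the umask bit described above, here is a minimal sketch of how it could be turned into an event string for Linux perf. It rests on several assumptions that are not from this thread: that L2_RQSTS uses event select 0x24 on your CPU (check the event list for your model), that perf's raw `rUUEE` and `cpu/event=...,umask=.../` syntaxes are available, and that the helper names `raw_event` and `named_event` are purely illustrative. The bare 0x80 umask is only a placeholder; which other umask bits it has to be combined with should be taken from the "origin" table.

```python
# Sketch only: builds perf event strings, it does not run perf or read counters.

L2_RQSTS_EVENT = 0x24      # assumption: event select for L2_RQSTS, verify for your CPU
L1D_PF_ORIGIN_BIT = 0x80   # the undocumented umask bit described in the post above


def raw_event(event: int, umask: int) -> str:
    """Return perf's raw encoding rUUEE (UU = umask, EE = event select)."""
    if not (0 <= event <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event and umask must each fit in one byte")
    return f"r{umask:02x}{event:02x}"


def named_event(event: int, umask: int, name: str) -> str:
    """Return the equivalent cpu/event=..,umask=..,name=../ form."""
    if not (0 <= event <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event and umask must each fit in one byte")
    return f"cpu/event={event:#04x},umask={umask:#04x},name={name}/"


if __name__ == "__main__":
    # Placeholder umask: OR in whatever other bits the "origin" table calls for.
    umask = L1D_PF_ORIGIN_BIT
    print("perf stat -e", raw_event(L2_RQSTS_EVENT, umask), "./your_benchmark")
    print("perf stat -e", named_event(L2_RQSTS_EVENT, umask, "l2_rqsts_l1d_pf"),
          "./your_benchmark")
```

Running the same benchmark with and without a fence in the loop and comparing this count would be one way to look for the "no fence is in progress" condition, provided the code issues no software prefetches.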