I was exploring how the different hardware prefetchers of my Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz system behave. I performed some experiments to understand when these prefetchers are invoked and how many lines they prefetch.
I disabled all but one prefetcher at a time to observe its behavior in isolation.
I was able to work out the behavior of three of the prefetchers, namely the L1 IP prefetcher, the L2 adjacent line prefetcher, and the L2 H/W (stride) prefetcher.
Here are my results:
1) The L1 IP prefetcher starts prefetching after 3 cache misses (X, X+d, X+2d). It then prefetches on cache hits, and only one cache line (X+3d) is prefetched at a time.
2) The L2 adjacent line prefetcher starts prefetching after the 1st cache miss, and prefetches on cache misses. It also prefetches only one cache line.
3) The L2 H/W (stride) prefetcher starts prefetching after two cache misses (X, X+d) and prefetches on cache hits. It also prefetches only one cache line (X+2d).
I am not able to work out the behavior of the L1 adjacent line prefetcher: after how many cache misses does it start prefetching, how many cache lines does it prefetch, and does it prefetch on a cache hit or a miss?
Is there any way to find out about the L1 adjacent line prefetcher?
Intel does not typically disclose the details of the hardware prefetch algorithms.
There are some exceptions. The "Intel 64 and IA-32 Architectures Optimization Reference Manual" (document 248966-040, April 2018) presents more detail than usual in its section on data prefetching in the Sandy Bridge core. The L1 streaming prefetcher ("DCU prefetcher") is described as fetching one line beyond an ascending stream of addresses.
The description of the behavior of the "L2 streaming prefetcher" differs from your observations: it can issue either one or two prefetches in response to each L2 lookup, and can run up to 20 lines ahead of the most recent load request.
Your processor (Core i7-3770) is based on the Ivy Bridge core, which is very similar to the Sandy Bridge core. One change relevant to your investigation is the addition of a "next page prefetcher", which is "triggered by sequential accesses to cache lines approaching the page boundary, either upwards or downwards". The text does not say how many lines are fetched, but my experiments suggest that it only fetches one cache line. The biggest impact of the next page prefetcher is that it causes the TLB lookup for the next page to happen earlier, and will trigger a page table walk if one is required.

I am not aware of any documentation beyond the two sentences in section 2.4.7 of the software optimization manual, but my experiments show that this next page prefetcher also exists in the Haswell core. The Skylake core has two page table walkers (section 2.2.3 of the software optimization manual), which can be useful when HyperThreading is enabled, and also reduces the potential for slowdown if the next page prefetcher causes a useless page table walk that (on cores with a single page table walker) would delay the execution of a useful one.
I am not aware of any such "L1 adjacent line prefetcher". There is an L2 adjacent line prefetcher, but the two L1 prefetchers are described as:
DCU prefetcher: Fetches the next cache line into the L1 data cache
DCU IP prefetcher: Uses sequential load history (based on the Instruction Pointer of previous loads) to determine whether to prefetch additional lines
Both descriptions are a bit vague (what does "next" mean, for example), but neither is really an adjacent line prefetcher, which fetches the line that completes a naturally aligned 128-byte pair (i.e., "adjacent" can mean either the immediately preceding or the following cache line).
The experimental results quoted at the top of this thread make sense as to how many misses trigger each prefetcher. There is a fixed limit on the number of prefetch streams, which may vary with the CPU model; the limit is much smaller for streams of decreasing addresses. If a new prefetch stream is required, an existing stream is discarded, so the data associated with the discarded stream would no longer be "very recently loaded." Prefetch streams are also stopped when they reach the end of a page (4 KiB by default). Recent CPUs added a next page prefetcher as a sort of exception to that rule, which matters since the Windows OS (unlike Linux) doesn't have a "huge TLB" facility. It seems to be a sort of TLB prefetcher: it ameliorates the effect of the TLB miss at the boundary (when a strided prefetch is active?), but the strided prefetch streams themselves still stop at the page boundary.
From Sandy Bridge on, the strided prefetcher has a dynamically adjusted prefetch distance. It may, as you said, prefetch just one cache line at a time, but repeated accesses within a prefetch stream will trigger prefetches of additional cache lines at a faster rate than they are consumed, up to some limit (in addition to the page boundary limit). The limit on prefetch distance decreases as more cache misses are outstanding (including those associated with prefetches), so the number of active threads will influence it. That can be useful, since a core prefetching lines that another core is about to modify would degrade performance.
Even older CPUs would run by default with both the strided and the adjacent line prefetchers active, so a strided prefetch triggers an adjacent line prefetch. For some CPU models, including Sandy Bridge, there was published guidance on which classes of application (databases, in particular) generally benefit from disabling the adjacent line prefetch. In my own experience, the usefulness of adjacent line prefetch decreased as the number of threads increased. I didn't investigate in much detail, but strided prefetch also becomes less effective as more threads are active within a single LLC. I had applications that used up to 10 cores (but not more) quite effectively.