Instruction prefetcher - missing from Optimization Manual

Russell_Van_Zandt · ‎02-27-2019

As detailed below, the instruction prefetcher is not documented in Intel's Optimization Reference Manual (April 2018, 248966-040). Besides an eventual update of the documentation, I request a recommendation on how to prefetch several dozen short assembly language procedures. The Intel processors are mostly Xeon Scalable Skylake AVX-512 (model 06_55h), with some Broadwell Xeon E5 v4 (06_4Fh). These processors have a roomy L2 cache, so my thought is to preload there, avoiding possible thrashing of the L1 and micro-op caches.

Section 3.4.1.5 of the Optimization Manual promises 'See Section 3.7, "Prefetching," on optimizing the instruction prefetcher'. But there is no information about the instruction prefetcher there, only a title. Section 3.7 begins with three bullets:

- Hardware instruction prefetcher.
- Software prefetch for data.
- Hardware prefetch for cache lines of data or instructions.

The next section combines the first two bullets in its title, "3.7.1 Hardware Instruction Fetching and Software Prefetching", but has absolutely no information about the first topic, the hardware instruction prefetcher. This combination is most unfortunate because Intel warns us in three different sections of the same manual NOT to use the software prefetcher for instructions: "The software-controlled prefetch is intended for prefetching data, but not for prefetching code." (sections 2.5.4.2, 2.4.5.4, 7.5.1)

HadiBrais · ‎02-27-2019

My understanding of the L1 instruction cache prefetcher is that it prefetches cache lines sequentially. If the branch prediction unit predicts that the control flow will change to a particular address, the instruction prefetcher will prefetch that line and zero or more lines that sequentially follow the target line. The instruction prefetcher may become inactive when there are no accesses to the instruction cache, i.e., when uops are being delivered from the DSB or LSD. On a DSB miss, the fetch unit will attempt to fetch the target line from the instruction cache, which may then trigger the instruction prefetcher. We know that none of the data prefetchers can cross a 4 KB page boundary. This restriction may apply to the instruction prefetcher as well.

The instruction cache is inclusive of the uop cache (also called the decoded icache or DSB), which in turn is inclusive of the LSD (the LSD is disabled on all Skylake processors, but enabled on Broadwell). In addition, on BDW, the L2 is non-inclusive of the instruction cache and the L3 is inclusive of the instruction cache. On SKX, the L2 is inclusive of the instruction cache and the L3 is non-inclusive of the instruction cache. Therefore, if an instruction cache line is evicted from an inclusive cache, it will also be evicted from the instruction cache, the uop cache, and the LSD.

Instruction cache misses typically occur when there is a miss in the uop cache and one of the following situations occurs:

- A branch misprediction occurs and the actual target line is not in the instruction cache, either because it has never been fetched before, because it was evicted when the code being executed was too large to fit in the instruction cache, or because it was evicted from an inclusive cache.
- Instruction cache lines get evicted because the working set is too large to fit in the cache level that includes the instruction cache and the data is being accessed temporally.

I don't think it makes sense to use PREFETCHT0 to prefetch an instruction cache line into the L1D, because if that line is not found in the L2 when it is needed, it will not be fetched from the L1D but from the L3. One may get the impression that it might be useful to use PREFETCHT1 to prefetch instruction lines into the L2 and L3 (but not the L1). The problem here is that on a TLB miss, a software prefetch instruction will fill the page table entry into the data TLB, not the instruction TLB. Then, on an instruction cache miss, the ITLB would still not contain the required page table entry even if the instruction cache line is in the L2, although the entry may be found in the STLB. So it's not clear to me how beneficial it would be to use PREFETCHT1 to prefetch an instruction cache line into the L2. But you can try. The following counters may be useful for evaluating the effectiveness of this approach: FRONTEND_RETIRED.ITLB_MISS, ITLB_MISSES.STLB_HIT, L2_RQSTS.CODE_RD_HIT, and L2_RQSTS.CODE_RD_MISS. FRONTEND_RETIRED.ITLB_MISS is not supported on Broadwell; you can use ITLB_MISSES.MISS_CAUSES_A_WALK instead. On the other hand, using PREFETCHT0 to prefetch an instruction cache line is not necessary and can have a negative impact on performance.
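If you do want to run the PREFETCHT1 experiment, a minimal sketch in C might look like the following. It assumes a 64-byte cache line and a GCC or Clang compiler, where `__builtin_prefetch` with locality 2 typically compiles to PREFETCHT1 on x86-64 (worth confirming in the disassembly). The helper name `prefetch_code_to_l2` is hypothetical and not part of any Intel or compiler API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u /* assumed line size for the processors discussed */

/*
 * Issue one software prefetch per cache line covering [start, start + len).
 * Returns the number of lines requested.
 * Hypothetical helper for experimentation only.
 */
static size_t prefetch_code_to_l2(const void *start, size_t len)
{
    assert(start != NULL || len == 0);
    if (len == 0)
        return 0;

    const uintptr_t mask = ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t first = (uintptr_t)start & mask;
    uintptr_t last = ((uintptr_t)start + len - 1) & mask;
    size_t lines = 0;

    for (uintptr_t p = first;; p += CACHE_LINE) {
        /* rw = 0 (read), locality = 2: usually PREFETCHT1 on x86-64. */
        __builtin_prefetch((const void *)p, 0, 2);
        lines++;
        if (p == last)
            break;
    }
    return lines;
}
```

The start address and length of each procedure would have to come from your own symbols (for example, labels exported by the assembler). Converting a function pointer to `const void *` is implementation-defined in ISO C, although it works on the usual x86-64 toolchains. Whether this helps at all is exactly what the counters listed above are meant to show.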

It might be more effective to reduce the branch mispredictions that result in instruction cache misses so that the required lines are prefetched automatically for you. Or you can also reduce the working code set size or the working data set size. In fact, this is how the impact of instruction cache misses is traditionally alleviated.

You can also try allocating the code in huge pages (2 MB), which reduces the number of ITLB entries the code needs. This matters because the uop cache (and therefore the LSD too) is fully flushed when an ITLB entry is evicted.
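As a rough illustration of the ITLB footprint, the sketch below counts how many pages a code region touches for a given page size, assuming one ITLB entry per page. The function name `pages_spanned` and the example addresses are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Number of pages of size page_size (a power of two) touched by the
 * byte range [start, start + len). Hypothetical helper for estimation.
 */
static size_t pages_spanned(uintptr_t start, size_t len, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    if (len == 0)
        return 0;

    uintptr_t first = start / page_size;             /* index of first page */
    uintptr_t last = (start + len - 1) / page_size;  /* index of last page */
    return (size_t)(last - first) + 1;
}
```

For example, a 3 MiB code region starting at address 0x400000 touches 768 pages of 4 KiB but only 2 pages of 2 MiB, so far fewer ITLB entries are at risk of eviction.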

There are no instructions to prefetch into the instruction cache, the uop cache, or the LSD. The only way to bring instructions into the uop cache or the LSD is by executing them at least speculatively (on a mispredicted path).

Travis_D_ · ‎02-28-2019

Basically what Hadi said. A couple more points:

If you don't care about total performance including the cost of prefetching, but only want to make sure the targeted instructions are in the cache when they are executed (for example, to avoid i-misses in a benchmark, or because you have idle time to do prefetching before the "real work" begins), you can do a type of poor-man's prefetching by executing code next to the targeted code. E.g., start a function a few bytes into a cache line, place a "stub" function that simply returns to the caller in the first few bytes, and execute this stub function to prefetch the line. You can extend this to multiple lines through various methods. You don't even need to execute in the same line: if you execute linearly through the cache lines prior to the targeted code, the natural i-fetch and i-prefetch mechanisms will bring in a few of the following lines. I have used both of these techniques to "warm" code prior to executing it. You can perhaps get code all the way into the DSB if you are lucky.

Finally, I don't 100% agree with Hadi about the ineffectiveness of L2 prefetching. I think L2 prefetching is likely to be very effective for code that would otherwise miss to DRAM. Yes, you don't warm the ITLB or the L1I, but you warm the L2 and STLB, so you are taking something like a dozen cycles for the first execution, compared to 100s if you miss to DRAM. So it's something you could experiment with if you are suffering that type of miss.

ychen354 · ‎11-10-2020

Can any CPU prefetch both paths of a branch instruction, so that the instructions are always cached even when branch prediction fails?

For example:

    add $10, %rax
    cmp %rax, %rcx
    je L10
    add $20, %rbx
    ...
L10:
    add $30, %rcx
    ...

The CPU could prefetch both the instructions at L10 and those after the je. Then, even if the prediction fails, the code is already in the icache.