Optimizing instruction prefetcher (deleted from Intel's Optimization Guide)

Does anyone have information about optimizing the instruction cache? This material used to be in Intel's Optimization Guide but was deleted a while back. The current Optimization Guide lures us by describing the benefits of optimizing, but the promised material is missing. Section 3.4.1.4 says: "Careful arrangement of code can enhance cache and memory locality ... See Section 3.7, 'Prefetching', on optimizing the instruction prefetcher." But even though Section 3.7 has a bullet point for "Hardware instruction prefetcher," there is no corresponding subheading or material. What little information is available just makes it sound like prefetching is sequential, which is not surprising. Is "code locality" ALL there is to it?
1 Reply
HadiBrais
New Contributor III

The instruction prefetcher submits requests to the instruction cache to prefetch instruction bytes sequentially in the virtual address space. The goal is to keep the pipeline fed with a steady stream of instructions so that it doesn't stall due to instruction cache misses. Certain situations cause the instruction prefetcher to be resteered (i.e., it will start prefetching from a non-sequential location). These include taken branch predictions, software events, and non-blocked hardware events.

The advice that "Careful arrangement of code can enhance cache and memory locality" refers to laying out hot code blocks sequentially in (virtual or physical) memory to maximize utilization of the instruction cache capacity. That's because of how the placement policy works: sequential cache lines are mapped to sequential sets in the instruction cache. So to make use of all of the cache slots and avoid conflict misses, all accessed instruction cache lines should be sequential and should fit within the instruction cache (for each program phase, such as a hot loop or function). The same advice applies to the data cache.

This is important for two reasons. First, it avoids repeated fetching of the same lines, which wastes energy and can lead to instruction cache miss pipeline stalls. Second, the instruction cache is inclusive of the uops cache and the LSD buffer, so making sure that hot lines fit in the instruction cache helps get hits in the uops cache as well. It's worth mentioning that ITLB misses can significantly increase the effective penalty of an instruction cache miss.

Section 3.7 never contained information on optimizing the instruction prefetcher or the instruction cache as far as I know. It used to mention that the instruction prefetcher reads instructions from the instruction cache; this was removed, but the title of Section 3.7.1 and the first paragraph of Section 3.4.1.4 were not updated. Also, Section 3.7 starts by saying that recent Intel processor families employ hardware instruction prefetching, but even the Intel 8086 had an instruction prefetcher.

There are several hardware performance monitoring events that can be used to quantify bottlenecks related to the instruction cache and the ITLBs, and these are accounted for in the Frontend Bound branch of Intel's top-down analysis methodology.
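As a rough starting point, Linux `perf` exposes generic aliases for some of these events. The sketch below is illustrative only: `./your_app` is a placeholder, the generic aliases map to different hardware events on different CPUs, and full top-down metrics may need a recent perf and an Intel CPU.

```shell
# Sketch: count instruction-cache and ITLB miss events for a workload.
# Generic perf aliases; the underlying hardware events vary by CPU.
perf stat -e L1-icache-load-misses,iTLB-loads,iTLB-load-misses -- ./your_app

# Top-down level-1 breakdown (Frontend Bound, Backend Bound, etc.);
# older perf versions require system-wide mode (-a) for --topdown.
perf stat --topdown -a -- sleep 1
```

A high Frontend Bound fraction together with elevated icache/ITLB miss counts is the usual signal that the code-layout advice above is worth pursuing.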
