The instruction prefetcher submits requests to the instruction cache to prefetch instruction bytes sequentially in the virtual address space. The goal is to keep the pipeline fed with a steady stream of instructions so that it doesn't stall due to instruction cache misses. Certain situations cause the instruction prefetcher to be resteered (i.e., it will start prefetching from a non-sequential location). These include taken branch predictions, software events, and non-blocked hardware events.
The advice that "Careful arrangement of code can enhance cache and memory locality" refers to laying out hot code blocks sequentially in (virtual or physical) memory to maximize utilization of the instruction cache capacity. This follows from how the placement policy works: sequential cache lines are mapped to sequential sets in the instruction cache. So to make use of all of the cache slots and avoid conflict misses, the instruction cache lines accessed in each program phase (such as a hot loop or function) should be sequential and should fit within the instruction cache. (The same advice applies to the data cache.) This is important for two reasons. First, it avoids repeated fetching of the same lines, which wastes energy and can lead to pipeline stalls on instruction cache misses. Second, the instruction cache is inclusive of the uop cache and the LSD buffer, so making sure that hot lines fit in the instruction cache helps get hits in those structures as well. It's also worth mentioning that ITLB misses can significantly increase the effective penalty of an instruction cache miss.
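To make the placement argument concrete, here is a minimal sketch of how line addresses map to sets. The geometry (32 KiB, 8-way, 64-byte lines, hence 64 sets) and the simple modulo indexing are assumptions chosen for illustration, not a description of any specific processor, and the helper names `set_index` and `max_lines_per_set` are made up for this example.

```python
from collections import Counter

# Assumed cache geometry, for illustration only.
LINE_SIZE = 64                                   # bytes per cache line
NUM_WAYS = 8                                     # lines that fit in one set
CACHE_SIZE = 32 * 1024                           # total capacity in bytes
NUM_SETS = CACHE_SIZE // (LINE_SIZE * NUM_WAYS)  # 64 sets


def set_index(addr):
    """Set that the line containing addr maps to (simple modulo indexing)."""
    return (addr // LINE_SIZE) % NUM_SETS


def max_lines_per_set(addrs):
    """Largest number of distinct lines that land in any single set."""
    lines = {a // LINE_SIZE for a in addrs}
    counts = Counter(line % NUM_SETS for line in lines)
    return max(counts.values())


base = 0x400000  # hypothetical start address of the hot code

# 512 sequential lines (exactly the cache capacity): every set gets 8 lines.
contiguous = [base + i * LINE_SIZE for i in range(CACHE_SIZE // LINE_SIZE)]

# Only 16 lines, but spaced 4 KiB apart: all of them land in the same set.
scattered = [base + i * LINE_SIZE * NUM_SETS for i in range(16)]

for name, addrs in (("contiguous", contiguous), ("scattered", scattered)):
    worst = max_lines_per_set(addrs)
    verdict = "fits" if worst <= NUM_WAYS else "conflict misses likely"
    print(f"{name}: {len(addrs)} lines, worst set holds {worst} -> {verdict}")
```

Under these assumptions, the contiguous layout uses every slot in the cache without exceeding the associativity of any set, whereas the scattered layout touches far fewer lines yet oversubscribes a single set, which is what leads to conflict misses.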
As far as I know, Section 3.7 never contained information on optimizing for the instruction prefetcher or the instruction cache. It used to mention that the instruction prefetcher reads instructions from the instruction cache. This was removed, but the title of Section 3.7.1 and the first paragraph of Section 22.214.171.124 were not updated. Also, Section 3.7 starts by saying that recent Intel processor families employ hardware instruction prefetching, but even the Intel 8086 has an instruction prefetcher.
There are several hardware performance monitoring events that can be used to quantify bottlenecks related to the instruction cache and the ITLBs, and these are accounted for in the Frontend Bound branch of Intel's top-down analysis methodology.