Why is perf l2_rqsts.all_pf not 0 when I disable prefetchers with msr tools?

esal_04 · 06-06-2024

I am using an Intel Xeon Gold processor (Cascade lake, 06_55H).

I am also using perf to monitor some performance counters, including l2_rqsts.all_pf.

According to Table 2-20, I can control the hardware prefetchers with bits 3:0 of the MSR at address 0x1A4. When I disable all of them (by writing 0xF), the value of l2_rqsts.all_pf drops by two orders of magnitude but does not go all the way down to zero.
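For reference, the write described above can be sketched in Python against the Linux msr driver instead of the msr-tools binaries. This is a minimal sketch, not a tested procedure: it assumes the msr kernel module is loaded, that the script runs as root, and that /dev/cpu/N/msr is available. The helper names (read_msr, write_msr, with_prefetchers_disabled) are made up for illustration.

```python
import os
import struct

MSR_PREFETCH_CONTROL = 0x1A4   # MSR address quoted in the post
PREFETCH_DISABLE_MASK = 0xF    # bits 3:0, one disable bit per prefetcher


def with_prefetchers_disabled(value: int) -> int:
    """Return the MSR value with bits 3:0 set, leaving other bits unchanged."""
    return value | PREFETCH_DISABLE_MASK


def read_msr(cpu: int, reg: int) -> int:
    """Read a 64-bit MSR on one logical CPU via the Linux msr driver."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)


def write_msr(cpu: int, reg: int, value: int) -> None:
    """Write a 64-bit MSR on one logical CPU via the Linux msr driver."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), reg)
    finally:
        os.close(fd)


# Hypothetical usage (needs root; not executed here):
#   for cpu in range(os.cpu_count()):
#       old = read_msr(cpu, MSR_PREFETCH_CONTROL)
#       write_msr(cpu, MSR_PREFETCH_CONTROL, with_prefetchers_disabled(old))
```

Doing a read-modify-write rather than writing a bare 0xF keeps any other bits in the register as they were, and looping over every logical CPU avoids having to reason about the scope of the MSR.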

Why is that?

McCalpinJohn · ‎06-10-2024

I have seen the same thing on SKX/CLX processors.

My suspicion is that the Next-Page-Prefetcher (which does not have a documented method of disabling) acts to generate a small number of L2 hardware prefetches.

Looking at the average counts over multiple trials of one specific test case, I see an average reduction of 974:1. The next-page prefetcher should generate one or more prefetches every 64 cache lines (one 4KiB page) for this contiguous test case -- obviously they cannot all be generating L2 hardware prefetches, or the observed counts would be much higher (almost 16x).
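The "almost 16x" figure follows from simple arithmetic, sketched below. The 4KiB page and 64-byte line sizes are the usual x86-64 values; the baseline of roughly one L2 prefetch per cache line with the prefetchers enabled is an assumption made only to illustrate the comparison, not a measured value.

```python
PAGE_BYTES = 4096          # 4KiB page
LINE_BYTES = 64            # cache line size
OBSERVED_REDUCTION = 974   # average reduction reported above (974:1)

# One next-page prefetch per page means one per 64 cache lines.
lines_per_page = PAGE_BYTES // LINE_BYTES

# ASSUMPTION: about one L2 prefetch per cache line when prefetchers are on.
# One counted prefetch per page would then give a reduction of only 64:1.
expected_reduction_if_one_per_page = lines_per_page

# How far the observed reduction exceeds that per-page estimate.
shortfall_factor = OBSERVED_REDUCTION / expected_reduction_if_one_per_page

print(f"cache lines per page: {lines_per_page}")
print(f"observed / per-page estimate: {shortfall_factor:.2f}x")
```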

One reason to have a Next-Page-Prefetcher is to generate a reference to the next (virtual memory) page early, so that if the address misses in the TLB, the TLB walk will be started earlier and perhaps completely overlapped with the transfer of the rest of the data from the current page. For contiguous virtual memory accesses, each cache line fetched by a Page Table Walk will contain 8 Page Table Entries, so each Page Table Walk will cover 8 pages, or 512 cache lines.
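The coverage numbers in the paragraph above can be checked with a short calculation. This is a sketch assuming 8-byte page table entries, 4KiB pages, and a buffer that starts on a boundary covered by a fresh cache line of page table entries; pte_line_index is a hypothetical helper, not anything the hardware exposes.

```python
PAGE_BYTES = 4096   # 4KiB page
LINE_BYTES = 64     # cache line size
PTE_BYTES = 8       # ASSUMPTION: 8-byte x86-64 page table entries

ptes_per_line = LINE_BYTES // PTE_BYTES             # PTEs in one cache line
lines_per_page = PAGE_BYTES // LINE_BYTES           # data cache lines per page
data_lines_per_pte_line = ptes_per_line * lines_per_page
bytes_per_pte_line = ptes_per_line * PAGE_BYTES     # data covered per PTE line


def pte_line_index(offset: int) -> int:
    """Index of the PTE cache line covering a byte offset into the buffer."""
    return (offset // PAGE_BYTES) // ptes_per_line


# Distinct PTE cache lines touched by a contiguous 1 MiB sweep.
sweep_bytes = 1 << 20
pte_lines_touched = pte_line_index(sweep_bytes - 1) + 1

print(ptes_per_line, data_lines_per_pte_line, bytes_per_pte_line)
print(pte_lines_touched)
```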

SPECULATION: I can imagine that the designers might have optimized the Next-Page-Prefetcher by taking into account the 8 Page Table Entries per cache line and only generating a HWPF to pre-load the next cache line full of TLB entries once every 8 4KiB pages. It should be possible to test this, but I have not looked at the details -- the transaction counts are too small to impact performance.