Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

How does the next-page hardware prefetcher (NPP) work?

vny
Novice

L2/L3 hardware prefetcher behaviors [are well documented (their aggressiveness, their triggers, etc.)][1]. However, very little is known about the next-page prefetcher that supposedly pre-loads the TLB when access patterns are sequential. Other than this [Intel forum discussion][2], I have not been able to find anything of significance, not even in the Intel Software Developer's Manual. I am interested in knowing the following about the prefetcher:

  1. How do I check if this feature is available on my processor?
  2. How is it triggered? Is it triggered by accesses to the first cache line of contiguous pages in the virtual address space? (The discussion above says page prefetches are made when virtual addresses are accessed sequentially.)
  3. How aggressive is the prefetching? That is, how many pages are prefetched ahead of the current page access? Does this depend on the page size in use (4 KiB, 2 MiB, etc.)?

The Intel Optimization Reference Manual (Section 4.1.7, "Cache and Memory Subsystem") sheds a little light on this:


There are three independent L1 prefetchers. One does a simple next-line fetch on DL1 load misses. The second is an instruction pointer based prefetcher capable of detecting striding access patterns of various sizes. This prefetcher works in the linear address space so it is capable of crossing page boundaries and starting translations for TLB misses. The final prefetcher is a next-page prefetcher that detects accesses that are likely to cross a page boundary and starts the access early. L1 data misses generated by these prefetchers communicate additional information to the L2 prefetchers, which help them work together.


[1]: https://safari.ethz.ch/architecture/fall2020/lib/exe/fetch.php?media=onur-comparch-fall2020-lecture18-prefetching-afterlecture.pdf
[2]: https://community.intel.com/t5/Software-Tuning-Performance/Is-next-page-prefetcher-available-on-Haswell-microarchitecture/td-p/1100229

McCalpinJohn
Honored Contributor III

I have also never found any documentation for the next-page-prefetcher (NPP), but I have learned a little bit about it through directed testing.  I probably did these tests on the Skylake Xeon processors...

As far as I can tell, the NPP only fetches one cache line from the next (virtually-addressed) page. This has a negligible influence on bandwidth by itself, but it does have two important side effects (a sketch of a directed test for the first one follows the list):

  • When using 4KiB pages, the NPP prefetch will cause a speculative page table walk in the event of a TLB miss.  This eliminates almost all of the demand TLB miss page table walks.
  • The NPP gets the L2 HW prefetcher started sooner.   This is a modest effect, but is one of the only options to improve performance as the number of outstanding misses required to tolerate the memory latency keeps increasing (and Intel is still limited to HW prefetching within 4KiB pages). 
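
For anyone who wants to try this kind of directed test themselves, here is a minimal sketch of one possible version (my own illustration under stated assumptions, not the methodology actually used above). It reads one byte per cache line from a large buffer of 4 KiB pages, either in sequential page order, where a next-page prefetcher can translate the next page early, or in shuffled page order, where it cannot, so the two runs can be compared under a TLB page-walk counter. The buffer size and the perf event name are assumptions and will vary by machine:

```c
/* npp_tlb_walk.c - sketch: read one byte per cache line from many 4 KiB
 * pages, either in sequential page order (so a next-page prefetcher can
 * start the next translation early) or in shuffled page order (so it
 * cannot). Compare demand dTLB page-walk counts between the two runs,
 * e.g. with:
 *   perf stat -e dtlb_load_misses.walk_completed ./npp_tlb_walk seq
 *   perf stat -e dtlb_load_misses.walk_completed ./npp_tlb_walk shuf
 * Event names differ across microarchitectures; check `perf list`. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE   4096UL
#define LINE   64UL
#define NPAGES (1UL << 16)                  /* 256 MiB of 4 KiB pages */

int main(int argc, char **argv)
{
    uint8_t *buf   = malloc(NPAGES * PAGE);
    size_t  *order = malloc(NPAGES * sizeof *order);
    if (!buf || !order) return 1;
    memset(buf, 1, NPAGES * PAGE);          /* fault every page in up front */

    for (size_t i = 0; i < NPAGES; i++)     /* identity permutation = sequential */
        order[i] = i;
    if (argc > 1 && strcmp(argv[1], "shuf") == 0) {
        srand(1);                           /* Fisher-Yates shuffle of page order */
        for (size_t i = NPAGES - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }
    }

    uint64_t sum = 0;
    for (size_t i = 0; i < NPAGES; i++)     /* visit pages in the chosen order */
        for (size_t off = 0; off < PAGE; off += LINE)
            sum += buf[order[i] * PAGE + off];

    printf("sum=%llu\n", (unsigned long long)sum);
    free(buf);
    free(order);
    return 0;
}
```

If the NPP's speculative walks behave as described above, the sequential run should report far fewer demand page walks than the shuffled run, even though both touch the same set of pages.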
Siyabonga
Novice

To understand the next-page prefetcher in Intel processors and its behavior, here’s a detailed look at the aspects you’re interested in:

1. Checking if the Feature is Available on Your Processor

To check if your processor supports the next-page prefetcher, you can:

  • Refer to Processor Documentation: Look for your specific processor's documentation on Intel's website or in the Intel Software Developer's Manual. As you note, explicit details about the next-page prefetcher are generally absent.
  • Use the CPUID Instruction: CPUID reports many CPU features, but there is no feature flag for the next-page prefetcher. The most it can do is identify the vendor, family, and model so you can match the part against published microarchitecture descriptions (see the sketch after this list).
  • BIOS/UEFI Settings: Some prefetching features can be enabled or disabled from BIOS/UEFI settings; check whether there is an option related to hardware prefetching. Note that the documented toggles typically cover the L1/L2 prefetchers rather than the next-page prefetcher.
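
On the CPUID point, it is worth being explicit: since there is no flag for the next-page prefetcher, a CPUID-based check can only identify the part so you can look it up elsewhere. Below is a minimal sketch using GCC/Clang's `__get_cpuid`; the family/model decode follows the SDM, but the mapping from model to NPP support has to come from other sources:

```c
/* cpuid_ident.c - identify the CPU so it can be looked up in Intel's
 * documentation; CPUID itself has no flag for the next-page prefetcher.
 * Build with: gcc -O2 cpuid_ident.c */
#include <cpuid.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;

    /* Leaf 0: vendor string ("GenuineIntel" on Intel parts) */
    char vendor[13] = {0};
    __get_cpuid(0, &eax, &ebx, &ecx, &edx);
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);

    /* Leaf 1: family/model/stepping, with the extended-field adjustment
     * described for the CPUID instruction in SDM Volume 2 */
    __get_cpuid(1, &eax, &ebx, &ecx, &edx);
    unsigned stepping   =  eax        & 0xF;
    unsigned model      = (eax >> 4)  & 0xF;
    unsigned family     = (eax >> 8)  & 0xF;
    unsigned ext_model  = (eax >> 16) & 0xF;
    unsigned ext_family = (eax >> 20) & 0xFF;

    unsigned disp_family = (family == 0xF) ? family + ext_family : family;
    unsigned disp_model  = (family == 0x6 || family == 0xF)
                               ? (ext_model << 4) | model : model;

    printf("%s  DisplayFamily 0x%X  DisplayModel 0x%X  Stepping %u\n",
           vendor, disp_family, disp_model, stepping);
    return 0;
}
```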

2. Trigger Mechanism

The next-page prefetcher appears to be triggered when sequential accesses are detected in the virtual address space. More specifically:

  • Sequential Accesses: Per the Optimization Manual wording quoted in the question, the trigger appears to be an access that is likely to cross a page boundary, i.e., a sequential stream approaching the end of the current virtual page.
  • Pattern Recognition: Once such a pattern is recognized, the prefetcher appears to fetch a cache line from the next virtual page and to start its address translation early (a toy model of this presumed heuristic is sketched after this list).
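
Because the real heuristic is undocumented, about all one can do in code is model the presumption above. The toy sketch below (purely illustrative, not Intel's actual logic) flags loads that land in the last cache line of a 4 KiB virtual page, the point at which an access is "likely to cross a page boundary" in the manual's wording:

```c
/* npp_trigger_model.c - a toy software model of the presumed trigger:
 * flag loads that touch the last cache line of a 4 KiB virtual page,
 * i.e. accesses "likely to cross a page boundary". Illustration only;
 * the real hardware heuristic is undocumented. */
#include <stdint.h>
#include <stdio.h>

#define PAGE 4096u
#define LINE 64u

/* Nonzero if a load at vaddr lands in the final cache line of its page,
 * the point at which a next-page prefetch would plausibly be issued. */
static int likely_to_cross_page(uintptr_t vaddr)
{
    return (vaddr % PAGE) >= PAGE - LINE;
}

int main(void)
{
    /* Walk a made-up sequential address stream and report where the
     * modelled prefetcher would start on the next virtual page. */
    for (uintptr_t va = 0x10000; va < 0x10000 + 3 * PAGE; va += LINE)
        if (likely_to_cross_page(va))
            printf("at %#lx: start fetch/translation of next page %#lx\n",
                   (unsigned long)va,
                   (unsigned long)((va / PAGE + 1) * PAGE));
    return 0;
}
```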

3. Aggressiveness of the Prefetching

How aggressive the next-page prefetcher is (for example, how far ahead it prefetches) is not documented:

  • Prefetch Depth: The depth is not documented. The directed testing described in the reply above suggests the NPP fetches only a single cache line from the next page, rather than prefetching whole pages ahead.
  • Page Granularity: Prefetching behavior might also depend on the page size in use (e.g., 4 KiB vs. 2 MiB). With 2 MiB pages, page crossings and TLB misses are far less frequent, so any next-page effect should matter less; this is not documented, but it can be probed empirically (see the huge-page sketch after this list).
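
One empirical way to probe the page-size question is to run the same streaming read over a region backed by 4 KiB pages and over one hinted for 2 MiB transparent huge pages, then compare TLB-related counters between the two runs. Here is a sketch assuming Linux with transparent huge pages available (MADV_HUGEPAGE is only a hint, and whether the NPP itself changes behavior at 2 MiB granularity is exactly the open question):

```c
/* npp_pagesize_probe.c - sketch: stream over a 1 GiB region backed by
 * 4 KiB pages ("small") or hinted for 2 MiB transparent huge pages
 * ("huge"), then compare dTLB/page-walk counters between runs, e.g.:
 *   perf stat -e dTLB-load-misses ./npp_pagesize_probe small
 *   perf stat -e dTLB-load-misses ./npp_pagesize_probe huge */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define BYTES (1UL << 30)
#define LINE  64UL

int main(int argc, char **argv)
{
    uint8_t *buf = mmap(NULL, BYTES, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;

    if (argc > 1 && strcmp(argv[1], "huge") == 0)
        madvise(buf, BYTES, MADV_HUGEPAGE);   /* ask for 2 MiB THP backing */
    else
        madvise(buf, BYTES, MADV_NOHUGEPAGE); /* keep 4 KiB pages */

    memset(buf, 1, BYTES);                    /* populate the mapping */

    uint64_t sum = 0;
    for (size_t off = 0; off < BYTES; off += LINE)
        sum += buf[off];                      /* one read per cache line */

    printf("sum=%llu\n", (unsigned long long)sum);
    munmap(buf, BYTES);
    return 0;
}
```

Running each mode under `perf stat` with dTLB-miss and page-walk events should at least show how much TLB pressure the larger page size removes, which bounds how much a next-page TLB effect could matter.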

Reference from Intel Optimization Manual

The Intel Optimization Reference Manual (particularly Section 4.1.7, Cache and Memory Subsystem) names the next-page prefetcher, as quoted in the question, but does not go beyond that one-sentence description. More generally, the manual discusses:

  • Cache and Memory Subsystem Optimization: General strategies and recommendations to optimize memory usage and cache behavior.
  • Prefetcher Behavior: Describes different types of prefetchers (L1, L2, etc.) and how they can be optimized for performance.

Additional Resources

Given the limited direct documentation, consider these additional steps:

  • Intel Forums and Communities: Engage in Intel’s community forums or other technical forums where processor behavior is discussed. The Intel forum discussion you mentioned can be a starting point for community-driven insights.
  • Technical Papers and Benchmarks: Look for academic papers or technical benchmarks that study and analyze prefetcher behavior in Intel processors. These often provide in-depth analysis and can give clues about undocumented features.
  • Performance Monitoring Tools: Use performance counters (for example, dTLB-miss and page-walk events) to analyze how the CPU behaves under your workloads. Tools such as Intel VTune Profiler, Linux `perf`, or PCM (Performance Counter Monitor) can be useful.

Understanding this prefetcher therefore requires combining the sparse official documentation, community insight, and empirical analysis with performance counters on your specific hardware.
