I am wondering if the next-page prefetcher is available on the Haswell E5-2620 v3.
If so, at which cache level does it operate, the L2 or the last-level cache?
I am asking because I read the following in the
"Intel® 64 and IA-32 Architectures Optimization Reference Manual" (September
2015 version). Section 2.3.7 says that the "next-page prefetcher" is an
enhancement in the Ivy Bridge microarchitecture, so it is not clear whether
the Haswell microarchitecture contains this feature. Thanks.
Thanks for pointing this out. I don't recall next-page hardware prefetch being mentioned back then as one of the enhancements in Ivy Bridge; it apparently arrived along with a major loss in effectiveness of software prefetch. The other quoted new features of Ivy Bridge were covered in Intel internal presentations prior to the introduction of that CPU, although the increased bias of the Loop Stream Detector in favor of 1 thread per core may not have been a favorite of marketing people wanting to brag about hyperthreading support.
With this information about what to look for, I suppose some work with VTune might uncover whether the later CPUs inherit these features, as seems likely.
I've seen discussions about trying to keep threads distributed across all cores on Haswell while avoiding too much "over-subscription," with 2 threads on some cores.
The description of next-page prefetch seems intentionally vague. For example, does it help with a potential TLB miss, or does it make up for the interruption of hardware prefetch at page boundaries? In the latter case it might act as an extension of the streamer prefetch, and could then share the mysterious feature of fetching into L3 and conditionally into L2, depending on how many other cache line requests are outstanding.
Back in Section 2.3.5, the comment about maintaining one forward and one backward stream per page looks interesting and possibly confusing.
From those results it is very clear that my Xeon E5-2660 v3 system uses a next page prefetcher function to pre-load TLB entries for contiguous access patterns. The PAGE_WALKER_LOADS.DTLB_* events increment according to where the page table entry was found for either demand or prefetched TLB misses. In contrast, the DTLB_LOAD_MISSES.WALK_DURATION event appears to only increment for demand TLB misses and not for TLB walks due to the next page prefetcher.
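For reference, here is my reading of the Haswell encodings for the events mentioned above, as a small C table (a sketch based on the public event lists; verify against the SDM Volume 3B event tables for your model before programming IA32_PERFEVTSELx):

```c
#include <stdint.h>

/* Haswell core event encodings (event select, umask) for the events
 * discussed above. These are my reading of the public event lists;
 * double-check against the SDM before use. */
struct pmu_event {
    const char *name;
    uint8_t event;
    uint8_t umask;
};

static const struct pmu_event tlb_events[] = {
    { "PAGE_WALKER_LOADS.DTLB_L1",      0xBC, 0x11 },
    { "PAGE_WALKER_LOADS.DTLB_L2",      0xBC, 0x12 },
    { "PAGE_WALKER_LOADS.DTLB_L3",      0xBC, 0x14 },
    { "PAGE_WALKER_LOADS.DTLB_MEMORY",  0xBC, 0x18 },
    { "DTLB_LOAD_MISSES.WALK_DURATION", 0x08, 0x10 },
};

/* Build an IA32_PERFEVTSELx value: event | umask<<8 | USR | OS | EN. */
static uint64_t perfevtsel(const struct pmu_event *e)
{
    return (uint64_t)e->event
         | ((uint64_t)e->umask << 8)
         | (1ULL << 16)   /* USR: count user-mode events */
         | (1ULL << 17)   /* OS: count kernel-mode events */
         | (1ULL << 22);  /* EN: enable the counter */
}
```

The resulting value can be written to IA32_PERFEVTSEL0 (MSR 0x186) with wrmsr before running the test.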
Hi John and Tim, thanks for your reply.
I would like to get some performance counter statistics on my Haswell E5-2620 v3. May I ask what are the tools you used to collect those numbers? Thanks.
Sometimes I can get away with using "perf stat" for whole-program performance counters. It does not usually give me enough control, but it is relatively easy so I use it when I can.
Sometimes I use the Intel Amplifier XE (VTune) program. This is very convenient for initial high-level searches for "hot spots" in the code. VTune's sampling-based approach is very helpful when the compiler generates multiple assembly language versions of a loop and you can't figure out which code is actually executed at run-time.
But most of the time I have to write my own code to read the performance counters inline before and after code that I am interested in testing.
I usually use the "rdmsr" and "wrmsr" command-line tools from "msr-tools" to program the core performance counters (using the root account) before launching my program (as an ordinary user). Then I use inline assembly to insert the RDPMC instructions to read the performance counters at various spots in the program where I want to know the values.
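A minimal sketch of that inline RDPMC pattern (my own illustration, not Dr. Bandwidth's actual code): RDPMC from user space requires CR4.PCE to be enabled (on Linux, e.g., echo 2 > /sys/bus/event_source/devices/cpu/rdpmc) and the counter to have been programmed beforehand, otherwise the instruction faults.

```c
#include <stdint.h>

/* Combine the EDX:EAX halves returned by RDPMC into one 64-bit value. */
static uint64_t combine_hi_lo(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Read programmable counter 'ctr'. Requires user-space counter access
 * to be enabled (CR4.PCE, e.g.
 *   echo 2 > /sys/bus/event_source/devices/cpu/rdpmc
 * on Linux) and the counter to have been programmed via the
 * IA32_PERFEVTSELx MSRs (for example with wrmsr from msr-tools)
 * before this executes; otherwise RDPMC raises #GP. */
static uint64_t rdpmc_read(uint32_t ctr)
{
    uint32_t hi, lo;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(ctr));
    return combine_hi_lo(hi, lo);
}

/* Typical use around a region of interest:
 *   uint64_t before = rdpmc_read(0);
 *   ...code under test...
 *   uint64_t after = rdpmc_read(0);
 *   // after - before is the event count for counter 0
 */
```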
For the uncore counters I either have to run as root (my most common approach) or hack the system to give one of my group ids permission to read/write the /dev/cpu/*/msr device drivers and either the PCI device drivers for the uncore devices or /dev/mem for direct access to PCI configuration space. The latter is a particularly unsafe approach and it is not recommended except as a last resort for expert users.
I would like to repeat your experiment in that thread (https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/593830) on my Haswell E5-2620 v3 to study the NPP impact. I am using C to allocate pages in user space, so pages that are contiguous in the virtual address space may be discontiguous in physical memory.
May I ask how you control the contiguity in your "small page test", i.e., how you allocate contiguous pages for the 64MB range? Thanks. - Hui
I did not make any effort to make the 4KiB pages contiguous in physical space. This means that the next-page-prefetcher must have access to the address stream *before* it is translated from virtual to physical. The prefetches are generated by virtual addresses and then translated by the TLB. This is consistent with the next-page-prefetcher being in the core, where it is triggered by accesses to sequences of virtual addresses.
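For anyone repeating the experiment, a minimal allocation sketch (my own assumption about the setup, not Dr. Bandwidth's actual code): mmap gives a virtually contiguous region of 4 KiB pages, and touching each page faults it in before the counted access loop. Per the explanation above, physical contiguity is neither requested nor required, since the next-page prefetcher works on virtual addresses.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096

/* Allocate 'bytes' of virtually contiguous 4 KiB pages and touch each
 * page so it is faulted in before the measured access loop.
 * Returns NULL on failure. Physical contiguity is NOT guaranteed
 * (nor needed for studying the next-page prefetcher). */
static uint8_t *alloc_and_touch(size_t bytes)
{
    uint8_t *buf = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;
    for (size_t i = 0; i < bytes; i += PAGE_SIZE)
        buf[i] = 1;                /* fault in each 4 KiB page */
    return buf;
}

/* Typical use for a 64 MiB contiguous-read test:
 *   uint8_t *p = alloc_and_touch(64u << 20);
 *   ...read counters, stream through p, read counters again...
 */
```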
There is very little documentation on Intel's L1 hardware prefetchers, and very few of the performance counters provide any additional information.