Nathan_K_3
Beginner
889 Views

Hardware prefetch and shared multi-core resources on Xeon

I'm trying to understand the behavior of hardware prefetch from RAM on multi-core Xeon systems, particularly the situations in which high activity stops the prefetchers from being used.  The most detailed official description I've found is on page 2-29 of the Intel Optimization Manual:

Data Prefetch to L1 Data Cache
Data prefetching is triggered by load operations when the following conditions are met:
• Load is from writeback memory type.
• The prefetched data is within the same 4K byte page as the load instruction that triggered it.
• No fence is in progress in the pipeline.
• Not many other load misses are in progress.
• There is not a continuous stream of stores.
Two hardware prefetchers load data to the L1 DCache:
• Data cache unit (DCU) prefetcher. This prefetcher, also known as the streaming prefetcher, is triggered by an ascending access to very recently loaded data. The processor assumes that this access is part of a streaming algorithm and automatically fetches the next line.
• Instruction pointer (IP)-based stride prefetcher. This prefetcher keeps track of individual load instructions. If a load instruction is detected to have a regular stride, then a prefetch is sent to the next address which is the sum of the current address and the stride. This prefetcher can prefetch forward or backward and can detect strides of up to 2K bytes.

Data Prefetch to the L2 and Last Level Cache
The following two hardware prefetchers fetch data from memory to the L2 cache and last level cache:
Spatial Prefetcher: This prefetcher strives to complete every cache line fetched to the L2 cache with the pair line that completes it to a 128-byte aligned chunk.
Streamer: This prefetcher monitors read requests from the L1 cache for ascending and descending sequences of addresses. Monitored read requests include L1 DCache requests initiated by load and store operations and by the hardware prefetchers, and L1 ICache requests for code fetch. When a forward or backward stream of requests is detected, the anticipated cache lines are prefetched. Prefetched cache lines must be in the same 4K page.
The streamer and spatial prefetcher prefetch the data to the last level cache. Typically data is brought also to the L2 unless the L2 cache is heavily loaded with missing demand requests.
Enhancement to the streamer includes the following features:
• The streamer may issue two prefetch requests on every L2 lookup. The streamer can run up to 20 lines ahead of the load request.
• Adjusts dynamically to the number of outstanding requests per core. If there are not many outstanding requests, the streamer prefetches further ahead. If there are many outstanding requests it prefetches to the LLC only and less far ahead.
• When cache lines are far ahead, it prefetches to the last level cache only and not to the L2. This method avoids replacement of useful cache lines in the L2 cache.
• Detects and maintains up to 32 streams of data accesses. For each 4K byte page, one forward and one backward stream can be maintained.

Some questions:

This is in the Sandy Bridge section.  I haven't seen details for more recent models.  Are any of these details different for generations more recent than Sandy Bridge?

Does a "continuous stream of stores" mean one store per cycle, or is there a less stringent definition?  How many outstanding requests qualify as "Not many other load misses are in progress"?

Is either the Spatial Prefetcher or the Streamer ever prevented from bringing data to L3 due to outstanding misses, or does this apply just to L1 and L2?  Is one ever disabled but not the other?

When it says the Streamer "Detects and maintains up to 32 streams of data accesses", is this per virtual core, per physical core, per memory controller, or per socket?

More generally, what prefetch resources are shared between virtual cores on the same physical core?  Between physical cores in the same socket, or with the same System Agent?  

Are there any Performance Monitoring Counters that can be used to detect when the different hardware prefetchers have been temporarily disabled due to load?

4 Replies
TimP
Black Belt

By the end of your treatise, it appears that the forum https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring might be more useful for some of these questions, at least for discussion of related performance counters (or VTune forum for event monitoring).

The major augmentation in Haswell CPU appears to be the introduction of next page prefetching, to fill in the gap due to the prefetchers discussed here stopping at page boundaries (and lack of a Windows OS feature comparable to linux Transparent Huge Pages).

Evidently, when up to 32 streams are followed, those accesses can't be in consecutive cycles, but need only be accesses which match one of those active streams.  I don't know whether it may be intentionally ambiguous whether the allotment of 32 is per physical core or per CPU socket.  The comment about counting outstanding requests per core seems to indicate the former.  However, I noticed on Ivy Bridge that last level cache effectiveness appeared to decrease when more than 10 threads per socket were running.  It seems also ambiguous as to when a stream may be discarded (e.g. when an access occurs which doesn't match one?).

I haven't encountered the number of prefetch streams as being as significant a limiting factor as the number of fill buffers (10 per physical core, ever since Woodcrest).

The recommendation charts offer more categories of applications where they recommend disabling the spatial prefetcher before any other.  This seems almost self-explanatory, due to the possible effect of doubling memory traffic for widely separated read accesses, as well as the likely thrashing effect when a thread on one core is reading a cache line adjacent to one which is updated on another core.  Disabling the streamer would appear to be useful only in the case where false sequences are detected too frequently.

It may be important whether your interest is in how the answers relate to application performance tuning, which might make the question more relevant to this forum or the performance optimization one.  At -O3, Intel compilers have some heuristics about fusing and splitting loops which relate to optimizing numbers of streams per loop, which may make it useful to arrange your source code so as to allow the compiler freedom to make these adjustments.  It has been said that the compiler doesn't try to make adjustments (e.g. in opt-streaming-stores) which would depend on how many threads are running, as that normally is a run-time variable.

Nathan_K_3
Beginner

Tim P. wrote:
The major augmentation in Haswell CPU appears to be the introduction of next page prefetching, to fill in the gap due to the prefetchers discussed here stopping at page boundaries (and lack of a Windows OS feature comparable to linux Transparent Huge Pages).

I hesitated to ask it as part of this question, but does this mean that the prefetcher is actually limited by page boundaries rather than 4096B boundaries?  Or are you talking only about TLB prefetch rather than the Streamer prefetch?

Evidently, when up to 32 streams are followed, those accesses can't be in consecutive cycles, but need only be accesses which match one of those active streams.

I'm presuming you mean that the accesses "don't have to be in consecutive cycles" rather than that they can't be?  Or is there a problem with having them consecutive?

I don't know whether it may be intentionally ambiguous whether the allotment of 32 is per physical core or per CPU socket.

I often get the same impression when I read Intel's docs. :)

However, I noticed on Ivy Bridge that last level cache effectiveness appeared to decrease when more than 10 threads per socket were running.

Thanks, this is exactly the sort of thing I'm wondering about.

I haven't encountered the number of prefetch streams as being as significant a limiting factor as the number of fill buffers (10 per physical core, ever since Woodcrest).

Yes, since I'm usually optimizing for single-core performance, this is my experience as well.  But the line fill buffers are definitely per core, which I'm hoping means they won't become a bottleneck for multi-core performance.  And I had sort of forgotten that they are shared between virtual cores when hyperthreading, so thanks for the reminder.

The recommendation charts offer more categories of applications where they recommend disabling...
 

I was more wondering about cases where they were disabled on-the-fly by the processor in response to load, and whether both were always disabled together.

It may be important whether your interest is in how the answers relate to application performance tuning, which might make the question more relevant to this forum or the performance optimization one.

I debated between asking here or there, and decided I wanted to focus as much as possible on the hardware side, especially if there were differences in recent generations.  The actual code will be tuned at the assembly level, so the compiler optimizations will be interesting only in so far as they indicate the behavior of the underlying processor.  

Thanks!

TimP
Black Belt

The text you quoted states that cache lines can be prefetched only within the same 4096B page.  That does leave open the question of whether the boundaries (and maximum stride for prefetch) are extended by use of huge pages, or whether huge pages benefit only by avoiding TLB misses.   You might ask on performance tuning whether anyone has checked into this.  My impression is that it's more than just a TLB question, but I didn't investigate thoroughly with performance counters.  It could get complicated, including analyzing the case of a TLB miss on data that is already in cache (which was handled poorly on the original Intel64 CPUs but of necessity became much better).

I have an application I worked on recently which performed 25% better on linux than Windows on a Nehalem CPU, presumably showing the benefit of THP (loops involving large strides speeding up more than 2x).  Some of the Windows bottlenecks were alleviated on Haswell, presumably by hardware next page prefetch, others not so much.

I don't believe there's any problem with how many cycles apart the strided accesses occur (as long as there aren't too many intervening memory events), other than the question of whether the automatic hardware adjustment of prefetch distance shortens up too much when data are accessed rapidly.

As you say, tuning 1 core at a time appears to work well when carried over to multi-thread, as far as per-core limits on streams and fill buffers are concerned.

I suppose if the automatic dropping of prefetch into lower cache levels is done in part to avoid spurious capacity misses, the hardware may stop multiple varieties of prefetch together, thus maybe giving you the incentive to disable the one which doesn't help your application, in the hope of keeping the other in force longer.

I'd hate to be coding at low level in an application which is pushing the boundaries of numbers of streams and might need adjustment for various target CPUs.

McCalpinJohn
Black Belt

The L2 prefetchers are not going to be able to prefetch outside of a 4KiB page because that is the limit of contiguous memory for 4KiB pages and because there is no way for the L2 HW prefetchers to know that an access is to a larger (e.g., 2MiB) page....  If I recall correctly, IBM POWER processors provided some extra information that allowed the HW prefetchers to fetch beyond small page boundaries, but I don't recall how that information was conveyed....

So Huge (2MiB) pages don't change the way the L2 HW prefetchers work, but on gen 1 Xeon Phi (KNC), Huge Pages are very helpful for *software* prefetches (which are dropped on the KNC if they require a page table walk). 

I have asked many of the same questions that were posed in the initial note, but answers are very hard to come by.  The combination of minimal documentation, extremely complex implementations, and bugs in the hardware performance counters make answering these questions nearly impossible.   There are some other forum threads that look at L2 HW prefetches to L2 vs L2 HW prefetches to L3 in the context of a simple single-stream code.  Even that case is too complex to understand in detail, so multi-stream cases are even less likely to make sense.  In another case, I found that the rate of L2 prefetches was extremely high immediately after the core returned from the 1 millisecond timer interrupt, and then decreased erratically over the next millisecond.  Not likely to be comprehensible without the processor RTL....

There are a few pieces of incomplete information available....  For example, it is clear that the L2 streaming prefetcher is a per-core resource that tracks cache line addresses accessed in the N most recently accessed 4KiB pages.  It is not clear whether the "32 streams" refers to (1) tracking streams on 32 independent pages, (2) tracking forward and backward streams for 16 independent pages, (3) tracking forward and backward streams on somewhere between 16 and 32 pages depending on whether the pages have both forward and backward streams.    In any case it is easy to set up an experiment that accesses "too many" 4KiB pages and flushes the L2 HW prefetch filters, so you get no L2 HW prefetches. 

 
