How to control the four hardware prefetchers in L1 and L2 more flexibly?

hiratz · ‎02-24-2017

Hi Intel Experts,

I'm currently doing some research related to Intel's prefetchers. I can write to an MSR (address 0x1a4) to turn any of the four available hardware prefetchers (DCU IP, DCU next-line, L2 adjacent cache line, and L2 streamer) on or off, but I can't do more without further information. I would appreciate it very much if you could help answer some of my questions. They have confused me for a long time, and I have found no clear answers in Intel's manuals.

1 Does the L1 Icache itself have a prefetcher? (I know the IP and next-line prefetchers are associated with the L1 Dcache.) Does the TLB have a prefetcher?

2 What triggers the prefetchers: a miss or an access?

3 As we all know, many academic prefetching papers use the concepts of "prefetch degree" and "prefetch distance". Do Intel's prefetchers have such parameters? If so, can we adjust them by writing to some MSRs? This matters for the IP and streamer prefetchers. As I understand it, the degree of the next-line and adjacent-line prefetchers should be 1, but I don't know the degrees of the IP and streamer prefetchers.

4 If there is no fixed prefetch degree, is it possible that the prefetchers use a dynamic degree or some other dynamic control?

5 For the streamer prefetcher, are there any performance events that count the number of streams detected?

6 I want to collect statistics on how many cache lines are brought into each cache (L1, L2 or LLC) by prefetch requests and by demand requests, respectively. What percentage of the cache lines that are hit were brought in by prefetching? In other words, I want to measure "prefetch accuracy" and "prefetch coverage", which are the academic terms. But none of the currently available performance events related to the prefetchers seem to provide this information.
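For reference, the two metrics named in question 6 are commonly defined as ratios over counts of "useful" prefetches (prefetched lines that a later demand access actually hit). A minimal sketch of those definitions follows; the function names and the input counts are hypothetical, since the question is precisely that the documented events do not expose such counts directly.

```python
def prefetch_accuracy(useful_prefetches, issued_prefetches):
    """Fraction of issued prefetches whose line was later used by a demand access."""
    if issued_prefetches == 0:
        return 0.0
    return useful_prefetches / issued_prefetches


def prefetch_coverage(useful_prefetches, remaining_demand_misses):
    """Fraction of the would-be demand misses that prefetching removed.

    The baseline miss count is approximated as the misses that still occur
    plus the misses that useful prefetches turned into hits.
    """
    baseline_misses = useful_prefetches + remaining_demand_misses
    if baseline_misses == 0:
        return 0.0
    return useful_prefetches / baseline_misses
```

With these definitions, a prefetcher that issues 100 prefetches of which 75 are used, while 25 demand misses remain, has accuracy 0.75 and coverage 0.75.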

Thank you in advance!

hiratz · ‎02-24-2017

3 more questions:

7 I know there is an MSHR called the line fill buffer (LFB) in L1. Are there similar MSHRs in L2 and the LLC? If so, how many entries does each have? Also, after the L2 prefetchers generate prefetch addresses and before they send them to the next cache level (LLC), do the prefetchers need to check the MSHRs and the L2 cache to filter out addresses that are already in an MSHR (pending miss or pending prefetch request) or in the cache?

8 If I want the lines prefetched by the L2 prefetchers to go directly into L2 without being stored in the LLC, that is, to bypass the LLC, is there any mechanism to do so? I guess not, because the LLC is inclusive.

9 If I understand correctly, the IP prefetcher is similar to the stride prefetchers used in academia, right? For the next-line and adjacent-line prefetchers, I'm still confused about the difference ("next" and "adjacent" look the same), because they seem to do similar things: prefetch the neighbouring cache line. Or does the adjacent-line prefetcher also consider the negative access direction?

McCalpinJohn · ‎02-27-2017

There is not much formal documentation of Intel's prefetchers. Most of the information that is available is implicit in various documents, the most important of which are:

The Intel Optimization Reference Manual (document 248966, the latest edition I know of is revision 033, dated June 2016).
- Information is scattered through the whole document, but is mostly high-level comments, rather than implementation details.
- Chapter 2 provides some useful model-specific comments concerning the number of pages that the HW prefetchers can track.
- Chapter 7 (e.g., Section 7.5.3) provides some measurements of HW prefetch behavior on different processors.
- Chapter 16 (16.2.8.2) provides some information about how the Xeon Phi x200 (Knights Landing) hardware prefetchers behave.
Volume 3 of the Intel Architecture Software Developer's Manual (document 325384, the latest revision is 060, dated September 2016).
- Information is also scattered throughout this document, but Chapters 4, 11, 17, 18, 19, and 35 have the most relevant information.
- The details of the transaction types for the OFFCORE_RESPONSE events described in Chapter 18 are important.
  - It can be useful to compare and contrast the bit descriptions for all processor generations that support these events.
- The details of some of the other performance counter events described in Chapter 19 are important.
For the Xeon E5 processors, there is a fair amount of additional information implicit in the "Uncore Performance Monitoring Guides" for the four generations of processors that have been released.
- These documents are rarely updated, and they definitely have some mistakes, so it is a good idea to look at all four of them together.
  - Xeon E5 v1 is document 327043
  - Xeon E5 v2 is document 329468
  - Xeon E5 v3 is document 331051
  - Xeon E5 v4 is document 334291
  - Some of these are not easily found by document number, so searching for "uncore performance monitoring" on intel.com may also be required.
- Most of the interesting information is implicit in the "opcode match" for the CBo events, the HA events, and the QPI Link-Layer events.
Intel has a lot of patents in the area of prefetching technology.
- The big problem with these is that there is no way to know whether the invention described in a particular patent is actually used in a system.
- The patents are mostly helpful for providing help with the vocabulary used in the references above, and for broadening one's ideas about how the prefetchers might be implemented.

For Xeon and Xeon Phi x200 processors, I don't know of any controls that are more fine-grained than the "on" and "off" controls in MSR 0x1a4.
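As an illustration of those on/off controls, here is a minimal sketch for Linux. The bit assignments (bit 0 = L2 streamer, bit 1 = L2 adjacent line, bit 2 = DCU next-line, bit 3 = DCU IP, with 1 meaning "disabled") come from Intel's public disclosure of this MSR for Nehalem through Broadwell class cores, not from this thread, so treat them as an assumption to verify for your processor. The helper names are hypothetical. Accessing `/dev/cpu/N/msr` requires root and the `msr` kernel module.

```python
import os
import struct

MSR_PREFETCH_CONTROL = 0x1A4

# Assumed layout: a set bit DISABLES the corresponding prefetcher.
DISABLE_BITS = {
    "l2_streamer": 0,
    "l2_adjacent": 1,
    "dcu_nextline": 2,
    "dcu_ip": 3,
}
ALL_BITS_MASK = 0xF


def disable_mask(*names):
    """Build the low four bits that disable the named prefetchers."""
    mask = 0
    for name in names:
        mask |= 1 << DISABLE_BITS[name]
    return mask


def enabled_prefetchers(msr_value):
    """List the prefetchers whose disable bit is clear in msr_value."""
    return [n for n, bit in DISABLE_BITS.items() if not (msr_value >> bit) & 1]


def read_msr(cpu, reg=MSR_PREFETCH_CONTROL):
    """Read a 64-bit MSR on one logical CPU through the Linux msr driver."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)


def write_msr(cpu, value, reg=MSR_PREFETCH_CONTROL):
    """Write a 64-bit MSR on one logical CPU through the Linux msr driver."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), reg)
    finally:
        os.close(fd)


def set_disabled(cpu, *names):
    """Read-modify-write so that bits outside the low four are preserved."""
    value = (read_msr(cpu) & ~ALL_BITS_MASK) | disable_mask(*names)
    write_msr(cpu, value)
```

For example, `set_disabled(0, "dcu_ip", "dcu_nextline")` would leave only the two L2 prefetchers active on logical CPU 0. The setting applies per core, so it has to be repeated for every CPU of interest.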

Everyone is free to experiment with systems and draw conclusions about how the prefetchers work. Some of my speculations appear in various Intel Developer Forum posts. You may be interested in the posts at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/592464. In the experiments I discuss there, I found that both the Xeon E5 v1 (Sandy Bridge) and Xeon E5 v3 (Haswell) generate L2 hardware prefetches based on sequences of *L2 accesses*, and don't require any *L2 misses*.

The dynamic nature of the hardware prefetchers makes it extremely difficult to draw general conclusions from performance counter measurements:

The hardware prefetchers restart after every 4KiB page. They start from nothing, get very aggressive, and then stop -- all within every 64-cache-line range.
The L2 hardware prefetchers bring some lines into the L3 and some lines into the L2. The decision is based on a variety of obvious factors that are mentioned in some of the documents above, but even if we knew the exact formulas used it would not help, because the inputs to the formulas (such as the instantaneous number of transactions in flight of each class, or the weighted running average number of transactions) are not visible to users.
- An example of this is provided in the post I referenced above -- the L2 hardware prefetchers are more aggressive after returning from the scheduler interrupt handler, and get less aggressive as my code ramps up the number of demand loads that hit in the L2.
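To make the "64 cache line range" concrete: with 64-byte lines, a 4KiB page holds 4096 / 64 = 64 lines, so a prefetcher that cannot cross a page boundary has at most 64 lines to learn a pattern and act on it. A small sketch of the address arithmetic, assuming 64-byte lines and 4KiB pages (the function names are hypothetical):

```python
PAGE_BYTES = 4096
LINE_BYTES = 64
LINES_PER_PAGE = PAGE_BYTES // LINE_BYTES  # 64 lines per 4KiB page


def page_and_line(addr):
    """Split an address into (4KiB page number, line index within the page)."""
    return addr // PAGE_BYTES, (addr % PAGE_BYTES) // LINE_BYTES


def crosses_page(addr_a, addr_b):
    """True if two addresses lie in different 4KiB pages."""
    return addr_a // PAGE_BYTES != addr_b // PAGE_BYTES
```

Because the low 12 address bits are the same in the virtual and physical address for 4KiB pages, the line index computed this way from a virtual address also holds for the physical address.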

hiratz · ‎02-27-2017

Thank you for your detailed answers, John. Your last point makes me feel a little sad: "The dynamic nature of the hardware prefetchers makes it extremely difficult to draw general conclusions from performance counter measurements". This means I can't do much with Intel's current prefetchers ... Anyway, I will read about your experiments, check the documents you mentioned, and try to find some useful information.

Three more quick questions about performance events (mainly from Chapter 19 of the "Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide, Part 2", order number 253669-059US, June 2016):

1 I noticed that some events that are listed for earlier microarchitectures such as Core, Nehalem and Sandy Bridge have disappeared from the lists for recent microarchitectures such as Haswell, Broadwell and Skylake. I ran some experiments on Broadwell/Haswell with events that are only listed in the chapters for earlier microarchitectures and got values that look reasonable.

So are these earlier events now invalid, or are they just undocumented but still usable on later microarchitectures?

For example:

For the event "4EH" group, there are three items, with masks 01H, 02H and 04H respectively, in Chapter 19.7 (Nehalem), but only one item appears in 19.6 (Sandy Bridge), and its meaning is the same as that of its counterpart in Chapter 19.7. In Chapters 19.4 (Haswell), 19.3 (Broadwell) and 19.2 (Skylake), I don't see any events marked "4EH", so I'm not sure whether I can still use them. Other such events include "F1H", "F2H", and so on. For event "F2H", I found that the mask bit definition in Chapter 19.5 (Ivy Bridge) differs from the definition in Chapter 19.4 (Haswell). Moreover, this event simply disappeared in Chapters 19.3 (Broadwell) and 19.2 (Skylake).

Another example is the events MEM_LOAD_RETIRED.L1D_LINE_MISS and MEM_LOAD_RETIRED.L2_LINE_MISS (they are different from MEM_LOAD_RETIRED.L1D_MISS and MEM_LOAD_RETIRED.L2_MISS). These two events are only listed in Chapter 19.10 (Core microarchitecture) and are never mentioned for later architectures. They give the real number of miss requests sent on the bus between cache levels (they are not incremented for later misses that hit a pending miss in the MSHR) and are very useful. Unfortunately, I don't know whether Intel has removed these events in the recent microarchitectures.

2 After a series of observations and experiments, I can guess and verify that the mask of event "24H" has the following bit distribution:

| Event | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|
| 24H | l1-pf-en? | hit | miss | l2-pf | l1-pf | instruction | store | load |

But I am not very clear about the function of bit 7. As you can see, I am just guessing that bit 7 is related to control of the L1D prefetchers. It looks like an enable bit, but I'm not sure. Can you tell me, if you know? Or could you spend a little spare time testing it when you are not busy? Thank you!
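A small sketch of how the guessed layout above can be used to read or build umask values for experiments. The bit names are the guesses from the table, not Intel documentation, and the helper names are hypothetical. The raw-event string follows the Linux `perf` convention for Intel core events, where the event select is in bits 0-7 and the umask in bits 8-15.

```python
L2_RQSTS_EVENT = 0x24

# Guessed meaning of umask bits 0..7 for event 24H (from the table above).
GUESSED_UMASK_BITS = [
    "load",         # bit 0
    "store",        # bit 1
    "instruction",  # bit 2
    "l1-pf",        # bit 3
    "l2-pf",        # bit 4
    "miss",         # bit 5
    "hit",          # bit 6
    "l1-pf-en?",    # bit 7 (purpose unclear)
]


def decode_umask(umask):
    """Return the guessed names of all bits set in an 8-bit umask."""
    return [name for bit, name in enumerate(GUESSED_UMASK_BITS) if (umask >> bit) & 1]


def encode_umask(*names):
    """Build a umask from guessed bit names."""
    umask = 0
    for name in names:
        umask |= 1 << GUESSED_UMASK_BITS.index(name)
    return umask


def perf_raw_event(event, umask):
    """Format an event/umask pair as a perf raw event, e.g. 'r4124'."""
    return f"r{umask:02x}{event:02x}"
```

For example, `perf_raw_event(L2_RQSTS_EVENT, encode_umask("l2-pf", "miss"))` gives `r3024`, which could then be passed to `perf stat -e` to test what that combination counts.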

For the mask bit distribution of "F2H", my verified results are as follows:

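| Ivy Bridge and before | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|
| F2H (quite sure) | | | | | l2-pf-dirty | l2-pf-clean | demand-dirty | demand-clean |

| Haswell and after | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|
| F2H (not very sure) | | | | | l2-pf | demand | dirty | clean |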

3 Some event descriptions look unreasonable. For example, "L2_TRANS.L2_WB" is described as "L2 writebacks that access L2 cache." This is weird. How can that happen? Writebacks generated by L2 should access the L3 cache, not the L2 cache. Or have I misunderstood this?

Thanks

Best

McCalpinJohn · ‎02-27-2017

It is often hard to tell why Intel drops the documentation for some performance counter EventSelect or Umask values.

In some cases it is because the event is still there, but does not work properly on the newer system. Sometimes it did not actually work on the processor for which it was documented, but the documentation was only updated for the newer processors. (I think that some of the bits in the OFFCORE_RESPONSE auxiliary registers may fit into this category.)
In some cases the event has actually been removed -- such as the 0x10 and 0x11 floating-point events from Sandy Bridge and Ivy Bridge that were disabled on Haswell.
In some cases the event has been changed, but the new event has not yet been disclosed. For example, it is common for vendors to implement features in processors that are not disclosed or made accessible until a later processor generation. This gives the vendor a much stronger path to getting the feature fully debugged by the time it is made accessible. I have seen some cases where it looks like a performance counter event is being withdrawn for a generation before the number is re-used for a new purpose in a later processor.

I never looked at the 0x4E ("L1D_PREFETCH") events from Nehalem/Westmere. The one event that remains in Sandy Bridge does not look very useful, since it does not distinguish between new misses and LFB hits.

Sometimes the encoding of events is significantly changed on purpose. This is particularly clear in the encoding of the 0x24 L2_RQSTS events on Sandy Bridge vs Haswell. My table for Event 0x24 is similar to yours -- I don't understand exactly what bit 7 of the Umask is for either (and I am not too sure about bit 3).

The L2_TRANS.L2_WB is not confusing -- an L2 WB to L3 has to read the L2 data array to get the data, so that is certainly an access. The trickier case to understand is an L1 WB that misses in the L2. This probably does not happen very often, but it is certainly allowed by the non-inclusive L2 architecture.

hiratz · ‎02-27-2017

Your answers are very helpful to me. Thanks again! It looks like I need to use these events more carefully.

One reason I wanted to check the umask encoding of event 0x24 (L2_RQSTS) is that I found the event "L2_RQSTS.ALL_PF" does not give the numbers its description suggests ("Counts all L2 HW prefetcher requests."). I thought it should count only the prefetch requests from the L2 prefetchers (not those from the L1 prefetchers). After a series of tests, I found that the value of "L2_RQSTS.ALL_PF" is larger than the sum of "L2_RQSTS.L2_PF_HIT" and "L2_RQSTS.L2_PF_MISS" when either of the two L1D prefetchers is turned on. However, after I turn off both L1D prefetchers, its value equals the sum of L2_PF_HIT and L2_PF_MISS. Therefore, I guess that bit 3 is probably used to count requests from the L1 prefetchers.

If I remember correctly, Intel's L1 and L2 form a non-inclusive/non-exclusive hierarchy, and the L3 is inclusive, containing all the cache lines present in the L1 and L2. If so, it can indeed happen that an L1 WB misses in the L2.
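Under that hypothesis, the excess of ALL_PF over the two L2_PF counts can be read as an estimate of the requests coming from the L1 prefetchers. A minimal sketch of the check; the function name is hypothetical and the inputs are counter values you would collect yourself:

```python
def estimated_l1_prefetch_requests(all_pf, l2_pf_hit, l2_pf_miss):
    """Estimate L1-prefetcher requests as ALL_PF - (L2_PF_HIT + L2_PF_MISS).

    This only holds if the excess really comes from the L1 prefetchers,
    which is a guess. The result is clamped at zero because counters read
    at slightly different times can make the difference go negative.
    """
    excess = all_pf - (l2_pf_hit + l2_pf_miss)
    return max(excess, 0)
```

With both L1D prefetchers turned off, the estimate should stay at or near zero if the hypothesis is right.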

Best

McCalpinJohn · ‎02-28-2017

Interesting results -- thanks!

I did a set of experiments in which I read 1, 2, 3, 4, 5, ..., 64 cache lines from a set of 4KiB pages and measured the L2 demand and prefetch counters for each case. The results were mostly monotonic, but very sharply nonlinear. If I only accessed a small number of lines in each page, the number of reported L2 HW prefetches was near zero, then it suddenly jumped to being about the same as the number of lines that I actually read. The L2 HW prefetches included both "prefetch to L3" and "prefetch to L2", with a split that did not make any obvious sense. This is not surprising, since I was not really able to control the overall level of L2 and L3 "busyness" during these experiments, and Intel's documentation suggests that this is an important factor in the L2 HW prefetcher heuristics.
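The access pattern of such a sweep can be sketched as below. This only generates the byte offsets to touch; it assumes the lines read in each page are consecutive and start at the beginning of the page, which the description above does not actually specify. A real measurement would walk these offsets in C or assembly over a page-aligned buffer while reading the L2 counters. The function names are hypothetical.

```python
PAGE_BYTES = 4096
LINE_BYTES = 64
LINES_PER_PAGE = PAGE_BYTES // LINE_BYTES


def access_offsets(num_pages, lines_per_page):
    """Byte offsets that touch the first lines_per_page lines of each page."""
    if not 1 <= lines_per_page <= LINES_PER_PAGE:
        raise ValueError("lines_per_page must be between 1 and 64")
    return [
        page * PAGE_BYTES + line * LINE_BYTES
        for page in range(num_pages)
        for line in range(lines_per_page)
    ]


def sweep(num_pages):
    """One access pattern per experiment: 1, 2, ..., 64 lines per page."""
    return {k: access_offsets(num_pages, k) for k in range(1, LINES_PER_PAGE + 1)}
```

Plotting the prefetch counters against the number of lines per page is what would expose the sharp jump described above.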

It is easy enough to get the average values for access to large numbers of pages (e.g., the STREAM benchmark), but when I get different ratios of transaction types in a more complex code, I don't see any way to derive any useful information from these differences. In particular, it is not obvious whether changes in the ratios of transaction types should suggest code changes that may improve the overall throughput (either via improved latency hiding or decreased ineffective HW prefetching).

One case that does produce actionable information is when the code generates (almost) no L2 HW prefetches. We ran across this in a complex Lattice-Boltzmann code that was accessing something like 50 arrays in each inner loop. This was too many 4KiB pages for the L2 HW prefetcher to track, so by the time the code came back around to the first array, the L2 HW prefetcher had no memory that the page had been previously accessed, so it generated no HW prefetches. We found that by fissioning the loop into smaller pieces (each accessing 16 or fewer different 4KiB pages), we got the L2 HW prefetches back. The fissioned version did have to re-load some arrays, but the improved bandwidth from the L2 HW prefetchers more than made up for this extra overhead.
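The shape of that transformation can be sketched as follows. This is not the Lattice-Boltzmann code: the kernel is a hypothetical element-wise sum over many arrays, and Python only illustrates the loop structure, since the prefetcher effect itself would only show up in compiled code. The limit of 16 streams per loop is the figure from the description above.

```python
def fission_groups(arrays, max_streams=16):
    """Split the list of arrays into groups of at most max_streams arrays."""
    return [arrays[i:i + max_streams] for i in range(0, len(arrays), max_streams)]


def fused_sum(arrays):
    """Original form: one inner loop that touches every array per element."""
    n = len(arrays[0])
    out = [0.0] * n
    for i in range(n):
        for a in arrays:
            out[i] += a[i]
    return out


def fissioned_sum(arrays, max_streams=16):
    """Fissioned form: one full pass per group, so each pass walks only a
    few arrays (and therefore a few 4KiB pages) at a time."""
    n = len(arrays[0])
    out = [0.0] * n
    for group in fission_groups(arrays, max_streams):
        for i in range(n):
            for a in group:
                out[i] += a[i]  # 'out' is re-loaded once per group
    return out
```

Note that the output array counts as a stream too, so in practice each group may need to be slightly smaller than the limit. The re-loading of `out` in every pass is the extra overhead mentioned above.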

Stephanie_L_ · ‎05-23-2018

I would like to understand the impacts of setting a power cap on memory performance. I am trying to collect last level cache misses and references, and see how it impacts prefetching. Are there additional counters I should be monitoring to understand the behavior of prefetching under a power cap?

Thank you in advance!

TimP · ‎05-23-2018

I think I'll need to send this interesting discussion to a printer when I get near one. I take DCU and L1D to be synonyms and find the switching terminology slightly confusing. If you run on Linux with an application that takes advantage of transparent huge pages, much of what was said here will be affected. Is it evident that this is not a factor in this discussion? Does Skylake retain the relatively recent next-page prefetcher, which I assume has a TLB prefetching role, particularly on Windows, where there is no THP? As hinted above, dynamic hardware prefetch has a steadily increasing prefetch distance which we can't control. I suppose Skylake servers may differ more from client CPUs than in the past.