Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Disabling Prefetching on the i5

chetreb
Beginner
2,334 Views

I have been using bit 9 of Model Specific Register 0x1A0 (IA32_MISC_ENABLE) as a handle to disable hardware prefetching on my Intel Core 2 Duo. However, this handle does not work on my i5. I checked Intel's Software Developer's Manual and found that bit 9 of 0x1A0 is 'reserved' on the i5. Is there any way I can disable prefetching on an i5?
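For readers still on pre-Nehalem parts, the Core 2 approach described above can be scripted through the Linux msr driver. This is only a sketch under stated assumptions: the helper names are mine, it requires root plus `modprobe msr`, and the write should only be attempted on processors where IA32_MISC_ENABLE bit 9 is documented (it is reserved on the i5, which is the whole problem here).

```python
import os
import struct

IA32_MISC_ENABLE = 0x1A0
HW_PREFETCH_DISABLE_BIT = 9   # documented on Core 2; reserved on Nehalem/i5 and later

def with_prefetch_disabled(msr_value: int) -> int:
    """Return a copy of the MSR value with the Core 2 prefetch-disable bit set."""
    return msr_value | (1 << HW_PREFETCH_DISABLE_BIT)

def read_msr(cpu: int, reg: int) -> int:
    """Read an MSR via the Linux msr driver (requires root and 'modprobe msr')."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)

def write_msr(cpu: int, reg: int, value: int) -> None:
    """Write an MSR; on Core 2 this is how the disable bit would be applied."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), reg)
    finally:
        os.close(fd)
```

On a Core 2 one would call `write_msr(cpu, IA32_MISC_ENABLE, with_prefetch_disabled(read_msr(cpu, IA32_MISC_ENABLE)))` for each core; on an i5 the bit is reserved, so do not write it there.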

Any help with this will be immensely useful.

0 Kudos
12 Replies
Patrick_F_Intel1
Employee
Hello chetreb,

Intel has not disclosed how to disable the prefetchers on processors from Nehalem onward. You'll need to disable the prefetchers using options in the BIOS.

Pat
perfwise
Beginner
Pat,

When I disable the "prefetchers" on my SB server part, what exactly am I disabling? That has never been clearly spelled out. Can you list the prefetchers and which ones get disabled?

Thanks, Pat.

Perfwise
TimP
Honored Contributor III
http://stackoverflow.com/questions/6662140/hardware-prefetching-in-corei3 tells you where to look in the architecture manuals. The prefetchers are:

* "Hardware Prefetcher": detects accesses at a uniform stride (within a page)
* "Adjacent Cache Line Prefetch": pairs cache lines for read access
* "DCU Prefetcher": prefetches into the L1 data cache
* "IP Prefetcher": prefetches into the L1 data cache based on the stride history of individual load instructions, tracked by instruction pointer (RIP)

Sandy Bridge was designed specifically to eliminate some earlier dependencies on twiddling prefetcher settings. Applications which have multiple threads updating and reading data separated by less than 128 bytes (or possibly those which seldom use consecutive cache lines) are likely candidates for disabling Adjacent Cache Line prefetch.
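The 128-byte point above can be made concrete: the adjacent-line prefetcher treats cache lines as aligned 128-byte pairs (sectors), so data that different threads update should sit in different sectors. A minimal sketch, with illustrative names of my own, for checking whether two addresses fall in the same sector:

```python
LINE = 64          # cache line size in bytes
SECTOR = 2 * LINE  # adjacent-line prefetch works on aligned 128-byte line pairs

def same_sector(addr_a: int, addr_b: int) -> bool:
    """True if both addresses land in the same aligned 128-byte line pair,
    i.e. a read of one may trigger an adjacent-line prefetch of the other."""
    return addr_a // SECTOR == addr_b // SECTOR
```

For example, two per-thread counters placed 64 bytes apart still share a sector, while counters 128 bytes apart (and 128-byte aligned) do not; the latter layout avoids the cross-thread interference described above.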
Patrick_F_Intel1
Employee
Hello Perfwise,

As Tim mentioned, there are 4 prefetchers, though they may go by slightly different names. A recent SNB BIOS lists them as:

* MLC streamer
* MLC spatial prefetcher
* DCU data prefetcher
* DCU instruction prefetcher

The Intel optimization guide (http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf) discusses the prefetchers in section 2.1.5 and provides a pretty good description of the SNB prefetchers.

Pat
perfwise
Beginner
Pat/Tim,

I tried programming MSR 0x1A0 yesterday. I could read it, and the bits in question were not set to 1 (1 = disabled), but unfortunately the machine crashed upon touching any bits of that MSR. I'll let you know if I have any more success.

I did run some other tests, though, where I stride sequentially by a fixed amount and at each stride do 1, 2, 3 or 4 16-byte loads from the different 16-byte slots of the cache line I touch. I found that the number of allocations was excessive: 2x what it should be. I believe this is due to the next-line prefetcher being activated. I don't suppose you can tell me under what circumstances that happens?

Last question: does disabling the prefetchers in the BIOS disable "all" or just "some" of the HW prefetchers on server SB? That would be useful to know.

Perfwise
TimP
Honored Contributor III
According to my somewhat limited understanding, when the adjacent-line (spatial) prefetcher is enabled, it should trigger a prefetch of the paired cache line for any prefetch for read (not for write or read-for-ownership). On my Sandy Bridge, all 4 prefetch options are presented on the BIOS setup screen; several OEMs have their own ideas about which ones should be exposed, and any which aren't on the menu are likely to be enabled by default. The MLC streamer (formerly called the hardware prefetcher) and the spatial (adjacent-line) prefetcher are probably the most important. Certain OEMs publish tuning guides on which prefetcher options should be set for various classes of application.
McCalpinJohn
Honored Contributor III
I have not done exhaustive testing of this particular issue, but in my latency testing experiments on Westmere (Xeon 5600) and Sandy Bridge (Xeon E5) I noticed that if I *only* load odd-numbered cache lines or even-numbered cache lines, the adjacent-line prefetcher is not activated. On the other hand, if I load some odd-numbered lines and some even-numbered lines, the adjacent-line prefetcher is activated. I think this is true even if I never explicitly load adjacent lines in an odd-even pair (though I could be misremembering this detail). Fortunately the adjacent-line prefetcher is one that I can disable via the BIOS on my boxes. On my Xeon 5600 (Westmere) systems I don't have a BIOS option to disable the L1 streaming prefetcher, but it is relatively easy to outsmart since it only understands simple ascending address sequences (see Intel SW Optimization Guide, document 248966, section 2.1.5 in version 026).
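A sketch of the odd/even-line address patterns described above, useful for reproducing the observation; the helper names are mine, and the comments state the observed behavior rather than anything documented:

```python
LINE = 64          # cache line size in bytes
PAIR = 2 * LINE    # adjacent-line prefetcher operates on aligned 128-byte pairs

def even_lines_only(n_lines: int) -> list[int]:
    """Offsets touching only even-numbered cache lines; in the experiment
    above this pattern did NOT activate the adjacent-line prefetcher."""
    return [i * PAIR for i in range(n_lines)]

def odd_lines_only(n_lines: int) -> list[int]:
    """Offsets touching only odd-numbered cache lines."""
    return [i * PAIR + LINE for i in range(n_lines)]

def touches_both_parities(offsets: list[int]) -> bool:
    """True if the pattern mixes odd and even lines, which (per the
    observation above) was enough to activate the adjacent-line prefetcher."""
    parities = {(off // LINE) % 2 for off in offsets}
    return len(parities) == 2
```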
perfwise
Beginner
jdmccalpin,

What I've tested are 2 types of loops: (a) where the access pattern is completely unrolled, and (b) where the access pattern is contained in a small loop, to test the effect of the stream and RIP-based prefetchers. What I've observed is that for "single accesses to a line" in the unrolled case, for any stride, I only see 1 allocation into the L1D per line. For patterns with 3 or 4 requests per line between strides, i.e. offsets 0, 16, 32, 128, 144, 160, 256, ..., I see 2x the allocations into the L1 that I would expect from the test's construction. So it seems I'm triggering some next-line prefetcher, or a streaming prefetch into the L1D, because of 3 or more consecutive accesses to the same line. Any feedback on this? There's no public documentation of the HW prefetch hit and miss rates in the SB/IB architecture, unfortunately, and the events from NH (which I use) don't seem to be entirely reliable on it. So, to understand:

* L1 HW prefetch
  * DCU prefetcher: does it only fetch the next line, or does it try to get further ahead of the pattern? This must be what I'm triggering when I have multiple requests in ascending order to the same line.
  * RIP-based prefetcher: it cuts off at jumps in RIPs of 2000B, I believe. Also, there's no documentation on how many RIPs it can track. Any help with that?
* L2 prefetch
  * Does it always bring in the next line for HW prefetch requests? I imagine if a stride of 512B were found, it wouldn't be beneficial to bring in the next line after that, at 576. It seems the spatial prefetcher is only useful if it's tracking a stride pattern of, say, < 128B. Right?

Thanks, Perfwise
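The offset sequence described above (0, 16, 32, 128, 144, 160, 256, ...) can be generated as follows; this is only a sketch with illustrative names, not perfwise's actual test harness:

```python
def access_pattern(stride: int, loads_per_line: int, n_lines: int,
                   slot: int = 16) -> list[int]:
    """Byte offsets for the test pattern above: `loads_per_line` 16-byte
    loads at the start of each touched line, lines spaced `stride` apart."""
    offs = []
    for i in range(n_lines):
        base = i * stride
        offs.extend(base + j * slot for j in range(loads_per_line))
    return offs
```

With a 128-byte stride and 3 loads per line this reproduces exactly the sequence quoted in the post, which is the case where the excess L1D allocations were observed.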
perfwise
Beginner
Tim/Pat,

I've learned something else new; maybe you can confirm on your end. If I have a loop with a simple striding pattern, but only 1 load in that loop, which strides by 64B and hits in the L2, the RIP-based HW prefetcher does not prefetch this. However, if I have 2 loads in the loop that stride by 64B, in a simple pointer-chasing scheme where the 2nd load is 64B ahead of the first and, on the next iteration, the first load is 64B ahead of the 2nd load of the previous iteration, then the L1 HW prefetcher does catch on. Any ideas why? I'll continue investigating, but this behavior really confounded me for a while. Any clues as to why the performance is like this would be appreciated. If you like, I'll post a test illustrating what I'm talking about.

Perfwise
perfwise
Beginner
Tim/Pat,

Do you need a test from me to determine why, with a single load in a loop striding by 64B or 128B, the L1 RIP prefetcher doesn't bring the data into the L1 when it hits in the L2? I can provide one, but I find it puzzling that in this case neither my Sandy Bridge nor my Ivy Bridge system prefetches it. If I put in 2 loads, both striding, it catches that case. If I put in 1 striding load and another that just loads a stack location (static, non-striding), the striding load again isn't caught. Thought this was interesting.

Perfwise
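To make the two cases concrete, here is a sketch of the address streams each load would see; these are hypothetical helpers of mine, not the actual test, but they show that in the leapfrog case each individual load (each RIP) strides by 128B even though the two loads together cover every line:

```python
LINE = 64  # cache line size in bytes

def single_load_addrs(n_iters: int) -> list[int]:
    """Addresses seen by the lone load: a constant 64B stride per iteration
    (the case that was observed NOT to be prefetched when it hits in L2)."""
    return [i * LINE for i in range(n_iters)]

def leapfrog_addrs(n_iters: int) -> tuple[list[int], list[int]]:
    """Addresses seen by the two leapfrogging loads: each iteration load B is
    one line ahead of load A, and A lands one line past B next iteration, so
    each individual RIP strides by 128B (the case that WAS prefetched)."""
    a = [2 * i * LINE for i in range(n_iters)]
    b = [2 * i * LINE + LINE for i in range(n_iters)]
    return a, b
```

Note that the union of the two leapfrog streams is exactly the single-load stream, so the difference in prefetch behavior comes down to how the per-RIP strides are tracked, not which lines are touched.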
Patrick_F_Intel1
Employee
Hello Perfwise,

I probably don't have much time to investigate this. One of the challenges of studying prefetchers is that the events become timing-dependent. For example, perhaps the loop where you have just one access per cache line is fast enough that the line is fetched (by the explicit load) before the prefetcher can complete the fetch? Given the lack of time, I try to address issues where something is wrong... or, as folks used to say to me, "is there a real-world application which goes slower due to x?" When you say the "L1 rip prefetcher doesn't bring it into the L1", how are you determining, via an event, that the prefetcher isn't getting the data?

Pat
Roman_D_Intel
Employee

This new article from Intel describes MSR prefetcher control on Intel Core i5 and other recent processors.
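For reference, the article documents a per-core prefetcher-control MSR (0x1A4 on Nehalem and later) whose low four bits each disable one prefetcher when set. A small sketch of computing the disable mask; the bit assignments below are as I recall them from that disclosure, so verify against the article before writing the MSR:

```python
MSR_PREFETCH_CONTROL = 0x1A4  # per-core; Nehalem and later (see linked article)

# bit -> prefetcher that is DISABLED when the bit is set
PREFETCHERS = {
    0: "L2 hardware prefetcher (MLC streamer)",
    1: "L2 adjacent cache line prefetcher",
    2: "DCU (L1 data) streaming prefetcher",
    3: "DCU IP-based prefetcher",
}

def disable_mask(*bits: int) -> int:
    """Value to OR into MSR 0x1A4 to disable the selected prefetchers."""
    mask = 0
    for b in bits:
        if b not in PREFETCHERS:
            raise ValueError(f"unknown prefetcher bit {b}")
        mask |= 1 << b
    return mask
```

Disabling all four prefetchers on a core would mean OR-ing `disable_mask(0, 1, 2, 3)` (0xF) into that core's MSR 0x1A4, which finally answers the original question for the i5 without going through the BIOS.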

Roman
