Single Threaded Memory Bandwidth on Sandy Bridge

Nathan_K_3 · ‎10-18-2013

With the help of John McCalpin's comments (http://software.intel.com/en-us/forums/topic/456184), I'm finally starting to understand why Sandy Bridge performs as it does on memory benchmarks such as Stream. As I understand it, each outstanding L1 cache miss occupies a Line Fill Buffer (LFB). Since each Sandy Bridge core has only 10 LFB's, there can only be 10 memory requests in flight at any time. According to one formulation of Little's Law (Concurrency = Bandwidth x Latency, so Bandwidth = Concurrency / Latency), if we assume a fixed latency this limit on concurrency puts an upper limit on memory bandwidth.

To test this theory, I started doing measurements on the simplest benchmark I could think of: a single-threaded single-stream read. To avoid being confused by compiler optimizations, I wrote the inner loop using x64 SSE inline assembly, and verified that the generated code matched my expectation. Once I realized that each load request actually loads 2 cache lines (128 bytes) the measured read bandwidth of ~16GB/s made more sense:

128 Bytes/LFB * 10 LFB's = 1280 Bytes in flight / 16 GB/s = 75 ns latency, which seems plausible.

Reducing the size of the array, the measured bandwidth from L3 was about twice this at ~32 GB/s. At first this was confusing, because plugging this in to the formula would imply that L3 has a latency of ~35 ns, instead of the ~30 cycles (~8 ns) I expected. But since transfers from the L3 Ring Bus are 32B rather than 128B, I was also able to make this work:

32 Bytes * 10 = 320 Bytes in flight / 32 GB/s = ~9 ns, which is close enough to make sense.

Reducing the array further so it fit in L1, I measured about twice the L3 bandwidth: ~64 GB/s. This also seems to fit, as each SSE vector is 16B and reading a vector from L1 on Sandy Bridge should take 7 cycles:

16 Bytes * 10 = 160 Bytes in flight / 64 GB/s = ~2 ns = ~7 cycles, which seems remarkably close.
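
As a rough check of the three estimates above, here is a minimal Python sketch of the same arithmetic, using Little's Law in the form latency = bytes in flight / bandwidth. The function name and the assumption of exactly 10 requests in flight are illustrative only, and the bandwidths are the rounded measurements quoted above, so the results differ slightly from the hand-rounded figures.

```python
def implied_latency_ns(bytes_per_request, requests_in_flight, bandwidth_gb_per_s):
    """Little's Law rearranged: latency = bytes in flight / bandwidth.

    With bandwidth in GB/s (1e9 bytes/s, i.e. 1 byte/ns), the result is in ns.
    """
    bytes_in_flight = bytes_per_request * requests_in_flight
    return bytes_in_flight / bandwidth_gb_per_s

LFB_COUNT = 10  # assumed number of requests in flight

# (label, bytes per request, measured bandwidth in GB/s), as quoted in this post
cases = [
    ("DRAM", 128, 16.0),
    ("L3", 32, 32.0),
    ("L1", 16, 64.0),
]

for label, size, bandwidth in cases:
    latency = implied_latency_ns(size, LFB_COUNT, bandwidth)
    print(f"{label}: {latency:.1f} ns")
```

With these rounded inputs the sketch gives 80 ns, 10 ns and 2.5 ns, in the same ballpark as the ~75 ns, ~9 ns and ~2 ns quoted above.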

The fits seem almost too good to be true. I'm left with lots of questions.

First, does this summary seem correct? Am I missing something major, like other limits to concurrency? For example, is 10 the right number of LFB's? Is my presumption about 32B L3 transfers correct? Is the L3 -> register bandwidth actually constrained by the number of stops on the L3 Ring Bus rather than LFB's? And is it really the case that although you can issue two vector loads per cycle, you'll never be able to sustain 32B/cycle on Sandy Bridge even if the data is all in L1?

Then, my next questions would be about the usage of the LFB's. Is it correct that they are not consumed by hardware prefetches? That they are used for all L1 cache misses, and not just L3 cache misses? Are the LFB's also consumed for software prefetches, or can software prefetches be used to skirt the concurrency limit? Are there any other approaches that can be used to get the data from RAM to L3 or L3 to RAM without using up these buffers?

Finally, is there an ordering to access RAM that would reduce the latency further? On John's blog (http://blogs.utexas.edu/jdm4372/2010/11/09/optimizing-amd-opteron-memory-bandwidth-part-4-single-thread-read-only) he describes trying to read from already open RAM pages, but I'm not sure how much of this is AMD specific. Is there an ordering for Intel that maximizes open page accesses? The best information I've found for Intel is also from John (http://software.intel.com/en-us/forums/topic/393131) but I'm not sure how to apply it.

Thanks!

McCalpinJohn · ‎10-23-2013

For some reason the spam filter does not like me and I have been unable to post a response.

Sorry about that.

The (short) answer is that the L2 prefetchers bring the data in early, so the Line Fill Buffers don't need to be occupied for the full memory latency.
Shorter occupancy per transaction leads to more transactions per unit time.

My best results were obtained with a version that divided the loads across two independent 4 KiB pages. The resulting 17.5 GB/s corresponds to about 21 cache lines "in flight" at the nominal latency of 77 ns on my Xeon E5-2680 systems.

perfwise · ‎10-24-2013

John,

Yeah, the hw pref in the L2 latch upon a simple stride by 64 or 128 pattern, which stream does.. but they don't latch upon patterns with other strides. The L2 pref are targeted towards buffering sequential strides/streams.. while the L1 pref.. is targeted towards strided patterns primarily. It can't handle patterns unrolled which are streaming but don't have rips (instruction pointers) which stride or have more rips than the rip L1D prefetcher can track (don't know what that number is). As to memory bandwidth Intel achieves 85% of peak on SB/IB and 90% of peak on HW (Haswell). The nominal page open latency is ~62 ns on desktop 1333 parts (from the core). From the memory controller it is less.. maybe 9ns less (since the L3 memory latency is ~9ns).. and that translates to ~80 clks at the memory freq (0.667 GHz). The memory tech delivers a bandwidth of 32B per MCLK, so the # of outstanding tokens you need to keep memory running (ignoring the turnaround inefficiencies in how requests are sent and returned on the memory bus) is 80 clks * 1 cacheline/2 clks * 1 token/cacheline = 40 outstanding tokens to buffer the latency. So you not only need to keep the L2 pref busy.. you need the appropriate buffering to do so. Thought I'd put this out there.. since it's equally or more important since just having a hw pref.. doesn't imply bw.

BTW.. I ran my stream code on IB.. and I achieved 18.03 GB/s (base 10 GB here) with 1 thread of STREAM using 16B SSE movups for load and movntps for ST and sweeping through 4GB. This is some auto-parallelized version of stream on your web site in the past which I then went in and tweaked the thread assembly to do the proper thing. NO special 4K access pattern was required to achieve that. On 2T.. I get 18.9 GB/s of bandwidth.. which is 88.8% of theoretical peak on IB.

perfwise

TimP · ‎10-24-2013

I suppose the line fill buffers shouldn't be a big factor in an optimized stream benchmark. If you use a non-Intel compiler which would require you to write intrinsics to get non-temporal stores, the temporal writes you get without intrinsics cause a fill buffer to be populated with the old data ("read for ownership") and updated with the new data. As you have only a single write stream, there is no problem with running out of buffers; a completed buffer can be flushing while you work on new buffers.

The 10 fill buffer limitation becomes a problem for applications which write more than 8 or so separate streams but don't fill an entire cache line on each pass through the (possibly unrolled) loop. With the 512-bit wide architectures, where a single store ideally fills an entire cache line, the dependence on fill buffers presumably is reduced.

Read-only data shouldn't deal with fill buffers. As mentioned above, a stream benchmark would be expected to trigger both hardware strided prefetch (which stops at a page boundary), and companion cache line prefetch, giving something of the effect of doubling the cache line size as far as prefetch is concerned. This post

http://software.intel.com/en-us/articles/optimizing-memory-bandwidth-on-stream-triad

discusses the effectiveness of huge pages and the requirement to use arrays at least several times the total cache size to get meaningful results.

Nathan_K_3 · ‎10-25-2013

TimP wrote:
I suppose the line fill buffers shouldn't be a big factor in an optimized stream benchmark. ... Read-only data shouldn't deal with fill buffers.

This seems at odds with John's tests, which I interpret as saying the limited number of LFB's is the main limiting factor for single thread performance on Stream. If not LFB's and their occupancy time, what do you think causes single thread performance to be what it is?

I'm often feeling that the deeper I go down this hole, the less I know. On the bright side, I do seem to be squeezing out slightly better performance from my quasi-random fumbling. Here's some single-threaded non-OpenMP numbers for Stream 5.10 on the 3.6 GHz Sandy Bridge E5-1620 I'm working on (ICC 13.0.1 on Linux, hyperthreading off, turbo off, 4 x 8GB quad channel at 1600 MHz).

I started here: icc -mavx -Wall -O2 stream.orig.c -o stream

Function Best Rate MB/s
Copy: 13927.3
Scale: 13966.5
Add: 16442.9
Triad: 16341.2

And currently am here using inline assembly:

Copy: 17765.4
Scale: 17825.4
Add: 20226.2
Triad: 20006.8

Much of the benefit came from fiddling with the starting alignment of the buffers. For my modified algorithm, using a start address for each buffer aligned to an odd numbered cacheline (that is, divisible by 64 bytes but not 128 bytes) is close to optimal. Addresses ending with 0x00 are slow, but ending with 0x40 are fast. I'm sure the details are extremely implementation dependent, but I was surprised by how large the effect was: probably 20%.
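
To make the alignment condition concrete, here is a small Python sketch of the test for an "odd numbered cacheline" start address. The function names are made up for illustration, and the 64-byte line size is the one assumed throughout this thread; it only shows the address arithmetic, not how the buffers were actually allocated.

```python
CACHE_LINE = 64  # bytes per cache line, as assumed in this thread

def starts_on_odd_cacheline(addr):
    """True if addr is 64-byte aligned but not 128-byte aligned."""
    return addr % CACHE_LINE == 0 and addr % (2 * CACHE_LINE) != 0

def next_odd_cacheline(addr):
    """Smallest address >= addr that starts on an odd-numbered cache line."""
    line = -(-addr // CACHE_LINE)  # round up to a whole line index
    if line % 2 == 0:
        line += 1
    return line * CACHE_LINE

for addr in (0x1000, 0x1040, 0x1080, 0x10C0):
    print(hex(addr), starts_on_odd_cacheline(addr))
```

A buffer could be over-allocated by 128 bytes and its start moved forward with something like next_odd_cacheline to get this alignment.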

Prefetching is important, but you want to do as little of it as possible. For reading single streams (Copy, Scale) this worked out to 1 prefetch every 128 bytes. For reading two streams (Add, Triad) it was one prefetch every 64 bytes. Possibly because the companion line prefetch is disabled under heavy load? Prefetching to T0 was slightly faster than NTA, and both of those were slightly faster than T1 or T2. Optimal prefetch distance for me was 500-1000 bytes ahead of the main read, with performance dropping either closer or farther.

Nontemporal stores almost always came out ahead. This was mostly expected. The surprise factor to me was the utility of LFENCE (Load Fence). Issuing 1 (or 2) prefetches and 8 loads, followed by LFENCE, followed by the nontemporal stores often worked best. I don't know whether this corresponds to Line Fill Buffers or something else, but it made about as much difference as prefetching. I'm guessing it's related to competition for the memory addressing ports between the stores and the loads?

Using simpler addressing schemes helped a bit. Update your base pointer instead of using base + index * scale array-type notation. Series of instructions like "movupd -0x10(%rax),%xmm1" worked better than "movupd -0x10(%rax, %rcx, 8), %xmm1". Working backwards through memory (from the end of the buffer to the start) seemed to help. Maybe this helps to discourage an unhelpful hardware prefetch? Separately, it also allowed a slightly faster loop using the flags from "sub" for "jnz" to avoid a "cmp". Putting "sub" and "jnz" adjoining to allow micro-op fusion had a tiny gain as well.

I saw no performance difference between "movapd" and "movupd". For Copy, "movupd" vs "movdqu" made no difference. When used with XMM registers and the two argument form, "movupd" was no different than "vmovupd". I have not yet tried 256-bit YMM or 3-arg VEX. I didn't see any difference between separate Load and Math operations versus using memory operands, but I didn't explore this fully. It's likely this could produce improvement.

Instruction ordering mattered, but not in a way that was easily predictable. Is it better to load all of "a" then all of "b" and then do the math? Or to interleave? I'm not sure. A partial interleave seemed best in most cases. Reusing registers was also confusing. Sometimes 4 XMM registers reused was faster than 8 used once, and sometimes it wasn't. Using all 16 XMM registers for Copy and Add was not helpful in the cases I tried (and was difficult because ICC throws up when you try to do it --- GCC and Clang were fine).

The "inline assembly" approach was problematic, and only worked well as all or nothing. If you gave the compiler the slightest excuse, it would figure out a way to foul you up. If you slipped up and tried to initialize a variable at declaration ("double *a = b") it would figure out how to spill 3 registers to the stack and still not initialize the register you wanted. Intrinsics don't have a chance, as the compiler reorders them in absurd ways just for giggles. Straight assembly would probably have been easier, but I think I now have something that appears to work for GCC, ICC, and Clang (all on Linux).

I'm pretty sure there is room to go faster, but I still don't feel that I have enough understanding of what's happening under the lid to make any systematic improvements. The idea that you always want to be reading and writing to open pages of RAM makes sense, but how to actually do so? And are Line Fill Buffers really the limiting factor? I think what I'm finding is consistent with that, but I still have a lot of uncertainty as to the details. I'd love more info and hints if anyone has them to offer.

(relevant portion of assembly from 'objdump' attached)

--nate

McCalpinJohn · ‎10-26-2013

I don't quite understand TimP's response, but here is an attempt to bridge.

According to the Intel Optimization Reference Manual, section 2.2.5.2, the 10 Line Fill Buffers in the Sandy Bridge core are used to handle ordinary cached loads, ordinary cached stores, and streaming (non-temporal) stores. Since all traffic in and out of the L1 cache must use these buffers, the duration for which the buffers are occupied for each transaction is critical for performance.

It can be hard to distinguish cause and effect in buffer occupancy analysis -- that is why I usually start with the simplest possible case (e.g., only reads and no hardware prefetching) and slowly add complexity.

If you start from the more complex situation and then simplify it, you may come to different conclusions -- or you may express your conclusions differently.

Here is an example contrasting two approaches:

(1) I started by assuming no L2 HW prefetch. In that case, the available concurrency is limited to 10 outstanding cache line transactions per core, and this provides a sustained bandwidth limit of 10*64 Bytes per average queue occupancy. If we assume that the average queue occupancy is approximately equal to the memory latency, then we get numbers like 640 B / 77 ns = 8.3 GB/s. Since this is only 16% of the peak DRAM bandwidth of 51.2 GB/s (4 channels * 8 Bytes/channel * 1.6 G transfers/sec), none of the other possible DRAM performance limiters are plausible explanations of the limit. If you disable the L2 hardware prefetchers on your system, you should get similar results. (Or possibly lower results -- this is an upper bound on performance and there are lots of ways that you can generate code that will go slower!)

(2) If one starts by assuming that the L2 prefetchers are active and aggressive, then most of the data read by the STREAM benchmark kernels will be found in either the L3 or L2 cache. The SW optimization guide lists the L3 latency on Sandy Bridge as 26-31 cycles. Assume 31 cycles and a Xeon E5-2680 running at a Turbo frequency of 3.1 GHz to convert this to 10 ns latency. If the LFB occupancy is approximately equal to this latency, then the 10 LFBs are sufficient to manage 640 B / 10 ns = 64 GB/s. This is a lot of bandwidth, and it is based on the assumption that all of the prefetched data ends up in the L3 with none in the L2. According to section 2.2.5.4 of the SW Optimization guide, both the L2 "spatial" prefetcher and the L2 "streamer" prefetcher will bring the data into the LLC, and will also bring the data into the L2 "unless the L2 cache is heavily loaded with missing demand requests". In practice, this means that you have to use hardware performance counters to determine where the data is actually placed for any given test. If most of the data actually ends up in the L2, the latency will be closer to 12 cycles, or roughly 4 ns, giving a concurrency-limited bandwidth of 10*64/4 = 160 GB/s. This is quite a bit higher than the 100 GB/s (3.1 GHz * 32 B/cycle) peak bandwidth of the L2 cache, so performance will be limited by other factors, not by the number of L1 LFBs.
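
The two approaches above use the same bound with different occupancy assumptions. A minimal Python sketch of that arithmetic follows; the function and variable names are illustrative, and the occupancies are simply the latencies quoted above (77 ns for DRAM, 31 and 12 cycles at 3.1 GHz for L3 and L2), which is itself an assumption about how long each buffer is held.

```python
LFB_COUNT = 10    # Line Fill Buffers per Sandy Bridge core
LINE_BYTES = 64   # bytes per cache line
CORE_GHZ = 3.1    # Turbo frequency assumed above

def lfb_bound_gb_per_s(occupancy_ns):
    """Upper bound on bandwidth if every LFB is held for occupancy_ns.

    bytes / ns is numerically the same as GB/s (1e9 bytes/s).
    """
    return LFB_COUNT * LINE_BYTES / occupancy_ns

dram_peak = 4 * 8 * 1.6      # 4 channels * 8 bytes * 1.6 GT/s = 51.2 GB/s
l2_peak = CORE_GHZ * 32      # 32 bytes/cycle, about 100 GB/s

scenarios = {
    "data in DRAM (77 ns)": 77.0,
    "data in L3 (31 cycles)": 31 / CORE_GHZ,
    "data in L2 (12 cycles)": 12 / CORE_GHZ,
}

for name, occupancy_ns in scenarios.items():
    bound = lfb_bound_gb_per_s(occupancy_ns)
    print(f"{name}: at most {bound:.1f} GB/s "
          f"({bound / dram_peak:.0%} of DRAM peak, {bound / l2_peak:.0%} of L2 peak)")
```

Without rounding 12 cycles up to 4 ns, the L2 case comes out near 165 GB/s rather than 160 GB/s; either way it exceeds the L2 peak, so the conclusion is the same.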

I was using the L1 LFBs to illustrate the point that single-threaded bandwidth is concurrency limited, not because I am sure that L1 LFB occupancy is the overriding factor in STREAM performance (when L2 HW prefetch is enabled), but because the L1 LFBs are documented well enough to allow a clear and quantitative discussion. In contrast, the L2 HW prefetchers are less clearly documented and have behavior that is dependent on the instantaneous state of the system, so it is not possible to predict in detail.

In a normally-configured system, it is clear that single-threaded memory bandwidth is concurrency-limited, but it is not clear exactly what the L2 HW prefetchers are doing. I don't know exactly how many L2 HW prefetches are being generated, I don't know how the "spatial" and "streamer" prefetchers interact for code with contiguous accesses (like STREAM), and I don't know how many of the prefetches are being put in the L2 cache and how many are being put in the L3 cache. Since I know that the actual DRAM latency is about 77 ns, I can multiply the observed bandwidth by that latency to get an "effective" concurrency. The best results I got with my "ReadOnly" benchmark were in the 17.5 GB/s range, which corresponds to an average of about 21 cache lines "in flight". Obviously, there can be no more than 10 L1 Data Cache misses in flight, but I can't tell how much the L2 HW prefetchers have reduced the average LFB occupancy, so I don't know whether the number of LFBs is the limiting factor in single-threaded STREAM bandwidth. (Some of these questions can be answered with hardware performance counters.)

A few added complexities that I have probably mentioned before, but bear repeating:

(a) It is harder to estimate buffer occupancy for streaming stores, since there is no user-visible "response" from the memory hierarchy. A streaming store could be occupied for a very large fraction of the memory latency, or it could be occupied for a relatively short time until it is able to hand off the data to a buffer in a memory controller. If the number of buffers available in the memory controllers is greater than 10 (which seems likely, since there are 4 memory controllers to interleave across), then the LFB occupancy for a streaming store should remain low.
If I recall correctly, the performance of a pure streaming store loop is much higher on my Xeon E3-1270 system (4 Sandy Bridge cores with the "client" northbridge -- unified L3 and 2 DRAM channels) than on my Xeon E5-2680 systems (8 Sandy Bridge cores with the "server" northbridge -- partitioned L3 and 4 DRAM channels). It is easy to believe that the store path has lower latency/occupancy on the smaller chip -- especially since it is a single-socket design that does not have to worry about coherence with another chip.

(b) It is important to remember that both the L1 and L2 hardware prefetch engines work only within 4 KiB pages. Since the memory latency times the peak bandwidth is almost exactly 4 KiB (77 ns * 51.2 GB/s = 3.85 KiB), the "startup" behavior of the prefetchers is critical and the easier-to-analyze "asymptotic" behavior may not be particularly relevant. This is problematic because the "startup" behavior of the various hardware prefetchers is not documented in detail. It does suggest that if software prefetches are going to be useful, they will be most useful at the beginning of each 4 KiB page, and will almost certainly only get in the way once the hardware prefetchers have ramped up. That implies a lot of unrolling, and a lot of special cases for code generation depending on the relative page alignments of the data streams involved.

(c) DRAMs can run at >98% utilization if scheduled properly. (Depending on details, one typically loses 0.5%-1.5% of the DRAM cycles to refresh -- this is unavoidable and with DDR3/DDR4 DRAMs the refresh cycles cannot be overlapped with any other useful DRAM work in the same rank.) "Proper" scheduling involves grouping reads to the same rank (either to one bank or to non-conflicting banks) into relatively large blocks, grouping writes to the same rank (either to one bank or to non-conflicting banks) into relatively large blocks, and executing these blocks sequentially to avoid read/write and write/read turnaround stalls. The hardware tries to do this automagically, but there are no mechanisms available that the user can exploit to help the hardware do it better. It is possible to do this scheduling manually with a single thread, but on current hardware a single thread cannot generate enough concurrency to get close to the DRAM performance limits, so it does not actually help. With multiple threads, enough concurrency is available, but the threads have to be synchronized at extremely fine granularity. For example, suppose you wanted to read 12 KiB of array A and 12 KiB of array B into the L1 cache, then perform some arithmetic on the values and store the resulting 12 KiB of array C directly to memory using streaming stores. On my Xeon E5-2680, the chip has 51.2 GB/s bandwidth, or 6.4 GB/s per core. Assuming that we are loading array A from an open DRAM page and arrays B and C might be in different ranks, we have to synchronize after each of these 12 KiB block transfers. Loading the first 12 KiB will take each processor 1.92 microseconds, then all the cores have to synchronize before starting to load the second 12 KiB block, then all the cores have to synchronize before storing the 12 KiB of results. According to my testing with OpenMP_Bench_C_v2, an OpenMP barrier across 8 threads running on one chip takes 0.5 microseconds. This is an overhead of more than 25%, limiting performance to 75% of peak -- but I already get 74% of peak running STREAM with no attempt to optimize the scheduling at all, so it would be a wasted effort.
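
Items (b) and (c) above each rest on a short calculation, sketched here in Python. The variable and function names are illustrative; the inputs (77 ns latency, 51.2 GB/s peak, 8 cores, 12 KiB blocks, 0.5 microsecond barrier) are the figures quoted above, and the efficiency line is one possible way to account for the barrier cost, not the only one.

```python
PAGE_BYTES = 4096

# (b) Bytes that must be in flight to cover the latency at peak bandwidth.
latency_ns = 77.0
peak_gb_per_s = 51.2                      # 1 GB/s == 1 byte/ns
in_flight_bytes = latency_ns * peak_gb_per_s
print(f"(b) {in_flight_bytes / 1024:.2f} KiB in flight, "
      f"{in_flight_bytes / PAGE_BYTES:.0%} of a 4 KiB page")

# (c) Cost of a barrier after every 12 KiB block transfer.
def transfer_time_us(block_bytes, bandwidth_gb_per_s):
    """Time to move block_bytes at the given rate (1 GB/s == 1000 bytes/us)."""
    return block_bytes / (bandwidth_gb_per_s * 1000.0)

cores = 8
per_core_gb_per_s = peak_gb_per_s / cores           # 6.4 GB/s per core
block_us = transfer_time_us(12 * 1024, per_core_gb_per_s)
barrier_us = 0.5

overhead = barrier_us / block_us                     # barrier time per unit of transfer time
efficiency = block_us / (block_us + barrier_us)      # fraction of time spent transferring
print(f"(c) block {block_us:.2f} us, overhead {overhead:.0%}, "
      f"efficiency {efficiency:.0%}")
```

Counting the barrier as a fraction of total time gives roughly 79% rather than the 75% obtained by subtracting the overhead from 100%; the conclusion that manual scheduling gains little over the 74% already achieved is unchanged.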

Nathan_K_3 · ‎10-29-2013

Thanks John, these are wonderful added details.

John D. McCalpin wrote:
Since all traffic in and out of the L1 cache must use [the line fill] buffers

I've been trying to confirm that this is true. Are we certain that non-temporal stores and L1 hardware prefetches contend for the same 10 LFB's? Intel Optimization Manual 2.2.5.4 Data Prefetching says:

Data Prefetch to L1 Data Cache: Data prefetching is triggered by load operations when the following conditions are met:

Load is from writeback memory type.

The prefetched data is within the same 4K byte page as the load instruction that triggered it.

No fence is in progress in the pipeline.

Not many other load misses are in progress.

There is not a continuous stream of stores.

The "not many other load misses" and "not a continuous stream of stores" imply this, but don't actually say it.

I started by assuming no L2 HW prefetch.

I'm frequently confused by which prefetch people are referring to. The changing nomenclature and number of caches doesn't help. By "no L2 HW prefetch" I presume you mean no "Data Prefetch to the L2 and Last Level Cache", and thus no hardware prefetch from RAM at all?

Since this is only 16% of the peak DRAM bandwidth of 51.2 GB/s (4 channels * 8 Bytes/channel * 1.6 G transfers/sec), none of the other possible DRAM performance limiters are plausible explanations of the limit.

I presume your conclusion is right, but the reasoning feels a little weak. A BIOS that doesn't interleave RAM could cause all requests to come from the same channel, reducing the peak bandwidth to 25% off the bat. Add some bank conflicts in because of poor access patterns, and losing another 50% seems plausible.

(2) If one starts by assuming that the L2 prefetchers are active and aggressive, then most of the data read by the STREAM benchmark kernels will be found in either the L3 or L2 cache.

You're likely right, but I'm uncomfortable with 'most'. If the hardware prefetch from RAM must be within the same 4K page, I think this means the Streamer might often be playing catch up. We're consuming data as rapidly as it comes in, and without software prefetching each 4K page can almost be treated as a cold start: Over the course of 5 cycles, 10 loads get issued, and none of the data is in L3. We wait 70 ns (200 cycles) for the first data to be received. Every time an LFB is freed, another load request takes it. Since the docs say that hardware prefetch only occurs if "not many other load misses are in progress", it seems possible that we never escape from this state.

But I guess this can be answered by looking at the ratio of L3 cache hits vs misses? I need to start making more such measurements.

The SW optimization guide lists the L3 latency on Sandy Bridge as 26-31 cycles. Assume 31 cycles and a Xeon E5-2680 running at a Turbo frequency of 3.1 GHz to convert this to 10 ns latency. If the LFB occupancy is approximately equal to this latency, then the 10 LFBs are sufficient to manage 640 B / 10 ns = 64 GB/s.

I just tried measuring a single thread of 16B reads (no software prefetch, all hardware on) of a size that should fit in L3, and got about 22GB/s. Obviously I could be doing something wrong, but have you been able to achieve 64GB/s sustained reads from L3? If not, any theories on what the other limiting factors might be? In my first post, I guessed that it might have to do with the number of stops on the L3 ring bus or the fact that L3 transfers happen in 32 rather than 64B chunks, but neither of these feels very solid.

If most of the data actually ends up in the L2, the latency will be closer to 12 cycles, or roughly 4 ns, giving a concurrency-limited bandwidth of 10*64/4 = 160 GB/s. This is quite a bit higher than the 100 GB/s (3.1 GHz * 32 B/cycle) peak bandwidth of the L2 cache, so performance will be limited by other factors, not by the number of L1 LFBs.

I just measured read only bandwidths again for a loop of 16 MOVUPD's into a sequence of 4 registers. Assembly looks clean. For sizes that should fit in L1 I get about 38GB/s, and for L2 sizes about 30 GB/s. There are all sorts of things I could be doing wrong, but hyperthreading is off and the machine is otherwise idle. What do you get for such a test?

I was using the L1 LFBs to illustrate the point that single-threaded bandwidth is concurrency limited, not because I am sure that L1 LFB occupancy is the overriding factor in STREAM performance (when L2 HW prefetch is enabled), but because the L1 LFBs are documented well enough to allow a clear and quantitative discussion. In contrast, the L2 HW prefetchers are less clearly documented and have behavior that is dependent on the instantaneous state of the system, so it is not possible to predict in detail.

I agree. I think you've made an excellent case for an upper limit on the performance that can be expected with hardware prefetch from RAM disabled. I think you have a pretty good explanation for performance when the companion and spatial prefetch from RAM are active (companion gives you an effective 128B transfer size, and spatial doesn't help for long fast streams). I think the LFB concurrency theory also explains why adding software prefetch provides a real but limited gain (it gets you over the hardware prefetch only within 4K page hurdle). I haven't been able to make it explain performance from data already in L1, L2, or L3 without requiring just-so stories.

(Some of these questions can be answered with hardware performance counters.)

I've been having trouble getting the Uncore counters to work well for me, but I think that's just a local problem. I'll try to start measuring some things to understand this better. Are there particular metrics you think would be helpful?

It does suggest that if software prefetches are going to be useful, they will be most useful at the beginning of each 4 KiB page, and will almost certainly only get in the way once the hardware prefetchers have ramped up. That implies a lot of unrolling, and a lot of special cases for code generation depending on the relative page alignments of the data streams involved.

This seems worth trying, although I'm not sure about the principle. My thought had been that the hardware companion line prefetch is being activated by the software prefetch, bringing in 128B from RAM per software prefetch issued. But once I moved to adding stores, I think the companion line prefetch stopped and I had to prefetch in 64B increments instead. As the docs seem to say, I'd worry that the Streamer might shut down as well once things get busy. I guess the key would be to prevent the processor from feeling too busy.

There's a lot of other rich information in your response. I'll slowly try to digest it.

Thanks!

ps. Any thought on why the MFENCE seems to be important?

McCalpinJohn · ‎10-30-2013

(1) Concerning the use of the Line Fill Buffers: Intel's SW Optimization Guide section 2.2.5.2 (L1 DCache) makes it clear that the LFB's are used for ordinary cache misses and for streaming stores. L1 hardware prefetches are not mentioned explicitly, but there is a performance counter event called "LOAD_HIT_PRE.HW_PF" that counts "Not-SW-prefetch load dispatches that hit fill buffer allocated for H/W prefetch". The HW counter events are described in Vol 3 of the SW Developer's Guide, section 19.4 (for Sandy Bridge processors).

(2) Section 2.2.5.4 of the Optimization Reference Manual refers to a total of four different hardware prefetch engines. Two hardware prefetchers load data to the L1 DCache -- the "DCU prefetcher" and the "IP-based stride prefetcher". Two hardware prefetchers load data into the LLC and (optionally) to the L2 cache -- the "Spatial Prefetcher" and the "Streamer". Unfortunately the terminology used is often confusing and imprecise. When I say there is "no L2 prefetch", I mean that the "Spatial Prefetcher" and "Streamer" are disabled. I usually disable the L1 prefetchers as well when I do these experiments (because I am usually looking to measure cache miss rates in the absence of prefetch, to get an idea about temporal reuse), but for demonstrating that the bandwidth is limited by LFB occupancy it does not matter if the L1 prefetchers are disabled, since they use the same LFB resources for approximately the same amount of time.

(3) Although it is conceivable that a BIOS could mess up a configuration very badly, the fact that my systems give up to 20 GB/s per socket for a single thread (about 40% of peak) and up to 39 GB/s for 8 threads (about 76% of peak) suggests that the configuration is a good one.

(4) As a quick "sanity check", I just ran Version006 of my ReadOnly benchmark on a Xeon E3-1270 system (1 socket, 4 cores, 3.4 GHz - Turbo disabled, 2 DDR3/1333 channels) with all 16 possible settings of enabling/disabling the four prefetchers. For this code only two of the prefetchers changed the performance by more than 1%:

No HW Prefetch: avg BW = 8900 MB/s
L1 DCU streaming prefetcher = 9120 MB/s (+2.5%)
L2 Streamer = 16400 MB/s (+84.3% relative to no PF)

For this system the open page memory latency is about 54 ns, so the effective concurrency for the three cases above is about 7.5 lines with no HW prefetch, 7.8 lines with the L1 DC streaming prefetcher, and 13.8 lines with the L2 Streamer prefetcher. This last value is 77% of the peak DRAM bandwidth, so it is probably limited by various DRAM stalls more than by available concurrency.

(5) Cache bandwidth limitations are very complex. If the concurrency limit is bigger than the observed bandwidth, that just means that concurrency was not the limiter.
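
The "effective concurrency" figures in item (4) can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name is illustrative, the 54 ns latency and the three bandwidths are the values quoted above, and the peak is taken as 2 channels * 8 bytes * 1333 MT/s.

```python
LINE_BYTES = 64

def effective_lines_in_flight(bandwidth_mb_per_s, latency_ns):
    """Little's Law: concurrency = bandwidth * latency, in cache lines."""
    bytes_in_flight = bandwidth_mb_per_s * 1e6 * latency_ns * 1e-9
    return bytes_in_flight / LINE_BYTES

LATENCY_NS = 54.0                  # open page latency on the Xeon E3-1270
PEAK_MB_PER_S = 2 * 8 * 1333       # 2 DDR3/1333 channels, 8 bytes each

results_mb_per_s = {
    "No HW prefetch": 8900,
    "L1 DCU streaming prefetcher": 9120,
    "L2 Streamer": 16400,
}

for name, bandwidth in results_mb_per_s.items():
    lines = effective_lines_in_flight(bandwidth, LATENCY_NS)
    print(f"{name}: {lines:.1f} lines in flight, "
          f"{bandwidth / PEAK_MB_PER_S:.0%} of peak")
```

The same function applied to 17500 MB/s at 77 ns gives the roughly 21 lines in flight mentioned earlier for the Xeon E5-2680.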

Nathan_K_3 · ‎10-30-2013

Thanks John! More wonderful info.

Like you earlier, I also have run afoul of the broken spam filter, and am unable to reply. Initially I got a message that the submission could not be accepted because the spam filter was inaccessible, and now after several attempts I get nothing but "Your submission has triggered the spam filter and will not be accepted."

I'll try attaching it as a text file. Maybe there is some word describing prefetch that triggers it? It would be nice if someone from Intel could take a look.

ps: Hilarious: Trying to upload a file gives me the multi-page text dialog starting with "An AJAX HTTP request terminated abnormally." followed by CSS. The first few times I used it I didn't realize that I had to upload the file separately from submitting the post. But the "File Uploaded" is green, so maybe it worked anyway. We'll see.

Nope. Maybe I can paste it in as an edit?

Edit to add: Yes, that seems to be a workaround. Submit a safe message, then paste in the part that bothers the spam filter. Or maybe all just chance.

John D. McCalpin wrote:
(1) Concerning the use of the Line Fill
Buffers: Intel's SW Optimization Guide section 2.2.5.2 (L1 DCache)
makes it clear that the LFB's are used for ordinary cache misses and
for streaming stores. L1 hardware prefetches are not mentioned
explicitly, but there is a performance counter event called
"LOAD_HIT_PRE.HW_PF" that counts "Not-SW-prefetch load dispatches that
hit fill buffer allocated for H/W prefetch".

That's mighty close to a definitive answer that L1 Hardware Prefetches
consume Line Fill Buffers. Thanks!

John D. McCalpin wrote:
(4) As a quick "sanity check", I just ran Version006 of my
ReadOnly benchmark on a Xeon E3-1270 system (1 socket, 4 cores, 3.4
GHz - Turbo disabled, 2 DDR3/1333 channels) with all 16 possible
settings of enabling/disabling the four prefetchers. For this code
only two of the prefetchers changed the performance by more than 1%:

No HW Prefetch: avg BW = 8900 MB/s
L1 DCU streaming prefetcher = 9120 MB/s (+2.5%)
L2 Streamer = 16400 MB/s (+84.3% relative to no PF)

Thanks for running that. I'm surprised that the L2 Spatial Prefetcher
alone (companion line) didn't have a significant effect. I guess this
is because it's too late to be useful. Do you recommend a tool to
turn these Prefetchers on and off programmatically? Or do you use
msr-write directly to the MSRs?

John D. McCalpin wrote:
For this system the open page memory latency is about 54 ns, so
the effective concurrency for the three cases above is about 7.5 lines
with no HW prefetch, 7.8 lines with the L1 DC streaming prefetcher,
and 13.8 lines with the L2 Streamer prefetcher. This last value is
77% of the peak DRAM bandwidth, so it is probably limited by various
DRAM stalls more than by available concurrency.

If the DRAM stalls are the limiting factor, I'm confused that 4
channel machines would have about the same single-thread performance,
and that running multiple threads on different cores would improve the
bandwidth. Otherwise you'd think you'd be able to get twice the
single thread performance by spreading the load.

John D. McCalpin wrote:
(5) Cache bandwidth limitations are very complex. If the
concurrency limit is bigger than the observed bandwidth, that just
means that concurrency was not the limiter.

Yes, it strongly implies that there is some other limiting factor more
stringent than the one you calculated, perhaps a different concurrency
limit that you didn't consider. For RAM, the LFB's seem to be the
most restrictive. But for the other levels, something else is. I'd
love it if someone from Intel could suggest what these limiting
factors might be for reads from the various cache levels.

TimP · ‎10-31-2013

Nathan K. wrote:

Quote:

John D. McCalpin wrote:
(1) Concerning the use of the Line Fill
Buffers: Intel's SW Optimization Guide section 2.2.5.2 (L1 DCache)
makes it clear that the LFB's are used for ordinary cache misses and
for streaming stores. L1 hardware prefetches are not mentioned
explicitly, but there is a performance counter event called
"LOAD_HIT_PRE.HW_PF" that counts "Not-SW-prefetch load dispatches that
hit fill buffer allocated for H/W prefetch".

That's mighty close to a definitive answer that L1 Hardware Prefetches
consume Line Fill Buffers. Thanks!

What this clarifies is that loads check for modified data in fill buffers. I suppose you might call that consumption, but it is not proof that loads modify the contents of fill buffers, which I doubt. It doesn't even tell us how serious a performance hit is incurred when this happens (aside from telling us that someone decided these events might be worth counting). Far afield from the question of stream benchmarks, such cases are observed to be very slow when the read back isn't aligned with the preceding store. The Intel MIC compilers have special shuffling optimizations to avoid some of those performance stalls which are observed on Xeon.

There are more effective buffers definitely used by load streams in the good performance cases.

McCalpinJohn · ‎10-31-2013

It is clear from measurements of hardware event MEM_LOAD_UOPS_RETIRED.HIT_LFB that ordinary load misses use the line fill buffers. The description is:

Retired load uops which data sources were load uops missed L1 but hit FB due to preceding miss to the same cache line with data not ready.

This event counts exactly what you would expect if the first load to a cache line (not in the L1 DCache) allocates a line fill buffer and subsequent loads to other parts of that line also miss in the cache, but hit in the LFB (and increment this counter). With contiguous 64-bit loads you get seven of these for every cache miss, while for contiguous 128-bit loads you get three of these for every cache miss. I have measured in a variety of circumstances and the results are unambiguous.
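
The seven-versus-three ratio is just the number of loads per 64-byte line minus the one load that takes the miss. Here is a small sketch of that count, assuming aligned, contiguous loads that all issue before the line's data arrives. The function name `lfb_hits_per_miss` is illustrative and is not a performance-counter API.

```python
def lfb_hits_per_miss(load_bits, line_bytes=64):
    """Expected HIT_LFB counts per L1 miss for aligned, contiguous loads.

    The first load to a line allocates the fill buffer (the miss); every
    other load to the same line is assumed to hit that fill buffer.
    """
    load_bytes = load_bits // 8
    if load_bytes <= 0 or line_bytes % load_bytes != 0:
        raise ValueError("load size must evenly divide the cache line")
    return line_bytes // load_bytes - 1


if __name__ == "__main__":
    for bits in (64, 128):
        print(f"{bits}-bit loads: {lfb_hits_per_miss(bits)} LFB hits per miss")
```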

perfwise · ‎11-03-2013

Unfortunately the event you just mentioned doesn't work on Haswell. Tim, you might verify that and send it upstream so it gets fixed on some future derivative part. I think the important thing to realize, which I pointed out previously, is not the LFBs but the super queue entries, which track the L2 misses, i.e. requests to the L3 and system memory. There are many more of them than there are LFBs. Turning on the L2 HW prefetcher allows it to use those entries and buffer the latency of memory, as I mentioned earlier, to maximize utilization of the memory interface. Without it, you're simply relying on the L1 demand requests to use all the entries of the super queue (or whatever it's now called). That's not possible, and the L2 prefetcher uses those entries when it's turned on. Also, every L1 demand miss of the kind counted by the event John mentioned above replays so as to get the line when it's installed in the L1. That can be observed by comparing the count of retired loads with the number of "speculative" MEM LD uops (pmc 0x40).

perfwise

QIAOMIN_Q_ · ‎11-03-2013

Can't *contiguous* 128-bit loads hit in the L1 cache after the first miss, once its adjacent cache line has been prefetched? Would the CPU not allow that? Maybe I misunderstood. Thanks.

perfwise · ‎11-04-2013

QIAOMIN, I don't believe anybody is saying that they hit in the L1D. Those requests look up the tag in the L1D, find that the line is not there, and instead hit the LFB allocated by a previous miss to the same cache line. The CPU allows that, I believe. Also, this PMC doesn't work on Haswell.

perfwise

QIAOMIN_Q_ · ‎11-05-2013

Sorry, but I meant accesses to the adjacent cache lines, not the hits within the same cache line.