Memory Buffers

maratyszcza · 09-10-2011

Intel optimization guides and presentations mention many types of memory access buffers. These include Fill Buffers (mentioned in the recent Optimization Manual), Write Combining Buffers (mentioned in the same document), Write Buffers (mentioned in the not-so-recent Optimization Manual), and Write-Back Buffers (mentioned in this Pentium 3 review by AnandTech). Are those just different names for the same buffers? If not, what is the difference between them?

Patrick_F_Intel1 · 09-10-2011

Hello maratyszcza,
These are questions that are easy to ask but hard to answer.
On Sandy Bridge you can see the various buffers in the block diagram (Figure 2-1) in the Optimization manual.
If you search for 'fill buffer' in the PDF you can see that the Line fill buffer (LFB) is allocated after an L1D miss. The LFB holds the data as it comes in to satisfy the L1D miss but before all the data is ready to be written to the L1D cache.
Load and store buffers (in Figure 2-1) are used to hold loads and stores to memory. These can take a relatively long time to complete. For instance, stores happen after the instruction retires, and the store buffer keeps hold of the address and data until the store retires.
Write combining buffers (on Atom, Core, Core Duo, Core Solo, P4) hold modified cache lines before the stores are complete. These are similar to (but not exactly the same as) store buffers on Nehalem and Sandy Bridge.
I imagine that write-back buffers are similar to store buffers (but probably simpler).
Does this help?
Pat

TimP · 09-10-2011

My understanding was that write combining buffers (Pentium 4 through Prescott and the like) had a function analogous to fill buffers (beginning with Woodcrest), except that WCBs were flushed to L2 and fill buffers to L1D (as Patrick says above). Also, on the (now obsolete) CPUs which I worked with, the final number of WCBs was 6 per core, while the fill buffers have been 10 per core. In either case, there is an automatic flush of the least recently used buffer when fewer than 2 clean buffers are available.
Also, HyperThreads increase the demand for these buffers; only half of the WCBs were available per thread when HT was active, while, when HT was re-introduced to the CPUs with fill buffers, a demand-based partitioning became available. As a consequence, HyperThreading may be more effective for applications which don't demand many of these buffers.

maratyszcza · 09-11-2011

Thank you for the timely answers. However, one thing that confuses me is that some CPUs are claimed to have several types of these buffers. One example is Pentium III/Coppermine, which has 8 Fill Buffers and 6 Write-Back Buffers (all numbers are from this article). What is the difference in their usage in Coppermine? Another example is Core 2, which seems to have 8 Fill Buffers, 8 Write Combining Buffers, and 4 Write-Back Buffers (I found that data in this post about the Hot Chips conference).

Also, in this ISN article we can read that "Consult the Intel 64 and IA-32 Architectures Optimization Reference Manual for the number of fill buffers in a particular processor; typically the number is 8 to 10. Note that sometimes these are also referred to as "Write Combining Buffers", since on some older processors only streaming stores were supported." Does it mean that Write Combining Buffer is just another name for Fill Buffer?

TimP · 09-11-2011

We've attempted to explain that the Write Combining Buffers of the Netburst CPUs perform an analogous function to the fill buffers introduced in Woodcrest. I believe you're correct that Woodcrest had 8 of those per core, while Nehalem and more recent CPUs have 10 per core. Although this architectural change was made with Woodcrest, the literature didn't catch up for a long time.
From the point of view of the effect on code optimization, this combining buffer effect appeared on the SPARC architecture before it was introduced in Pentium 4. It limited the number of store streams (array sections) supported efficiently per loop to 9, the same number that the CPUs with 10 fill buffers support efficiently (when limited to 1 thread per core).
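To make the store-stream limit concrete, here is a minimal sketch in C of splitting one loop that writes many arrays into several loops that each write fewer. The function names, the count of 12 output arrays, and the arithmetic in the loop body are all made up for illustration; the number of streams per loop that works best depends on the CPU and should be measured rather than assumed.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_STREAMS 12  /* hypothetical: more store streams than the ~9 discussed above */

/* One loop, 12 store streams: every iteration writes one element of each array. */
void fill_fused(float *out[NUM_STREAMS], const float *in, size_t n)
{
    assert(out != NULL && in != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t s = 0; s < NUM_STREAMS; s++)
            out[s][i] = in[i] * (float)(s + 1);
}

/* Same result, but each pass over the data writes at most per_loop arrays. */
void fill_split(float *out[NUM_STREAMS], const float *in, size_t n, size_t per_loop)
{
    assert(out != NULL && in != NULL);
    assert(per_loop > 0);
    for (size_t first = 0; first < NUM_STREAMS; first += per_loop) {
        size_t last = first + per_loop;
        if (last > NUM_STREAMS)
            last = NUM_STREAMS;
        for (size_t i = 0; i < n; i++)
            for (size_t s = first; s < last; s++)
                out[s][i] = in[i] * (float)(s + 1);
    }
}
```

Calling fill_split(out, in, n, 6) makes two passes of 6 store streams each. The cost is reading the input twice, so whether the split wins depends on how expensive the extra reads are compared with the buffer pressure it removes.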
I agree that the part you quote about "since on some older processors only streaming stores were supported" doesn't make sense. Perhaps it refers to CPUs such as Pentium-III which predate the Write Combining Buffer scheme. The context you quote is a discussion of the "streaming" or "non-temporal" stores which bypass cache. In practice, it is unusual to be able to put more than 1 non-temporal store stream (for an array where the pre-existing contents are ignored and replaced, as in a memset operation) per loop.
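For reference, a memset-style fill with non-temporal stores might look like the sketch below. It uses the SSE2 intrinsics _mm_set1_epi8, _mm_stream_si128 and _mm_sfence; the function name fill_streaming and the requirement that the buffer be 16-byte aligned with a length that is a multiple of 16 are simplifying assumptions of this sketch, and it falls back to plain memset when SSE2 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_set1_epi8, _mm_stream_si128, _mm_sfence */
#endif

/* Fill n bytes at dst with value, ignoring the previous contents.
 * Assumes dst is 16-byte aligned and n is a multiple of 16. */
void fill_streaming(void *dst, unsigned char value, size_t n)
{
    assert(((uintptr_t)dst % 16) == 0);
    assert(n % 16 == 0);
#if defined(__SSE2__)
    __m128i v = _mm_set1_epi8((char)value);
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < n / 16; i++)
        _mm_stream_si128(p + i, v);  /* non-temporal store of 16 bytes */
    _mm_sfence();  /* order the streaming stores before any later stores */
#else
    memset(dst, value, n);  /* portable fallback without streaming stores */
#endif
}
```

This is a single store stream, which matches the case described above: the old contents of the array are never read, so there is nothing to gain from bringing the lines into cache first.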
In attempting to recognize the existence of the Atom (and prototype MIC architecture) without specifically explaining how they differ, these documents also tend to create confusion.