Can someone please describe the purpose of the "load buffer" that sits between the registers and the L1 cache on most x86 machines?
I am playing around with memory barriers in Java and am curious to know how they interact with load and store buffers. I do know that the store buffer is used to hold the value of a store so the CPU need not stall while the cache line is invalidated in other cores and the acknowledgements come back from the other CPUs.
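For context, here is the sort of litmus test I have been playing with (my own scaffolding; the class and field names are just illustrative). It is the classic store-buffering pattern: each thread stores to one variable and then loads the other, and without a full fence both threads can read a stale 0 because their own stores are still sitting in their store buffers:

```java
import java.lang.invoke.VarHandle;
import java.util.concurrent.CyclicBarrier;

// Store-buffering (SB) litmus test. Without a fence, each thread's store can
// still be sitting in its store buffer when the subsequent load executes, so
// both threads may load the other variable's old value (r1 == r2 == 0).
// A full fence between the store and the load forbids that outcome.
public class StoreBufferLitmus {
    static int x, y, r1, r2;
    static int bothZero;

    public static void main(String[] args) throws Exception {
        final int TRIALS = 20_000;
        CyclicBarrier start = new CyclicBarrier(3); // main + 2 worker threads
        CyclicBarrier done  = new CyclicBarrier(3);

        Thread t1 = new Thread(() -> run(start, done, TRIALS, true));
        Thread t2 = new Thread(() -> run(start, done, TRIALS, false));
        t1.start(); t2.start();

        for (int i = 0; i < TRIALS; i++) {
            x = 0; y = 0;        // reset before releasing the workers
            start.await();       // release both workers for this trial
            done.await();        // both workers finished this trial
            if (r1 == 0 && r2 == 0) bothZero++;
        }
        t1.join(); t2.join();
        System.out.println("forbidden r1==r2==0 outcomes: " + bothZero);
        if (bothZero != 0) throw new AssertionError("fence did not order store->load");
    }

    static void run(CyclicBarrier start, CyclicBarrier done, int trials, boolean first) {
        try {
            for (int i = 0; i < trials; i++) {
                start.await();
                if (first) { x = 1; VarHandle.fullFence(); r1 = y; }
                else       { y = 1; VarHandle.fullFence(); r2 = x; }
                done.await();
            }
        } catch (Exception e) { throw new RuntimeException(e); }
    }
}
```

As I understand it, on x86 HotSpot implements VarHandle.fullFence() with MFENCE or a lock-prefixed instruction, which forces the store buffer to drain before later loads execute; removing the two fence calls makes the r1 == r2 == 0 outcome observable.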
However, I wasn't able to find much on the "load buffer". Any comments, links to descriptions etc would be most helpful.
Apologies in advance if this question has already been asked (I didn't find it while searching).
You seem to be asking how load buffers behave in the presence of race conditions across threads. I agree that I haven't found references on this. The load and store buffers of the memory ordering machinery are documented as supporting store-to-load forwarding within a single thread, so that execution need not stall waiting for cache updates. In the case of a race condition they appear to be a path for indeterminacy: execution may proceed using old data from the load buffer even though the corresponding cache invalidation is still pending.
Sorry for the confusion. I am afraid I am asking something much more basic than race conditions. :)
Essentially, my question is: what purpose does a load buffer serve? Further, what is the effect of an LFENCE on this load buffer?
There is a discussion in Intel's Optimization Reference Manual, section 2.3.5 Cache Hierarchy (which includes the subsection "Load and Store Operation Overview") and section 2.3.6 System Agent. That is specifically for the Sandy Bridge architecture; modifications in later systems are discussed in sections 2.1 and 2.2.
It looks like this is an architecturally invisible feature that exists for the convenience of the designers.
By "architecturally invisible", I mean that it causes no change to the cache consistency and ordering model, so if it is visible at all, it is only visible via its impact on performance.
I suspect that the implementation of buffering between the L1 Data Cache and the processor's physical registers is rather different on different processors. For example, in the Sandy Bridge generation the L1 Data Cache has 8 "banks", each of which is 8 Bytes wide. The L1 Data Cache supports up to two 16-Byte loads and one 16-Byte store per cycle, but it is not trivial for the core to generate operations at the corresponding rate. In the case of Sandy Bridge the issues are well known:
- The core can only generate two addresses per cycle, so to sustain three 16-Byte transfers per cycle you need to be using 32-Byte (256-bit AVX) memory operations (each of which takes 2 cycles to move its 32 Bytes, so the two address-generation slots per cycle suffice).
- The cache can only perform multiple operations per cycle if they are to non-conflicting banks.
- This can be tricky to reason about, because it is not obvious when a store will actually need to access the cache -- the exact cycle in which a store accesses the L1 Data Cache can easily be offset from the exact cycle in which loads executing alongside it need to access the L1 Data Cache.
- The core+cache can only perform multiple operations per cycle if none of the load or store operations cross a cache line boundary.
There are other special cases to be considered. For example, Sandy Bridge can execute two 8/16/32-bit loads per cycle even if they are grabbing bits from the same bank. If I recall correctly this works whether the loads are to contiguous, discontiguous, or overlapping bit fields within the 8-Byte-aligned data held by the L1 Data Cache bank.
It is easy to imagine that some sort of buffering would be needed to match width and alignment of the core's requests with the width and alignment of the multi-bank L1 Data Cache interface.
Haswell is completely different. My interpretation of the Haswell L1 Data Cache is that it is (for reads) a dual-ported cache with 64-Byte (512-bit) full-cache-line read ports. The core can request up to 2 32-Byte (256-bit) reads and 1 32-Byte (256-bit) store per cycle. The L1 Data Cache can service two reads per cycle for any size and any alignment -- as long as neither of the reads crosses a cache line boundary.

In other words, it looks like each of the two read ports of the L1 Data Cache reads (up to) the full cache line containing the requested data and can return anywhere between 1 Byte (8 bits) and 32 Bytes (256 bits) for any alignment, as long as all of the data requested is within that 512-bit cache line. The cache can also service any two loads from the same cache line in one cycle (independent of alignment and/or overlap).

For loads that cross a cache line boundary, it looks like the L1 Data Cache must use both read ports -- one port to read the cache line containing the "lower" part of the requested data, and the other port to read the next cache line, which contains the "upper" part of the requested data. This limits throughput to one such load per cycle (from the core's perspective).

It took a lot of experiments and a lot of thinking to figure out the read behavior of the cache. Because of the potential for cycle skew between loads and stores, I don't know how to extend the methodology to understand how the L1 Data Cache handles stores.
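One way to observe the cache-line-crossing penalty described above from user code is a micro-benchmark along these lines (a Java sketch of my own; the names and sizes are arbitrary and the timings are machine dependent and only illustrative):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Sketch: a 4-Byte load at byte offset 0 of a 64-Byte line never crosses a
// line boundary, while a load at offset 62 straddles two lines on every
// access and (per the discussion above) should tie up both read ports.
// Timings are machine dependent and purely illustrative.
public class LineCrossLoads {
    static final VarHandle INT =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

    // Sum one 4-Byte load per 64-Byte line, at byte offset 'base' in the line.
    static long sumAtOffset(byte[] buf, int base) {
        long sum = 0;
        for (int line = 0; line + base + 4 <= buf.length; line += 64) {
            sum += (int) INT.get(buf, line + base); // plain get permits misalignment
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] buf = new byte[1 << 14];             // 16 KiB: fits in L1
        for (int i = 0; i < buf.length; i++) buf[i] = (byte) i;

        long sink = 0;                              // keep results live for the JIT
        for (int w = 0; w < 10_000; w++) {          // warm up the JIT
            sink += sumAtOffset(buf, 0) + sumAtOffset(buf, 62);
        }
        long t0 = System.nanoTime();
        for (int r = 0; r < 10_000; r++) sink += sumAtOffset(buf, 0);
        long t1 = System.nanoTime();
        for (int r = 0; r < 10_000; r++) sink += sumAtOffset(buf, 62);
        long t2 = System.nanoTime();

        System.out.printf("within-line: %d us, line-crossing: %d us (sink=%d)%n",
                (t1 - t0) / 1_000, (t2 - t1) / 1_000, sink);
    }
}
```

The misaligned reads use a byte-array view VarHandle, whose plain get mode allows unaligned access; only the relative gap between the two timings is meaningful, not the absolute numbers.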
Again, it is easy to imagine that some form of buffering is required to match the width and alignment of the core's requests with the data provided by the L1 Data Cache interface -- and that the implementation of this buffering could be quite different from that required on a Sandy Bridge processor.
What I have read about the load and store buffers being classed as memory ordering buffers seems to imply that they are actually part of the machinery that implements the ordering model. As John said, they are meant to be "architecturally invisible", thus aren't mentioned in the Optimization Reference Manual, and may only be of concern to the programmer when loads and stores are mixed (or maybe for loads from caches across cores).
With respect to Haswell stores, I find that nontemporal streaming stores can roughly double the performance of store streams longer than about 8 KB, whereas nontemporal stores are of no benefit on older Intel CPUs; so Haswell does not act as if the store stream and the read-for-ownership stream are independent. Haswell also avoids the L1 locality performance limitation of Sandy Bridge and Ivy Bridge, where L1 cache read misses that hit in L2 were limited to 128-bit aligned transfers. So it seems that blocking for L2 may be much more viable on HSW. I've never seen these details written up, but I don't know that these topics have any direct relationship to the topic of load buffers.
Load and Store Operation Enhancements
The L1 data cache can handle two 256-bit load and one 256-bit store operations each cycle. The unified L2 can service one cache line (64 bytes) each cycle. Additionally, there are 72 load buffers and 42 store buffers available to support micro-ops execution in-flight.
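The load-buffer count quoted above matters because those entries are what let many loads, including cache misses, be in flight at once (memory-level parallelism). A rough way to see the effect is this Java sketch of my own (names and sizes are arbitrary, timings machine dependent): compare a pointer-chasing traversal, where each load's address depends on the previous load's result, with a plain scan whose loads are independent and can overlap:

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch: load buffers allow many loads (including cache misses) to be in
// flight at once. A pointer-chasing traversal cannot exploit this, because
// each load's address depends on the previous load's result; a plain scan
// issues independent loads that can overlap. Timings are machine dependent
// and purely illustrative.
public class MemoryLevelParallelism {
    // Sattolo's algorithm: one random cycle over all n slots, so following
    // next[] from index 0 visits every slot exactly once.
    static int[] makeCycle(int n) {
        int[] next = new int[n];
        for (int i = 0; i < n; i++) next[i] = i;
        for (int i = n - 1; i > 0; i--) {
            int j = ThreadLocalRandom.current().nextInt(i);
            int tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }
        return next;
    }

    static long chase(int[] next) {      // dependent loads: no overlap possible
        long sum = 0;
        int p = 0;
        for (int i = 0; i < next.length; i++) { p = next[p]; sum += p; }
        return sum;
    }

    static long scan(int[] next) {       // independent loads: can overlap
        long sum = 0;
        for (int v : next) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        int[] next = makeCycle(1 << 21);             // 8 MiB: mostly cache misses
        for (int w = 0; w < 2; w++) { chase(next); scan(next); }  // warm up JIT
        long t0 = System.nanoTime();
        long a = chase(next);
        long t1 = System.nanoTime();
        long b = scan(next);
        long t2 = System.nanoTime();
        if (a != b) throw new AssertionError();      // both visit every slot once
        System.out.printf("dependent chain: %d ms, independent scan: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```

On typical hardware the dependent chain is many times slower than the scan even though both perform the same number of loads, which is a user-visible consequence of having dozens of load-buffer entries available for independent in-flight loads.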