Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

How does WC-buffer relate to LFB?

ebashinskii__ebashin

It is not clear from the Intel Software Optimization Manual how the WC-buffer and the LFB relate to each other. For Haswell, there is a pipeline schema that looks as follows:

Haswell Pipeline Structure

The Line Fill Buffers and Store Buffers are depicted in the schema, but where is the WC-buffer? Is it the same as the LFB? Section 3.6.9 of the Intel Optimization Manual notes that

Write combining buffers are used for stores of all memory types

meaning that non-temporal and regular stores to WB memory go through the WC-buffer anyway. The QUESTION is: how do stores get into the WC-buffer?

My current picture of store execution is this:

1. The store instruction is decoded by the Front End.

2. The store uops go into the Store Buffer.

3.1 An NT store goes directly to an LFB, avoiding an RFO request. It is flushed as soon as the whole cache line is combined, an sfence occurs, or the entry is evicted because the LFB runs out of space; the eviction case causes multiple bus transactions to flush the incomplete cache line.

3.2 A regular store goes into an LFB when an L1D store miss occurs (the cache line is absent or not in the right state) and an RFO request is performed; otherwise the store completes in the L1D without going further to L2 and above.

This is most likely wrong, since there is no WC-buffer anywhere in the store execution flow I described.

I'd like to ask for corrections to the store execution flow for NT and regular stores to WB memory on Haswell.

 

1 Solution
McCalpinJohn
Honored Contributor III

I don't know if Intel has disclosed enough details to be completely confident about the implementations. Intel clearly re-uses a lot of the implementation from one generation to another, but that is no guarantee that a disclosure about the implementation of one generation will carry over to subsequent generations...

Some thoughts:

  • In the figure above, the "Load Buffers, Store Buffers, Reorder Buffers" are connected to the "Allocate/Rename/Retire" unit, which implies that they are buffers for linking various types of uops to the physical registers and to the idealized Byte-addressable memory.  
  • In contrast to the load and store buffers, the "Line Fill Buffers" are located between the L1 Data Cache and the L2 cache -- so these are directly tied to cache line transfers, and not to specific uops.
  • Section 2.5.5.1 of the Optimization Manual describes stores to WriteBack memory in Sandy Bridge:
    • "Reading for ownership and storing the data happens after instruction retirement and follows the order of store instruction retirement. [...] As long as the store does not complete, its entry remains occupied in the store buffer. When the store buffer becomes full, new micro-ops cannot enter the execution pipe and execution might stall."
    • "So for stores to WB memory, the "store buffer" entries contain the data from store uops between the time the store executes and some time after the store uop is retired, when the data is transferred to the L1 Data Cache.
    • Interestingly, the term "retire" does not occur in Section 8.2 "Memory Ordering" in Volume 3 of the SW Developer's Manual.  Instead, this section only discusses instruction "execution". 
  • Section 2.5.5.2 of the Optimization Manual describes the L1 Data Cache in Sandy Bridge:
    • "The L1 DCache can maintain up to 64 load micro-ops from allocation until retirement. It can maintain up to 36 store operations from allocation until the store value is committed to the cache, or written to the line fill buffers (LFB) in the case of non-temporal stores."
    • "The L1 DCache can handle multiple outstanding cache misses and continue to service incoming stores and loads. Up to 10 requests of missing cache lines can be managed simultaneously using the LFB."
  • Section 3.6.9 of the optimization manual also says:
    • "Beginning with Nehalem microarchitecture, there are 10 buffers available for write-combining."
    • The number 10 matches the number of LFBs reported for both Nehalem and Sandy Bridge in Section 2.5.5.2.

Putting this together suggests to me that "write combining" is a *function* that is supported by the LFBs, so it does not require separate buffers.

For stores to WB space, the store data stays in the store buffer until after the retirement of the stores.  Once retired, data can be written to the L1 Data Cache (if the line is present and has write permission); otherwise an LFB is allocated for the store miss.  The LFB will eventually receive the "current" copy of the cache line so that it can be installed in the L1 Data Cache and the store data can be written to the cache.  Details of merging, buffering, ordering, and "short cuts" are unclear...  One interpretation that is reasonably consistent with the above would be that the LFBs serve as the cacheline-sized buffers in which store data is merged before being sent to the L1 Data Cache.  At least I think that makes sense, but I am probably forgetting something...
