
WB vs WC memory type

cianfa72
Beginner

Reading the Intel SDM Vol. 3 (Jun 2015) and the Intel Optimization Reference Manual (Sep 2015), I have some doubts about memory caching types, in particular the Write-Back (WB) type with write-combining support versus the Write-Combining (WC) memory type. Just to fix ideas, consider the Haswell microarchitecture.

Intel SDM Section 11.3, "Methods of Caching Available," makes no reference to line fill buffers in its description of the WB type, whereas Optimization Manual Section 2.3.5.2 refers to them explicitly in Table 2-18. Quoting that section:

(L1 DCache) can maintain up to 36 store operations from allocation until the store is committed to the cache, or written to the line fill buffers (LFB) in the case of non-temporal stores

Here it seems LFBs are involved only in the case of non-temporal stores, which, as far as I understand, require a WC memory type (in other words, the WC memory type is needed for non-temporal stores).

On the other hand, as discussed here https://software.intel.com/it-it/forums/software-tuning-performance-optimization-platform-monitoring/topic/595220#comment-1843382 , it seems line fill buffers are also involved in cache line fills (filling a cache line from the memory hierarchy) and when data has to be written back to memory, even with the WB memory type.

Can you help me better understand how things really work? Thanks

 

 

 

1 Solution
McCalpinJohn
Honored Contributor III

Intel has traditionally refrained from providing detailed descriptions of implementations, and this case is no different, but I think that the preponderance of evidence is fairly clear....

The Optimization Reference Manual shows the Line Fill Buffers as sitting between the L1 Data Cache and the L2 cache in Figures 2-1 (Haswell) and 2-4 (Sandy Bridge).     The text of Section 2.2.5.2 (Sandy Bridge L1 Data Cache) says:

The L1 DCache can maintain up to 64 load micro-ops from allocation until retirement. It can maintain up to 36 store operations from allocation until the store value is committed to the cache, or written to the line fill buffers (LFB) in the case of non-temporal stores.

 

The L1 DCache can handle multiple outstanding cache misses and continue to service incoming stores and loads. Up to 10 requests of missing cache lines can be managed simultaneously using the LFB.

The L1 DCache is a write-back write-allocate cache. Stores that hit in the DCU do not update the lower levels of the memory hierarchy. Stores that miss the DCU allocate a cache line.

It is also relevant to note that Chapter 19 of Volume 3 of the SWDM describes performance counter events such as:

MEM_LOAD_UOPS_RETIRED.HIT_LFB   Retired Load uops which data sources were load uops missed L1 but hit LFB due to preceding miss to the same cache line with data not ready.

LOAD_HIT_PRE.SW_PF   Not SW-prefetch load dispatches that hit fill buffer allocated for SW prefetch.

LOAD_HIT_PRE.HW_PF   Not SW-prefetch load dispatches that hit fill buffer allocated for HW prefetch.
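As an aside, one way to watch the HIT_LFB event directly is to count it from user space with Linux perf_event_open(). The sketch below assumes the EventSel=0xD1, UMask=0x40 raw encoding commonly listed for MEM_LOAD_UOPS_RETIRED.HIT_LFB on Sandy Bridge and Haswell; verify the encoding against Chapter 19 for your own CPU before trusting the numbers.

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>
    #include <stdint.h>

    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_RAW;
        attr.config = 0x40d1;        /* assumed: EventSel 0xD1, UMask 0x40 =
                                        MEM_LOAD_UOPS_RETIRED.HIT_LFB on SNB/HSW */
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        int fd = perf_event_open(&attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        /* Sequential 8-byte loads: when a load misses L1, the following loads to
           the same 64-byte line should hit the LFB allocated by that miss. */
        enum { N = 1 << 22 };
        static volatile uint64_t buf[N];
        uint64_t sum = 0;

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        for (size_t i = 0; i < N; i++)
            sum += buf[i];
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t count = 0;
        read(fd, &count, sizeof(count));
        printf("HIT_LFB: %llu  (checksum %llu)\n",
               (unsigned long long)count, (unsigned long long)sum);
        close(fd);
        return 0;
    }

Loads that touch a cache line while an earlier miss to the same line is still outstanding should inflate this count.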

From these sources, it seems reasonable to conclude that the Line Fill Buffers do (at least) two things:

  1. They track L1 Data Cache load misses and store misses.
  2. They provide a temporary storage area for the outbound data generated by streaming (non-temporal) stores.

It is harder to tell (and perhaps not important to know) whether the Line Fill Buffers actually buffer the incoming data for L1 Data Cache misses before it is written to the L1. It is also hard to tell whether the storage available for streaming (non-temporal) stores is as large as the 10 cache lines that the LFBs support for outstanding load/store misses, or whether the space available for streaming (non-temporal) stores competes with the entries available to manage L1 Data Cache misses. It is hard to tell whether the connection between the Line Fill Buffers (tracking cache lines) and the Load Buffers (tracking load uops) is made via back-pointers in the LFB, forward pointers in the Load Buffers, a combination of both, or some completely different mechanism.

Anyone can attempt to design directed tests to evaluate various hypotheses about implementation details, but this is not easy work, and the results are seldom as clear-cut as one would like. Disabling the hardware prefetchers often helps the results make sense, but does not necessarily help with understanding the behavior of the system when hardware prefetching is enabled.
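For reference, a minimal sketch of disabling the core prefetchers from user space is below. It assumes the MSR 0x1A4 (MSR_MISC_FEATURE_CONTROL) bit layout that Intel has published for Nehalem and later cores (bits 0-3 control the L2 HW prefetcher, L2 adjacent-line prefetcher, DCU streamer, and DCU IP prefetcher), and it requires root plus the msr kernel module; verify the layout for your specific part.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Assumed layout of MSR 0x1A4: bits 0-3 disable the L2 HW prefetcher,
           L2 adjacent-line prefetcher, DCU streamer, and DCU IP prefetcher. */
        int fd = open("/dev/cpu/0/msr", O_RDWR);    /* logical CPU 0 only */
        if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

        uint64_t val;
        if (pread(fd, &val, sizeof(val), 0x1A4) != sizeof(val)) {
            perror("pread MSR 0x1A4"); return 1;
        }
        printf("MSR 0x1A4 before: 0x%llx\n", (unsigned long long)val);

        val |= 0xF;                                 /* set bits 0-3: all four prefetchers off */
        if (pwrite(fd, &val, sizeof(val), 0x1A4) != sizeof(val)) {
            perror("pwrite MSR 0x1A4"); return 1;
        }
        close(fd);
        return 0;
    }

Clearing the bits again (or rebooting) restores the default prefetcher behavior; note that this sketch only touches logical CPU 0, so a real test would repeat the write for every CPU under /dev/cpu/.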

 

So, getting back to WC vs. WB:

  • WB is the default and it is the way the system works most of the time for processor accesses to system memory.  
    • Store misses cause the target cache line to be read into the requesting core's Data Cache and invalidated in all other caches. 
      • Note that this store miss is a type of Data Cache miss, which will use a Line Fill Buffer to track/manage as a "load with intent to modify" transaction.
    • Once the target cache line has been returned to the L1 Data Cache, the core can issue stores into that line. 
      • These stores are buffered through a Store Buffer and then written to the cache.  
    • When the cache line is chosen as the victim, it is written back to the next level of the cache.
  • WC is mostly a mode that was developed for IO -- allowing a processor to perform strongly ordered uncached reads, but allowing weakly ordered streaming stores.  
    • The streaming functionality developed for WC memory-mapped IO is also available for WB memory ranges using the "MOVNT*" family of streaming store functions.  
    • If a streaming store misses in the Data Cache, it does not read the target cache line into the Data Cache -- instead it collects the data from the stores in "write combining buffers", attempting to collect full cache lines that can be transferred to memory with maximum efficiency. 
    • On Intel processors there are at least two references that say that the "write combining buffer" functionality is implemented in the Line Fill Buffers.   
      • When streaming stores are sent to memory, an invalidation transaction is transmitted to all caching agents in the system to invalidate the corresponding line. 
      • Note that with normal stores that allocate a line in the data cache, this invalidation is done *before* the store is allowed to complete, while with streaming stores it can be done *after* the store is allowed to complete.
      • This difference in the timing of the invalidations is one of the reasons that FENCE instructions are needed to enforce ordering between streaming stores and ordinary allocating stores (a short sketch illustrating this follows the list).
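As a concrete illustration of the MOVNT*/fence points above, here is a minimal C sketch of streaming stores to ordinary (WB) heap memory followed by an SFENCE before a "ready" flag is published. The function and buffer names are purely illustrative.

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Stream 'value' into dst[0..n-1] with non-temporal (MOVNTDQ) stores, then
       publish a flag.  dst is assumed 16-byte aligned and n a multiple of 4. */
    static void fill_nt(int32_t *dst, int32_t value, size_t n, volatile int *ready)
    {
        __m128i v = _mm_set1_epi32(value);
        for (size_t i = 0; i < n; i += 4)
            _mm_stream_si128((__m128i *)(dst + i), v);  /* bypasses the cache; data collects
                                                           in the write-combining (LFB) buffers */
        _mm_sfence();   /* drain the write-combining buffers before the ordinary store below */
        *ready = 1;     /* normal allocating store, now ordered after the NT data */
    }

    int main(void)
    {
        enum { N = 1 << 20 };
        int32_t *buf = aligned_alloc(64, N * sizeof(int32_t));
        if (!buf) return 1;
        volatile int ready = 0;
        fill_nt(buf, 42, N, &ready);
        free(buf);
        return 0;
    }

Without the _mm_sfence(), another core that observes ready == 1 is not guaranteed to also observe the streamed data, precisely because the streaming stores are weakly ordered with respect to the ordinary flag store.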
