Intel Optimization Manual 18.104.22.168 gives a Description of Stores in Sandy Bridge:
Reading for ownership and storing the data happens after instruction retirement and follows the order of store instruction retirement.
As long as the store does not complete, its entry remains occupied in the store buffer.
This text is confusing to me.
How is it possible that the RFO and the data write for a store instruction happen after retirement? The L1D on Skylake has a best-case latency of 4 cycles, and an RFO may hit in L2, which has a best-case latency of 12 cycles.
Consider an aligned store that writes to a line in the Exclusive state: as I read it, the store is placed in the store buffer (SB) and then removed immediately after retirement. Only then, after it has retired and left the SB, does the data write happen, which takes at least 4 cycles.
Where does the store live after it has retired and been removed from the store buffer, but before the data write completes? What if a load that touches the same line retires in the same cycle as the store, while the data write has not yet finished?
The text says that the output of the store remains in the store buffer until after retirement -- it does not say that the entry is removed from the store buffer *immediately* upon retirement.
Intel has a very complex implementation of store to load forwarding that (in many cases) allows lower latency than serializing the accesses through the L1 Data Cache. Most of the model-specific sections in Chapter 2 of the Intel Optimization Manual include details of which cases each processor generation is able to forward at reduced latency.
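To make the forwarding idea concrete, here is a minimal toy sketch in Python. It is not how any Intel core is implemented: the `ToyCore` class, its method names, and the one-value-per-address simplification are all assumptions made for illustration. It only shows the principle that a load consults pending stores before the cache, so a store that has not yet been written to L1D is still visible to later loads on the same core.

```python
from dataclasses import dataclass


@dataclass
class StoreEntry:
    addr: int    # toy model: one value per address, no partial overlaps
    value: int


class ToyCore:
    """Hypothetical model: a store buffer in front of a dict standing in for L1D."""

    def __init__(self):
        self.cache = {}          # stands in for the L1 data cache
        self.store_buffer = []   # oldest entry first (program order)

    def store(self, addr, value):
        # The store only allocates a buffer entry; the cache is untouched.
        self.store_buffer.append(StoreEntry(addr, value))

    def load(self, addr):
        # Search youngest-to-oldest so the most recent matching store wins.
        for entry in reversed(self.store_buffer):
            if entry.addr == addr:
                return entry.value, "forwarded"
        return self.cache.get(addr, 0), "cache"

    def commit_oldest(self):
        # Only at commit does the entry leave the store buffer.
        entry = self.store_buffer.pop(0)
        self.cache[entry.addr] = entry.value


core = ToyCore()
core.store(0x40, 7)
before = core.load(0x40)   # served from the store buffer
core.commit_oldest()
after = core.load(0x40)    # served from the cache
```

In this sketch the load sees the value 7 both before and after the commit; only the source of the data changes.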
I don't think that anything in the documentation precludes additional (undocumented) buffering between the store buffer and the L1 Data Cache.
Stores stay in the store buffer until they commit to the L1 cache (or, in some cases, to another buffer such as a WC buffer).
They certainly can't do that *before* retirement, since the instruction isn't known to be on the good path yet, so retirement marks the start of the period in which the store can commit. At this point they are so-called senior stores, and they wait in line as they commit one-by-one (or in Ice Lake, sometimes two at a time) to the cache. This might take a long time: imagine if you have one store that misses to DRAM: all the other stores behind it will line up and wait until the line comes back from DRAM.
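The queueing effect described above can be sketched with a few lines of Python. This is a toy model under stated assumptions: all stores are taken to have retired already at `retire_cycle`, exactly one store commits at a time, and the latency numbers are made up for illustration (1 cycle for a hit on a line already owned, 200 cycles for a miss to DRAM). The function name `drain_store_buffer` is hypothetical.

```python
from collections import deque


def drain_store_buffer(commit_latencies, retire_cycle=0):
    """Return the cycle at which each senior store leaves the store buffer.

    commit_latencies: cycles each store needs to commit once it reaches the
    head of the queue (hypothetical numbers, not measured values).
    """
    queue = deque(commit_latencies)
    cycle = retire_cycle
    done = []
    while queue:
        cycle += queue.popleft()   # one commit at a time, in program order
        done.append(cycle)
    return done


# The second store misses to DRAM; the two cheap stores behind it must wait.
finish = drain_store_buffer([1, 200, 1, 1])
```

With these made-up latencies `finish` is `[1, 201, 202, 203]`: the last two stores would each take one cycle on their own, but their buffer entries stay occupied for over 200 cycles because commit is in order.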
To complicate matters, the "reading for ownership" in the above may not be 100% aligned with the actual cache state.
- Of course there is no need for an RFO if the cache line is already in the L1 in E or M state.
- The hardware also supports the PREFETCHW instruction, which has much the same effect as the read for ownership (though it may not use exactly the same opcode).
- Given the existence of the software prefetch for writing, there is nothing that prevents the hardware from issuing such a prefetch operation in advance of the retirement of the corresponding store.
- Because "read for ownership" invalidates the target cache line from all other caches in the system, the performance impact of incorrectly predicted RFOs is much higher than for incorrectly predicted loads.
- So some processors never issue HW prefetches for sequences of stores and other processors issue them much less aggressively than they issue HW prefetches for sequences of loads. Documentation is seldom adequate....
- The implementation of HW prefetch may be based on more "state" than a user may expect, and there may be more parameters in the implementation than a user might expect.
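To illustrate why requesting ownership ahead of retirement can matter, here is a rough Python sketch. Everything in it is an assumption for illustration: each store targets a different line, ownership takes a fixed `rfo_latency`, the data write itself takes one cycle, and without an early request the RFO is issued only when the store reaches the head of the queue. Real hardware may overlap requests in ways this model does not capture, and the function name `commit_cycles` is hypothetical.

```python
def commit_cycles(retire_cycles, rfo_latency, prefetch_lead=None):
    """Cycle at which each store finishes committing, in program order.

    retire_cycles: cycle at which each store retires (non-decreasing).
    rfo_latency:   cycles to obtain ownership of a line (hypothetical constant).
    prefetch_lead: None means the RFO is issued only once the store is at the
                   head of the queue; otherwise the RFO is issued this many
                   cycles before the store retires.
    """
    prev_done = 0
    out = []
    for retire in retire_cycles:
        start = max(prev_done, retire)        # retired and at head of queue
        if prefetch_lead is None:
            ready = start + rfo_latency       # ownership requested late
        else:
            ready = max(start, retire - prefetch_lead + rfo_latency)
        prev_done = ready + 1                 # one cycle to write the data
        out.append(prev_done)
    return out


late = commit_cycles([10, 11, 12], rfo_latency=12)
early = commit_cycles([10, 11, 12], rfo_latency=12, prefetch_lead=12)
```

Under these assumptions `late` is `[23, 36, 49]` and `early` is `[11, 12, 13]`. The flip side, as noted in the list above, is that an early RFO for a store that never retires has already invalidated the line in other caches.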