Intel Optimization Manual 18.104.22.168 gives a Description of Stores in Sandy Bridge:
Reading for ownership and storing the data happens after instruction retirement and follows the order of store instruction retirement.
As long as the store does not complete, its entry remains occupied in the store buffer.
This text looks very confusing to me.
How is it possible that the RFO and the data store of a store instruction happen after retirement? L1D on Skylake has a best-case latency of 4c. An RFO may hit in L2, which has a best-case latency of 12c.
Consider an aligned store that writes to a line in the Exclusive state: it is placed in the SB and then removed immediately after retirement. The data is then stored only after the instruction has retired and left the SB, and that takes at least 4c.
Where does the store reside after it has retired and been removed from the store buffer, but before storing the data has completed? What if a load that touches the same line retires in the same cycle as the store, but storing the data has not yet finished?
The text says that the output of the store remains in the store buffer until after retirement -- it does not say that the entry is removed from the store buffer *immediately* upon retirement.
Intel has a very complex implementation of store to load forwarding that (in many cases) allows lower latency than serializing the accesses through the L1 Data Cache. Most of the model-specific sections in Chapter 2 of the Intel Optimization Manual include details of which cases each processor generation is able to forward at reduced latency.
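To make the lifetime of a store buffer entry concrete, here is a minimal toy model in Python. It is only a sketch under strong assumptions: the class `ToyStoreBuffer` and its methods are hypothetical names, forwarding matches on exact addresses only (real hardware has size, alignment, and partial-overlap rules that differ by generation), and a plain dict stands in for the L1 Data Cache. It shows that an entry survives retirement and that a load to the same address is served from the buffer until the store commits.

```python
class ToyStoreBuffer:
    """Toy model: entries live from allocation until commit, not until retirement."""

    def __init__(self, cache):
        self.cache = cache    # dict address -> value, stands in for L1D (assumption)
        self.entries = []     # oldest first; each entry is [addr, value, retired]

    def store(self, addr, value):
        # Allocate an entry when the store executes (still speculative).
        self.entries.append([addr, value, False])

    def retire_oldest(self):
        # Mark the oldest not-yet-retired store as retired ("senior").
        for entry in self.entries:
            if not entry[2]:
                entry[2] = True
                return

    def commit_oldest(self):
        # Only a retired store at the head of the buffer may write the cache.
        if self.entries and self.entries[0][2]:
            addr, value, _ = self.entries.pop(0)
            self.cache[addr] = value
            return True
        return False

    def load(self, addr):
        # Search youngest to oldest so the most recent matching store wins.
        for a, v, _ in reversed(self.entries):
            if a == addr:
                return v, "forwarded"
        return self.cache.get(addr, 0), "cache"


sb = ToyStoreBuffer({0x40: 1})
sb.store(0x40, 2)
print(sb.load(0x40))       # (2, 'forwarded')
print(sb.commit_oldest())  # False: the store has not retired yet
sb.retire_oldest()
print(sb.load(0x40))       # (2, 'forwarded'): entry is still in the buffer
print(sb.commit_oldest())  # True: the senior store writes the cache
print(sb.load(0x40))       # (2, 'cache')
```

In this model a load never observes a gap between retirement and commit, because the entry is freed only at the moment the cache is updated.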
I don't think that anything in the documentation precludes additional (undocumented) buffering between the store buffer and the L1 Data Cache.
Stores stay in the store buffer until they commit to the L1 cache (or, in some cases, to another buffer such as a WC buffer).
They certainly can't do that *before* retirement, since the instruction isn't known to be on the good path yet, so retirement marks the earliest point at which the store can commit. At this point they are so-called senior stores, and they wait in line as they commit one-by-one (or in Ice Lake, sometimes two at a time) to the cache. This might take a long time: imagine if you have one store that misses to DRAM: all the other stores behind it will line up and wait until the line comes back from DRAM.
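The queueing effect can be sketched with a few lines of Python. This is a deliberately simplified model, not a description of any real core: `commit_times` is a hypothetical helper, the latencies are made-up numbers, stores commit strictly one at a time, and it ignores that hardware may start fetching lines for later stores early.

```python
def commit_times(retire_cycles, latencies):
    """Return the cycle at which each store finishes committing.

    retire_cycles[i]: cycle at which store i retires (program order).
    latencies[i]: assumed cycles store i needs once it reaches the head
                  of the queue (small for an L1 hit, large for a miss).
    """
    done = []
    prev_done = 0
    for retire, latency in zip(retire_cycles, latencies):
        # A store cannot start committing before it retires,
        # nor before the older store ahead of it has finished.
        start = max(retire, prev_done)
        prev_done = start + latency
        done.append(prev_done)
    return done


# Four stores retire on consecutive cycles; the second one misses to DRAM.
print(commit_times([0, 1, 2, 3], [1, 200, 1, 1]))  # [1, 201, 202, 203]
```

The third and fourth stores would each need a single cycle on their own, but they finish at cycles 202 and 203 because they wait behind the miss, and their store buffer entries stay occupied the whole time.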
To complicate matters, the "reading for ownership" in the above may not be 100% aligned with the actual cache state.