Even though the mfence instruction has existed for many years (and the lock prefix for even longer), and despite a fairly careful study of the system programming manual, something still isn't clear to me.
Both mfence and locked instructions have memory ordering effects, generally ensuring sequentially consistent semantics and preventing any reordering across them at least with respect to normal accesses for write-back (WB) memory regions. Are there any cases, however, where the actual, documented or guaranteed memory ordering semantics differ between them? For example, when using non-temporal operations on WB memory regions? When using WC or WT or other types of memory regions other than WB (possibly also mixed with accesses to WB regions)?
The system programming guide doesn't really provide a precise enough treatment of the topic: section 8.2 deals with memory ordering, but it largely limits itself to the case of WB memory regions, and doesn't handle non-temporal (streaming) operations in a comprehensive way. Various other sections touch on the other cases, and some mention that mfence may be used for ordering (e.g., to flush write-combining buffers when dealing with WC memory regions), but they don't say that only mfence may be used (leaving open the possibility that lock-prefixed instructions also work in this capacity). Conversely, other locations mention only lock-prefixed instructions for ordering.
So the question is still outstanding: does mfence provide ordering guarantees in any case where a lock-prefixed instruction doesn't? Alternatively, and less likely, does a lock-prefixed instruction provide ordering guarantees in any case where mfence doesn't?
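To make the comparison concrete, here is a minimal sketch of the two fencing idioms in question, assuming GCC or Clang on x86-64. The helper names (fence_mfence, fence_locked, store_then_load) are made up for illustration, and the locked variant uses the common "lock add of zero to the stack" trick. For ordinary accesses to WB memory both are generally understood to prevent store-load reordering; the code does not demonstrate anything about the other memory types or non-temporal accesses asked about above.

```c
#include <assert.h>
#include <emmintrin.h>   /* _mm_mfence (SSE2) */

/* Full fence via the dedicated MFENCE instruction. */
static inline void fence_mfence(void)
{
    _mm_mfence();
}

/* Candidate full fence via a lock-prefixed read-modify-write.
 * Adding 0 to the word at the top of the stack changes no data;
 * only the ordering side effect of the LOCK prefix is wanted.
 * Assumption: x86-64 and GCC/Clang inline-asm syntax. */
static inline void fence_locked(void)
{
    __asm__ __volatile__("lock; addl $0, 0(%%rsp)" ::: "memory", "cc");
}

/* One side of a Dekker-style store-load litmus test: store to our own
 * flag, fence, then load the other side's flag. Without a fence between
 * the store and the load, x86 may let the load complete first. */
static int store_then_load(volatile int *mine, volatile int *theirs,
                           int use_lock)
{
    assert(mine != theirs);   /* the litmus test needs two locations */
    *mine = 1;
    if (use_lock)
        fence_locked();
    else
        fence_mfence();
    return *theirs;
}
```

In a real test, two threads would each call store_then_load with the flags swapped and check that they never both read 0. Run on a single thread, the function only shows that both variants compile and execute.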
It looks like the discussion in 11.3 on the WC memory type is intended to apply to WC stores in WB memory regions. Specifically, Section 11.3 says:
If the WC buffer is partially filled, the writes may be delayed until the next occurrence of a serializing event; such as, an SFENCE or MFENCE instruction, CPUID execution, a read or write to uncached memory, an interrupt occurrence, or a LOCK instruction execution.
So either MFENCE or a LOCKed instruction will suffice to ensure that the WC buffers are flushed. It is a bit frustrating that there is not an explicit reference to WC stores in the section on WB memory, so we are left to guess whether there are any subtle differences between WC in WC mode and WC in WB mode.
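As a rough illustration of that reading, the sketch below issues non-temporal stores (MOVNTI via _mm_stream_si32) to ordinary WB memory and then executes either MFENCE or a LOCKed instruction before returning. The function names and the use_lock switch are hypothetical, and the code assumes GCC or Clang on x86-64. It shows the pattern only; it cannot prove that the two variants are equivalent, since a single thread always observes its own stores in program order.

```c
#include <assert.h>
#include <emmintrin.h>   /* _mm_stream_si32, _mm_mfence (SSE2) */
#include <stddef.h>

/* LOCKed no-op: add 0 to the top-of-stack word.
 * Assumption: x86-64 and GCC/Clang inline-asm syntax. */
static inline void locked_noop(void)
{
    __asm__ __volatile__("lock; addl $0, 0(%%rsp)" ::: "memory", "cc");
}

/* Fill buf with non-temporal stores, then execute one of the events that
 * the quoted Section 11.3 text lists as causing WC buffers to be written
 * out: MFENCE (use_lock == 0) or a LOCKed instruction (use_lock != 0). */
static void nt_fill_then_fence(int *buf, size_t n, int value, int use_lock)
{
    assert(buf != NULL);
    for (size_t i = 0; i < n; i++)
        _mm_stream_si32(&buf[i], value);   /* MOVNTI: streaming store */

    if (use_lock)
        locked_noop();
    else
        _mm_mfence();
    /* A store that publishes buf to another thread would go here. */
}
```

A producer would call nt_fill_then_fence and then set a "ready" flag with a normal store. Whether the LOCKed variant is guaranteed to make the streamed data visible before that flag is exactly the point the manual leaves implicit.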
The broader issue of ordering is almost exactly addressed in Section 8.2.5 "Strengthening or Weakening the Memory Ordering Model". This section mentions the use of I/O instructions, LOCKed instructions, serializing instructions, and memory ordering instructions, with the following comments:
[Concerning IN and OUT instructions] ....Prior to executing an I/O instruction, the processor waits for all previous instructions in the program to complete and for all buffered writes to drain to memory. Only instruction fetch and page tables walks can pass I/O instructions. Execution of subsequent instructions do not begin until the processor determines that the I/O instruction has been completed.
[Concerning LOCKed instructions] ...Locking operations typically operate like I/O operations in that they wait for all previous instructions to complete and for all buffered writes to drain to memory (see Section 8.1.2, “Bus Locking”).
[Concerning serializing instructions] ....Like the I/O and locking instructions, the processor waits until all previous instructions have been completed and all buffered writes have been drained to memory before executing the serializing instruction.
[Concerning MFENCE] MFENCE — Serializes all store and load operations that occurred prior to the MFENCE instruction in the program instruction stream.
There are lots of places for subtle "gotchas" here.... Questions that come to mind are: