The memory ordering semantics of mfence versus those of locked instructions

Travis_D_ · 05-09-2018

The mfence instruction has existed for many years (and the lock prefix for even longer), and I have studied the system programming manual fairly carefully, yet something still isn't clear to me.

Both mfence and locked instructions have memory ordering effects, generally ensuring sequentially consistent semantics and preventing any reordering across them, at least with respect to normal accesses to write-back (WB) memory regions. Are there any cases, however, where the actual, documented or guaranteed memory ordering semantics differ between them? For example, do they differ when using non-temporal operations on WB memory regions? Or when using WC, WT, or other memory types besides WB (possibly mixed with accesses to WB regions)?

The system programming guide doesn't really provide a sufficiently precise treatment of the topic: section 8.2 deals with memory ordering, but it largely limits itself to the case of WB memory regions, and doesn't handle non-temporal (streaming) operations in a comprehensive way. Various other sections touch on the other cases, and some mention that mfence may be used for ordering (e.g., to flush write-combining buffers when dealing with WC memory regions), but they don't say that only mfence may be used (leaving open the possibility that lock-prefixed instructions also work in this capacity). Conversely, other locations mention only lock-prefixed instructions for ordering.

So the question is still outstanding: does mfence provide ordering guarantees in any case where a lock-prefixed instruction doesn't? Alternatively, and less likely, does a lock-prefixed instruction provide ordering guarantees in any case where mfence doesn't?

McCalpinJohn · 05-10-2018

It looks like the discussion of the WC memory type in Section 11.3 is intended to apply to WC stores in WB memory regions. Specifically, Section 11.3 says:

If the WC buffer is partially filled, the writes may be delayed until the next occurrence of a serializing event; such as, an SFENCE or MFENCE instruction, CPUID execution, a read or write to uncached memory, an interrupt occurrence, or a LOCK instruction execution.

So either MFENCE or a LOCKed instruction will suffice to ensure that the WC buffers are flushed. It is a bit frustrating that there is not an explicit reference to WC stores in the section on WB memory, so we are left to guess whether there are any subtle differences between WC in WC mode and WC in WB mode.
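To make the two alternatives concrete, here is a minimal sketch of what is being compared: non-temporal stores to ordinary WB memory, followed by either MFENCE or a dummy lock-prefixed instruction, and then a flag store. It assumes x86-64 and GCC or Clang (for the SSE2 intrinsics and the inline assembly). The names `payload`, `ready`, `fill_streaming`, `publish_with_mfence` and `publish_with_locked_op` are made up for illustration. The sketch only shows the instruction sequences; it does not demonstrate that the two are equivalent in every case, which is exactly the open question.

```c
#include <emmintrin.h>   /* SSE2: _mm_stream_si32, _mm_mfence */
#include <stdatomic.h>

#define N 16

/* Hypothetical shared data: a payload written with streaming stores
 * and a flag that a consumer would poll. */
static int payload[N] __attribute__((aligned(64)));
static atomic_int ready;

/* Write the payload with non-temporal (write-combining) stores. */
static void fill_streaming(int value) {
    for (int i = 0; i < N; i++)
        _mm_stream_si32(&payload[i], value);
}

/* Variant A: order the streaming stores with MFENCE. */
static void publish_with_mfence(int value) {
    fill_streaming(value);
    _mm_mfence();
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Variant B: order the streaming stores with a lock-prefixed
 * read-modify-write. Adding 0 to the top of the stack leaves the
 * data unchanged; only the locking side effect is wanted. */
static void publish_with_locked_op(int value) {
    fill_streaming(value);
    __asm__ __volatile__("lock; addl $0, (%%rsp)" ::: "memory", "cc");
    atomic_store_explicit(&ready, 1, memory_order_release);
}
```

Going by the sentence quoted from Section 11.3, either variant should cause the partially filled WC buffers to be written out before the flag store; whether any other documented guarantee differs between them is not something this sketch can show.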

The broader issue of ordering is almost exactly addressed in Section 8.2.5 "Strengthening or Weakening the Memory Ordering Model". This section mentions the use of IO instructions, LOCKed instructions, serializing instructions, and memory ordering instructions, with the following comments:

[Concerning IN and OUT instructions] ....Prior to executing an I/O instruction, the processor waits for all previous instructions in the program to complete and for all buffered writes to drain to memory. Only instruction fetch and page tables walks can pass I/O instructions. Execution of subsequent instructions do not begin until the processor determines that the I/O instruction has been completed.

[Concerning LOCKed instructions] ...Locking operations typically operate like I/O operations in that they wait for all previous instructions to complete and for all buffered writes to drain to memory (see Section 8.1.2, “Bus Locking”).

[Concerning serializing instructions] ....Like the I/O and locking instructions, the processor waits until all previous instructions have been completed and all buffered writes have been drained to memory before executing the serializing instruction.

[Concerning MFENCE] MFENCE — Serializes all store and load operations that occurred prior to the MFENCE instruction in the program instruction stream.

There are lots of places for subtle "gotchas" here.... Questions that come to mind are:

Can instruction fetches and page table walks pass LOCKed instructions or serializing instructions or memory fences?
Is there a difference between the concept "drain to memory" (used in the first 3 examples) and the type of ordering implied in the discussion of the memory fence instructions?
- In some architectures there are different conceptual frameworks used for WB memory and MMIO.
- For WB memory, the most that the processor (core) can do is ensure that the results of the store are visible to the coherence mechanism. This typically means flushing store buffers to the cache and flushing WC buffers to memory. Flushing the store buffers to cache is a local activity, so the core can "know" when it is complete. Flushing the WC buffers to memory is not local. Writes can be implemented as "posted" (expecting no completion message) or "non-posted" (expecting a completion message). The core would need to use a "non-posted" write if it needed to wait until the data was actually received by the memory controller. I don't know if there is enough information to know what low-level transactions Intel uses in these cases. Of course there is also the issue of buffering inside the memory controller -- flushing the WC buffers to memory may put the data in (snooped) buffers in the memory controller, with no way to enforce writing to the actual DRAM.
- For MMIO, the transactions are not local, so the issues are similar to those of flushing WC buffers. Writes to PCIe devices can be either posted (memory writes) or non-posted (IO and configuration writes). For posted memory writes, the core cannot know that the write has arrived at the destination without an additional synchronization step.
I often refer to this set of issues as the distinction between "guaranteed ordering" and "confirmed delivery".
- Some operations require that a set of memory operations arrive at the destination in a particular order.
  - This is often guaranteed by lower-level protocols if the operations are all writes, but not for mixtures of reads and writes, or for sequences of reads.
- Some operations require a guarantee that a memory operation has been accepted at the destination before a subsequent transaction can start.
  - This is usually a workaround when the actual ordering semantics desired cannot be specified by the protocol or when the actual ordering semantics desired cannot be supported by the hardware.
- The inability to specify and support a variety of different ordering models that can be chosen to match the application requirement(s) often leads to serious inefficiency for fine-grained interactions.
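As a small illustration of the "guaranteed ordering" versus "confirmed delivery" distinction for MMIO, the sketch below contrasts a plain posted write with a write followed by a read from the same device. The read-back is a commonly used "additional synchronization step": a read is non-posted, so its completion cannot return until the device has responded. Everything here is an assumption made for illustration: the register layout `struct dev_regs` and the helper names are hypothetical, and on real hardware the pointer would come from mapping a device BAR as uncached memory rather than from an ordinary variable.

```c
#include <stdint.h>

/* Hypothetical device register block (not a real device's layout). */
struct dev_regs {
    volatile uint32_t doorbell;  /* written by the core */
    volatile uint32_t status;    /* read back from the device */
};

/* Posted write: the store is issued, but the core gets no
 * indication of when it arrives at the device. */
static inline void write_posted(struct dev_regs *regs, uint32_t value) {
    regs->doorbell = value;
}

/* "Confirmed delivery" by read-back: the read from the same device
 * must complete before the function returns. In this sketch the
 * earlier write is assumed to have reached the device by then. */
static inline uint32_t write_confirmed(struct dev_regs *regs, uint32_t value) {
    regs->doorbell = value;
    return regs->status;
}
```

The cost of the second form is a full round trip to the device for every confirmed write, which is the kind of inefficiency for fine-grained interactions mentioned in the last bullet.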