Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Ensuring the completion of DMA write in coherent buffers


Hello All,

Are individual DMA writes to coherent memory atomic, or are they performed in discrete chunks (of cache-line size)?

I am curious to know whether DMA writes by a PCIe device to coherent memory (allocated by dma_alloc_coherent) can be reordered somehow, e.g. the PCH reorders memory stores as an optimization, or the cache infrastructure flushes cache lines to memory in a different order from the one in which the DMA writes were issued.


  1. Coherent buffers are allocated in the kernel using dma_alloc_coherent and mapped to user space using mmap(). Each buffer is divided into 256-byte chunks.
  2. Our proprietary device issues a single 256-byte DMA write to each chunk over the PCIe bus.
  3. Software detects the availability of new data by polling the first 2 bytes and the last 8 bytes of a chunk.
  4. After detecting the arrival of fresh data, the CPU reads the 256-byte chunk.
  5. SFENCE and LFENCE are used before reading the chunk.
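
The polling scheme in steps 3 to 5 might be sketched roughly as follows. This is a minimal user-space illustration, not the actual driver or application code: the marker layout (2 bytes at offset 0, 8 bytes at offset 248), the expected marker values, and the names `load_le`, `slot_ready`, and `try_read_slot` are all assumptions, and a C11 acquire fence stands in for the fence instructions mentioned in step 5.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define SLOT_SIZE  256u  /* chunk size from the post */
#define HDR_BYTES  2u    /* polled bytes at the start of a slot */
#define TRL_BYTES  8u    /* polled bytes at the end of a slot */

/* Assemble an n-byte little-endian value with one volatile read per byte.
 * Real code would more likely use a single naturally aligned load. */
static uint64_t load_le(const volatile uint8_t *p, size_t n)
{
    assert(n <= 8);
    uint64_t v = 0;
    for (size_t i = 0; i < n; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* Returns 1 when both polled markers carry the expected values
 * (the marker protocol itself is hypothetical). */
static int slot_ready(const volatile uint8_t *slot,
                      uint16_t want_hdr, uint64_t want_trl)
{
    uint64_t hdr = load_le(slot, HDR_BYTES);
    uint64_t trl = load_le(slot + SLOT_SIZE - TRL_BYTES, TRL_BYTES);
    return hdr == want_hdr && trl == want_trl;
}

/* Copies the slot out if it looks complete; returns 1 on success, 0 otherwise. */
static int try_read_slot(const volatile uint8_t *slot, uint16_t want_hdr,
                         uint64_t want_trl, uint8_t out[SLOT_SIZE])
{
    if (!slot_ready(slot, want_hdr, want_trl))
        return 0;
    /* Keeps the payload reads after the marker reads. This orders the CPU's
     * own loads only; it says nothing about when the device's writes to the
     * middle of the slot become visible. */
    atomic_thread_fence(memory_order_acquire);
    for (size_t i = 0; i < SLOT_SIZE; i++)
        out[i] = slot[i];
    return 1;
}
```

Note that with 64-byte cache lines the two markers sit in the first and fourth line of a slot, so nothing in this scheme observes the state of the two middle lines directly.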



Reading the 256-byte chunk from coherent memory gives unexpected (stale) data at random locations within the chunk (between the first and last Qword of the chunk); however, the CPU can read the expected data after some time.



Can someone explain the observed behaviour and suggest a solution to take care of this?

1 Reply

Unfortunately I can only answer part of this question....

DMA writes from the device to system memory traverse at least two protocols.  

  1. In the first step, the device sends a PCIe Memory Write transaction to the PCIe controller.
    • PCIe allows payload sizes up to 4096 Bytes, but this will be limited to the smaller of the sizes supported by the root complex and the device.  Since the root complex belongs to the CPU, it is out of your control, and it may limit payloads to less than 256 Bytes.
  2. In the second step, the PCIe controller on the processor chip sends the data to "system memory" using the processor's proprietary on-chip fabric protocol.
    • These protocols are optimized for transferring 64 Byte cache lines around the system, so it would not be surprising if no larger sizes were supported "natively".   Section 8.1 of Volume 3 of the Intel Architectures Software Developer's Manual (document 325384-071, October 2019) describes the minimal set of transaction types that are guaranteed to be atomic.
    • This is especially likely given Intel's approach to handling coherence -- consecutive cache line addresses are mapped to (a permutation on) the set of Caching and Home Agents (CHAs), with up to 28 CHAs per chip (or 38 on Xeon Phi x200).  
    • On Xeon Scalable Processors (SKX & CLX), the interface between IO and coherent memory is a box called the IRP.  (This is minimally documented in the Scalable Memory Family Uncore Performance Monitoring Manual, document 336274.)  Any ordering relationships within a sequence of 64-Byte (or smaller) transactions would have to be managed by the IRP (but I don't know of any real description of what the IRP does or how it works.)
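
To make the two-stage splitting concrete, the sketch below counts how many PCIe write packets and how many 64-byte cache lines a single DMA write might turn into. The function names are hypothetical, and the model is deliberately simplified: it assumes the device always fills packets up to the negotiated maximum payload and ignores other PCIe packetization rules, which real hardware need not follow.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* cache-line size on current x86 processors */

/* Number of PCIe Memory Write packets needed for len bytes when each packet
 * carries at most max_payload bytes (simplified model). */
static unsigned tlp_count(unsigned len, unsigned max_payload)
{
    assert(max_payload > 0);
    return (len + max_payload - 1) / max_payload;
}

/* Number of distinct cache lines touched by the byte range [addr, addr + len). */
static unsigned lines_touched(uint64_t addr, unsigned len)
{
    assert(len > 0);
    uint64_t first = addr / CACHE_LINE;
    uint64_t last  = (addr + len - 1) / CACHE_LINE;
    return (unsigned)(last - first + 1);
}
```

Under these assumptions, a 256-byte write to a 256-byte-aligned slot is one packet if the maximum payload is 256 bytes and two packets if it is 128 bytes, and in both cases it touches four cache lines, each of which may be handled by a different CHA.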

The ordering failure you are seeing is reminiscent of what you are allowed to see with "weakly-ordered" stores.  If a core uses the Write-Combining memory type or uses non-temporal (streaming) stores to the WriteBack memory type, other processors are allowed to see the results of those stores out of order -- both *within* and *across* cache lines.  This is documented in Section 11.3.1 of Volume 3 of the Intel SWDM.   What they don't explain is *why* this happens.  The out-of-order visibility is due to the combination of two factors: (1) the processor cores are allowed to flush their "write-combining buffers" to memory in any order, and with any number of sub-transactions (in any order) that are supported by the undocumented on-chip fabric protocol; and (2) unlike "ordinary" stores (but like DMA writes), write-combining stores don't begin their interaction with the coherence fabric until the core has flushed the buffer and the resulting write transaction (which behaves almost exactly the same way as a DMA write from IO) has reached the coherence agent for that address.

So two transactions could be presented (from the IRP) to the mesh in order, but in general they are going to have to travel different distances to the corresponding CHAs, and these CHAs will be located at different distances from the various cores.  It is not inconceivable that if the first transaction has to go all the way across the mesh and then send the invalidations all the way across the chip, a second transaction targeting a CHA adjacent to the IRP that sends an invalidation to a core that is "close to" the CHA could cause the second invalidation to occur in that core before the first -- leaving a window to read stale data from the first line after the second has been reloaded with current data that it obtained from the nearby L3 slice. (Intel processors that support Direct Cache Access (DCA) are typically configured to place the data from IO DMA writes into the L3 cache.)

For "weakly-ordered" stores, Section 22.34 of Volume 3 of the SWDM says that the "SFENCE" instruction is the most efficient way to ensure that ordering is observed between a set of weakly-ordered stores and subsequent normal stores.   The compiler will generate slightly stronger MFENCE instructions after a loop using streaming stores to ensure that no later reads can be executed early enough to see the results of those streaming stores out-of-order. 
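
For the CPU-side case described above, the usual pattern looks something like the sketch below: a buffer is filled with streaming stores, an SFENCE follows, and only then is an ordinary store used to publish a flag. This assumes an x86 target with SSE2 and a compiler that provides `<emmintrin.h>`; the function name `publish_streaming` and the flag protocol are made up for illustration. It applies to stores issued by a core, not to DMA writes from a device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_loadu_si128, _mm_sfence */

/* Copy len bytes to dst with non-temporal stores, then set *ready_flag.
 * dst must be 16-byte aligned and len a multiple of 16. */
static void publish_streaming(void *dst, const void *src, size_t len,
                              volatile uint32_t *ready_flag)
{
    assert(((uintptr_t)dst % 16) == 0 && (len % 16) == 0);

    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    /* Weakly-ordered stores: other cores may observe these in any order. */
    for (size_t i = 0; i < len / 16; i++)
        _mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));

    /* SFENCE: all earlier streaming stores become globally visible before
     * any store that follows the fence. */
    _mm_sfence();

    /* Ordinary store that a consumer can poll on. */
    *ready_flag = 1;
}
```

A consumer that sees the flag set can then rely on the buffer contents being visible, which is the guarantee the original poster is looking for on the DMA side.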

The part that I can't help with is finding the corresponding fence operation for IO DMA operations.   Normally one uses an interrupt-driven approach, with the processor notified by interrupt after the DMA writes to coherent memory are complete.  I have been unable to find a description of the specific minimal mechanisms required to ensure that all of the invalidations related to the DMA writes to coherent memory have been completed.   It is possible that this is supposed to work correctly automatically (as long as the PCIe write transactions are to the same Traffic Class, the same Virtual Channel, and don't have either the Relaxed Ordering or No Snoop attribute bits set in the transaction header), but I can't find any clear documentation on the applicable ordering rules in the hardware....
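
One thing that can be checked from software is whether the device is even permitted to use Relaxed Ordering or No Snoop, and what maximum payload size was negotiated. The sketch below decodes a raw value of the PCI Express Device Control register; the bit masks match the `PCI_EXP_DEVCTL_*` constants in the Linux kernel headers, while the struct and function names are hypothetical. The enable bits only allow the device to set these attributes; whether a particular write actually carries them is decided by the device itself, so a clear bit is informative but a set bit is not conclusive.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of the PCI Express Device Control register. */
#define DEVCTL_RELAX_EN    0x0010u  /* Enable Relaxed Ordering */
#define DEVCTL_PAYLOAD     0x00E0u  /* Max_Payload_Size, bits 7:5 */
#define DEVCTL_NOSNOOP_EN  0x0800u  /* Enable No Snoop */

struct devctl_info {
    int      relaxed_ordering_enabled;
    int      no_snoop_enabled;
    unsigned max_payload_bytes;
};

/* Decode a raw 16-bit Device Control value. */
static struct devctl_info decode_devctl(uint16_t devctl)
{
    unsigned mps_code = (devctl & DEVCTL_PAYLOAD) >> 5;
    /* Codes 0..5 mean 128..4096 bytes; 6 and 7 are reserved. */
    assert(mps_code <= 5);

    struct devctl_info info;
    info.relaxed_ordering_enabled = (devctl & DEVCTL_RELAX_EN) != 0;
    info.no_snoop_enabled         = (devctl & DEVCTL_NOSNOOP_EN) != 0;
    info.max_payload_bytes        = 128u << mps_code;
    return info;
}
```

On Linux the same fields appear in the `DevCtl:` lines of `lspci -vv` output (as `RlxdOrd`, `NoSnoop`, and `MaxPayload`), which may be an easier way to inspect them than reading configuration space by hand.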
