Are individual DMA writes to coherent memory atomic, or are they performed in discrete chunks (of cache-line size)?
I am curious whether DMA writes to coherent memory (allocated by dma_alloc_coherent) by a PCIe device are reordered somehow, e.g. the PCH reorders memory stores as an optimization, or the cache infrastructure flushes cache lines to memory in a different order from the one in which the DMA writes were issued.
When the CPU reads a 256-byte chunk of coherent memory, it sees unexpected (stale) data at a random location within the chunk (between the first and last Qword); however, the CPU can read the expected data after some time.
Can someone explain the observed behaviour, and suggest a solution to take care of it?
Unfortunately I can only answer part of this question....
DMA writes from the device to system memory traverse at least two protocols.
The ordering failure you are seeing is reminiscent of what you are allowed to see with "weakly-ordered" stores. If a core uses the Write-Combining memory type or uses non-temporal (streaming) stores to the WriteBack memory type, other processors are allowed to see the results of those stores out of order -- both *within* and *across* cache lines. This is documented in Section 11.3.1 of Volume 3 of the Intel SWDM.
What they don't explain is *why* this happens. The out-of-order visibility is due to the combination of two factors: (1) the processor cores are allowed to flush their "write-combining buffers" to memory in any order, and with any number of sub-transactions (in any order) that are supported by the undocumented on-chip fabric protocol; and (2) unlike "ordinary" stores (but like DMA writes), write-combining stores don't begin their interaction with the coherence fabric until the core has flushed the buffer and the resulting write transaction (which behaves almost exactly the same way as a DMA write from IO) has reached the coherence agent for that address.
So two transactions could be presented (from the IRP) to the mesh in order, but in general they are going to have to travel different distances to the corresponding CHAs, and these CHAs will be located at different distances from the various cores. It is not inconceivable that if the first transaction has to go all the way across the mesh and then send the invalidations all the way across the chip, a second transaction targeting a CHA adjacent to the IRP that sends an invalidation to a core that is "close to" the CHA could cause the second invalidation to occur in that core before the first -- leaving a window to read stale data from the first line after the second has been reloaded with current data that it obtained from the nearby L3 slice. (Intel processors that support Direct Cache Access (DCA) are typically configured to place the data from IO DMA writes into the L3 cache.)
For "weakly-ordered" stores, Section 22.34 of Volume 3 of the SWDM says that the "SFENCE" instruction is the most efficient way to ensure that ordering is observed between a set of weakly-ordered stores and subsequent normal stores. The compiler will generate slightly stronger MFENCE instructions after a loop using streaming stores to ensure that no later reads can be executed early enough to see the results of those streaming stores out-of-order.
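To make the SFENCE pattern concrete, here is a minimal user-space sketch of the CPU-side case only (streaming stores followed by a normal store). The `mailbox` struct, its field names, and the `publish` function are hypothetical names invented for illustration. The non-x86 branch is just a portability fallback and is not part of the technique being described.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>            /* _mm_stream_si64 (SSE2); also provides _mm_sfence */
#define HAVE_STREAMING_STORES 1
#else
#include <stdatomic.h>            /* fallback only: plain stores + release fence */
#define HAVE_STREAMING_STORES 0
#endif

/* Hypothetical mailbox: one cache line of payload, then a "ready" flag
 * in its own cache line. Names and layout are assumptions. */
struct mailbox {
    _Alignas(64) long long payload[8];       /* 8 x 8 bytes = 64 bytes */
    _Alignas(64) volatile long long ready;   /* written with a normal store */
};

static void publish(struct mailbox *mb, long long value)
{
    assert(((uintptr_t)mb % 64) == 0);       /* precondition: line-aligned */

    for (size_t i = 0; i < 8; i++) {
#if HAVE_STREAMING_STORES
        /* Weakly-ordered (non-temporal) store: other cores may observe
         * these out of order until a fence is executed. */
        _mm_stream_si64(&mb->payload[i], value + (long long)i);
#else
        mb->payload[i] = value + (long long)i;
#endif
    }

#if HAVE_STREAMING_STORES
    _mm_sfence();   /* order the streaming stores before the flag store below */
#else
    atomic_thread_fence(memory_order_release);
#endif

    mb->ready = 1;  /* normal store: a reader that sees 1 may read payload */
}
```

A consumer on another core would poll `ready` and read `payload` only after seeing it set. Note that this orders stores issued by a core; it says nothing about DMA writes arriving from an IO device.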
The part that I can't help with is finding the corresponding fence operation for IO DMA operations. Normally one uses an interrupt-driven approach, with the processor notified by interrupt after the DMA writes to coherent memory are complete. I have been unable to find a description of the specific minimal mechanisms required to ensure that all of the invalidations related to the DMA writes to coherent memory have been completed. It is possible that this is supposed to work correctly automatically (as long as the PCIe write transactions are to the same Traffic Class, the same Virtual Channel, and don't have either the Relaxed Ordering or No Snoop attribute bits set in the transaction header), but I can't find any clear documentation on the applicable ordering rules in the hardware....
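In the absence of a documented fence, one software-level workaround for a polling consumer is to have the device stamp every cache line with a sequence number and have the CPU reject any chunk whose lines do not all carry the expected value. The sketch below is only an illustration under loud assumptions: the layout (`struct line`, `struct chunk`, a tag in the last Qword of each line) and the function `read_chunk` are hypothetical, the device firmware would have to be changed to write the tags, and it assumes each 64-byte line becomes visible to the CPU as a unit, which nothing above establishes. It also relies on x86 load ordering plus `volatile` accesses rather than portable barriers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define QWORDS_PER_LINE 8   /* 64-byte cache line */
#define LINES_PER_CHUNK 4   /* 256-byte chunk, as in the question */

/* Hypothetical layout: the device writes the same sequence number into
 * the last Qword of every cache line of a chunk. */
struct line  { uint64_t data[QWORDS_PER_LINE - 1]; uint64_t seq; };
struct chunk { struct line line[LINES_PER_CHUNK]; };

/* Snapshot the chunk into *out. Returns true only if every line carried
 * expected_seq both before and after its data was copied; on false the
 * caller should retry later (or fall back to waiting for the interrupt). */
static bool read_chunk(const volatile struct chunk *src,
                       uint64_t expected_seq, struct chunk *out)
{
    for (int l = 0; l < LINES_PER_CHUNK; l++) {
        const volatile struct line *in = &src->line[l];

        uint64_t before = in->seq;                 /* tag before the copy */
        for (int q = 0; q < QWORDS_PER_LINE - 1; q++)
            out->line[l].data[q] = in->data[q];
        uint64_t after = in->seq;                  /* tag after the copy */

        if (before != expected_seq || after != expected_seq)
            return false;                          /* stale or changing line */
        out->line[l].seq = after;
    }
    return true;
}
```

The cost is one Qword of payload per cache line and a possible retry loop. The benefit is that a stale line in the middle of the chunk is detected instead of being consumed silently.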