Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

What are the semantics (ordering/atomicity) of concurrent PCIe transactions targeting the same memory?

BinYan0316

Hi,

I would like to know the semantics (ordering / atomicity) that Intel CPUs guarantee for concurrent PCIe transactions (reads and writes) targeting the same memory address.

Specifically, let's say a device issues two PCIe writes (e.g., 256 bytes) simultaneously, targeting the same memory address.

- Is there any granularity at which memory will be updated atomically? For example, 8 bytes, 64 bytes, or none?

- Is there any ordering ensured between the two writes? 

- Will a PCIe write update memory in address order (i.e., from lower address to higher address)?

Furthermore, suppose the device also issues a concurrent PCIe read targeting the same address.

- Will the read witness atomic updates at any granularity?

- Will the read access memory in address order?

What I know: I've read the PCIe specification and understand the basic ordering rules. However, I do not know whether the CPU's memory controller (or cache?) follows similar ordering. (For example, although I know PCIe will not reorder two write transactions, I still cannot figure out whether the CPU's memory controller will reorder the two writes.)

Is there a way to know the ordering/atomicity semantics of concurrent PCIe transactions? 

1 Reply
McCalpinJohn

How would you define "simultaneously" in this context?

How do you think a device could "generate" two PCIe write transactions in one cycle?  The PCIe block in the device is only going to be able to drive the transmission of one transaction on the PCIe bus at a time.  The IIO block in the processor mesh is limited to using a single data path to the memory controller, with each cache line of data requiring two mesh transfers over three mesh cycles. 

With multiple PCIe devices, it is possible to generate multiple writes to the same system memory address, but this is a race condition that is not guaranteed to deliver well-defined results.  

In general, one would expect something like:

  1. cacheline-aligned PCIe write transactions will deliver each cache line of data atomically to the system memory controller owning the target cache line address, and
  2. the memory controller will process each 64-Byte transaction atomically, and in some order.
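To make the granularity in the two points above concrete, here is a minimal sketch (my illustration, not anything from Intel documentation) that splits a PCIe write into its per-cache-line pieces. The function name `split_into_cache_lines` and the example addresses are hypothetical; the only assumption taken from the discussion is the 64-byte cache line. Under the expectation stated above, only the pieces marked as full lines would be candidates for atomic delivery.

```python
CACHE_LINE = 64  # bytes; the 64-byte cache line assumed in the discussion above


def split_into_cache_lines(addr, length):
    """Split a write of `length` bytes at `addr` into per-cache-line pieces.

    Returns a list of (line_base, offset_in_line, nbytes, is_full_line).
    """
    pieces = []
    end = addr + length
    while addr < end:
        base = addr - (addr % CACHE_LINE)            # start of the containing line
        nbytes = min(end, base + CACHE_LINE) - addr  # bytes of this write in the line
        pieces.append((base, addr - base, nbytes, nbytes == CACHE_LINE))
        addr += nbytes
    return pieces


# Hypothetical addresses: one cacheline-aligned 256-byte write, one misaligned by 32 bytes.
aligned = split_into_cache_lines(0x1000, 256)
unaligned = split_into_cache_lines(0x1020, 256)

for name, pieces in (("aligned", aligned), ("unaligned", unaligned)):
    print(name)
    for base, off, n, full in pieces:
        kind = "full line" if full else "partial line"
        print(f"  line 0x{base:x} +{off:2d}: {n:2d} bytes ({kind})")
```

The aligned 256-byte write becomes four full-line pieces, while the misaligned one touches five lines, and its first and last pieces are partial-line writes that fall outside the well-behaved case.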

Looking at this from the mesh, PCIe writes entering the chip through different IIO blocks will take different paths to the system memory controller.   There are three cases to consider here:

  1. In many cases the Y-X routing will result in the two transactions entering the IMC block on the same mesh link, where of course they have to be ordered (taking 4 mesh cycles to deliver both data transfers -- interleaved in alternating cycles).  
  2. In some cases the Y-X routing will result in the two transactions entering the IMC block on opposite directions of the same mesh link.  In this case the hardware requires that the data be received in alternate cycles (i.e., a mesh stop can only receive traffic in the "up" direction on even cycles and in the "down" direction on odd cycles - or the reverse).  (Ice Lake Xeon with the 40-core die is the first processor with PCIe interfaces on both ends of a column containing a memory controller.)
  3. Finally, it is possible that the Y-X mesh routing will result in one of the writes entering on a vertical mesh link and the other entering on a horizontal mesh link.  For this case I have not seen any official statements about the ability of a target to receive data from both links in the same cycle (and unfortunately the BIOS in my system prevents me from accessing the M2M performance counters that would help me to test this at the IMCs).   From measurements at core/LLC mesh stops I have seen that this case does generate retries on the mesh (meaning that at least some of the time the vertical and horizontal traffic "collides" and results in a retry of the transaction on the vertical link), but the retry rates are low enough that I have not tried to model the mechanism(s).  

So in the first two cases the PCIe writes have to arrive at the IMC in some order, and in the third case the PCIe writes *might* have to arrive at the IMC in some order.  If they are actually allowed to be received in the same mesh cycle on horizontal and vertical mesh inputs, they are almost certainly ordered internally.  Since this is a race condition and any ordering is allowed, I assume something simple like "horizontal first, vertical second" would serve as an ordering condition (if this case is allowed to happen at all).

These cases are my moderately-informed guesses as to the most likely behavior for the well-behaved case of cacheline-aligned full-cacheline or multi-cacheline writes.   Nothing bigger than a cache line is ever guaranteed to be atomic -- especially since cache lines are interleaved across memory controller channels (usually at a granularity of naturally-aligned 256-byte blocks).  For transactions smaller than cache lines (or crossing cache line boundaries) the architecture does not define the ordering -- so the hardware is allowed to provide any interleaving of the incoming data at any byte-level granularity.  This is pretty clear from the discussions of ordering in the Intel Software Developer's Manual (SWDM) (e.g., Chapter 9 of Volume 3, document 325384, revision 082).
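As an illustration of what "nothing bigger than a cache line is atomic" means for the original question, the sketch below enumerates the final memory contents that would be possible for two racing 256-byte cacheline-aligned writes. It assumes, per the guesses above and not per any documented guarantee, that each 64-byte line is written atomically and that the "winner" of each line is decided independently. The function name `possible_final_states` and the payloads are hypothetical.

```python
from itertools import product

CACHE_LINE = 64  # bytes


def possible_final_states(write_a, write_b):
    """Enumerate final buffers when each cache line is written atomically,
    but which write lands last is decided independently for each line."""
    assert len(write_a) == len(write_b)
    assert len(write_a) % CACHE_LINE == 0
    n_lines = len(write_a) // CACHE_LINE
    states = set()
    # For every line, either write A or write B is the last one processed.
    for winners in product((write_a, write_b), repeat=n_lines):
        final = b"".join(
            src[i * CACHE_LINE:(i + 1) * CACHE_LINE]
            for i, src in enumerate(winners)
        )
        states.add(final)
    return states


a = b"A" * 256  # payload of the first write
b = b"B" * 256  # payload of the second write
states = possible_final_states(a, b)
mixed = [s for s in states if s not in (a, b)]

print(len(states))  # 16 possible final buffers (2 choices for each of 4 lines)
print(len(mixed))   # 14 of them are a mix of the two writes
```

Under this model only 2 of the 16 outcomes look like "one whole write won", so software should not rely on a 256-byte write appearing as a unit. For partial-line or misaligned writes the model is too optimistic, since any byte-level interleaving is allowed.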
