Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

PCIe access from multiple threads to FPGA results in interspersed transactions



Basically, I have a region of memory marked as write-combined that is mapped to the FPGA's BAR0, and I transfer arbitrary amounts of data across it to the FPGA. Occasionally, when I call the transfer function from multiple threads, the transactions from the different threads get interspersed. The function that does the writing has a lock around it, so I'm not sure how this is happening. Here is, essentially, the function I am using:


void send_message(uint16_t message_length, uint8_t *message, exchange_top_state_t *state) {
  uint64_t *my_ptr = (uint64_t *) message;
  uint32_t actual_message_length;
  uint64_t *dma_ptr = (uint64_t *) state->dma;
  int i;
  uint8_t *first_byte = (uint8_t *) my_ptr;

  // message length in 64-bit words:
  if (*first_byte == 0x1 || *first_byte == 0x2 || *first_byte == 0x3 ||
      *first_byte == 0x4 || *first_byte == 0x5) {
    actual_message_length = ((message_length + 4) >> 3) + (((message_length + 4) & 0x7) > 0);
  } else {
    actual_message_length = (message_length >> 3) + ((message_length & 0x7) > 0) + 2;
  }

  while (actual_message_length > 0) {
    if (actual_message_length >= 8) {
      for (i = 0; i < 8; i++) {
        *(dma_ptr + i) = *(my_ptr + i);
      }
      my_ptr  += 8;
      dma_ptr += 8;
      actual_message_length -= 8;
    } else {
      for (i = 0; i < actual_message_length; i++) {
        *(dma_ptr + i) = *(my_ptr + i);
      }
      actual_message_length = 0;
    }
  }
}


When this gets called by multiple threads, I end up with interspersed transactions on the PCIe bus (this is a dump of the lower 128 bits of the transactions from a Xilinx PCIe endpoint):


Data = 009a0000 00000810 00000000 fbe00000

Data = 009a0001 00000810 00000000 fbe00040

Data = 009a0002 00000810 00000000 fbe00000

Data = 009a0003 00000810 00000000 fbe00040

Data = 009a0000 0000080a 00000000 fbe00080

Data = 009a0001 0000080a 00000000 fbe00080


If I add an SFENCE before the pthread_spin_unlock() at the end, this doesn't happen. I am trying to understand why that is.


The FPGA driver I am using is a simple UIO driver, so the memory region I am writing to on the FPGA is mapped via dma_ptr. I'm also aware that I have "over-fenced" this code, but I'm a bit puzzled as to why I get the overlapping transactions.

1 Reply
Honored Contributor III

My interpretation of the documentation would be that the MFENCE instructions would be (more than) enough to ensure the ordering you want, but the discussions of this topic are usually written from the perspective of processor coherence and not low-level IO ordering.   (I spent several years working on these sorts of issues when I was the technical lead for AMD's tightly-coupled accelerator program.  Mapping processor instructions to specific PCIe transactions has more complexity than one might guess, and if the PCIe transactions are not the ones you expect, then ordering of those transactions may also contain surprises.)

Section 11.3 of Volume 3 of the SDM (document 325384-070) says:

If the WC buffer is partially filled, the writes may be delayed until the next occurrence of a serializing event; such as, an SFENCE or MFENCE instruction, CPUID execution, a read or write to uncached memory, an interrupt occurrence, or a LOCK instruction execution.

Section 8.2.5 "Strengthening or Weakening the Memory-Ordering Model" implies that MFENCE is a superset of SFENCE, so it should be enough to ensure coherence.

Section 22.34 "Store Buffers and Memory Ordering" notes that the SFENCE instruction should be used between weakly-ordered and normally-ordered stores, and this is what I see the Intel compiler generating after non-temporal stores.   SFENCE should be cheaper than MFENCE, but the low-level implementations might not have identical effects with regard to write-combining buffers.

Section 11.3.1 goes into a bit more detail (emphasis added):

When one or more WC buffers has been filled, the processor has the option of evicting the buffers to system memory. The protocol for evicting the WC buffers is implementation dependent and should not be relied on by software for system memory coherency. When using the WC memory type, software must be sensitive to the fact that the writing of data to system memory is being delayed and must deliberately empty the WC buffers when system memory coherency is required.

Again, this suggests that MFENCE should be enough to ensure ordering with respect to the processor coherency protocol, but it is not clear that this is guaranteed to have any implications for IO transactions to MMIO space.

It is conceivable that neither of these instructions actually guarantees the IO ordering that you want, but that including multiple FENCE instructions puts a long enough delay between the WC stores of the different threads to eliminate the appearance of ordering violations in your tests.
