Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Doubt about the MFENCE instruction

MChau9
Beginner

Hello All,

Can anybody confirm whether the MFENCE instruction ensures that stores are committed to the final destination memory, or whether it just flushes the stores out of the CPU's local buffers to the bus interface unit (BIU)?

The Intel Software Developer's Manual says MFENCE ensures global visibility of the preceding loads/stores, but does that apply to memory-mapped I/O (IOMEM) as well?

 

Background: I have been writing a program to measure the latency of a PIO write to PCIe-based FPGA memory. My problem is how to ensure that the PIO write has completed, since it is a posted write.

The pseudo code is:

 

          1) open device
          2) mmap device memory into the program's address space
          3) clock_gettime(CLOCK_MONOTONIC, &start)
          4) PIO write to the mmap'ed memory
          5) _____________ (ensure the write reached destination memory)
          6) clock_gettime(CLOCK_MONOTONIC, &end)
          7) latency = end - start
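
In C, the measurement looks roughly like the following (a minimal sketch: the device path and mapping size are placeholders, and error checking is omitted):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/my_fpga", O_RDWR);            /* 1) open device (placeholder path) */
    volatile uint64_t *bar = mmap(NULL, 4096,
                                  PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0); /* 2) map device memory */
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);           /* 3) start timer */
    bar[0] = 0xDEADBEEFull;                           /* 4) PIO write to MMIO */
    /* 5) ___ ensure the write reached destination memory ___ */
    clock_gettime(CLOCK_MONOTONIC, &end);             /* 6) stop timer */

    long ns = (end.tv_sec - start.tv_sec) * 1000000000L
            + (end.tv_nsec - start.tv_nsec);          /* 7) latency = end - start */
    printf("latency = %ld ns\n", ns);

    munmap((void *)bar, 4096);
    close(fd);
    return 0;
}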

 

I have two options for step 5):

          --> I can issue a PIO read from the FPGA memory after the PIO write; since PIO operations are processed in order, a completed read guarantees that the write completed.
                    The problem with this is that it spoils the measured latency figure, since a PIO read adds a lot of latency of its own (the overhead of a PIO read is much higher than that of a PIO write).
          --> I can use MFENCE.
                    The problem with this is: does it ensure the data was written to the FPGA memory, or only that the write was initiated (the CPU hands the data to the TLP layer of PCIe and MFENCE returns)?

 

Is there anything else available besides these two? If not, which of them is more justified for measuring PIO write latency with high precision?

 

Any clarification is highly appreciated.

3 Replies
McCalpinJohn
Honored Contributor III

I am not sure that the question you are asking is sufficiently precise (or that Intel has disclosed its implementations in enough detail to make it possible to ask such questions precisely...).  What do you mean by "completed" for a posted store?  Do you mean that the bits have shown up on the PCIe bus?  You can timestamp the receipt of the message in the FPGA, but you are going to need round trips to establish a correspondence between the clocks available to the processor core and the clocks available to the FPGA.

If the MMIO region is uncached, then each uncached store will be strongly ordered with respect to all other memory references.  Section 8.2.5 of Volume 3 of the Intel Architectures SW Developer's Manual discusses this with respect to the IN and OUT instructions, and suggests that the same degree of ordering applies to all load and store operations to memory regions of type UC.  UC stores are slow, but no mechanism (such as store buffers or write-combining buffers) deliberately delays them, since these operations are not allowed to be combined or re-ordered in any way.  It is possible that some implementation-dependent feature may delay the transmission of the bits from the core to the PCIe interface, but it seems unlikely that you would be able to do anything about it.

If the MMIO region is mapped as Write-Combining, then you are dealing with architecturally-defined write-combining buffers that are expected to delay the transmission of the bits from the core to the PCIe interface whenever the entire 64-Byte buffer has not been written.  Intel's architectural documents are pretty careful to avoid any suggestion that you can "push" the contents of the write-combining buffers programmatically.  The write-combining buffers do eventually get "pushed", of course, but other than the approach of writing all 64 Bytes, Intel's definitions focus on whether or not subsequent (in program order) memory references are allowed to take place before the write-combining buffer is flushed.  The MFENCE instruction *probably* pushes any partially-filled WC buffers quickly, but it does not have to be implemented this way; it is quite possible that it *usually* pushes the partially-filled WC buffers quickly but *sometimes* does not (because some implementation-dependent internal state makes that operation impossible to do "immediately").
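
To illustrate the "write all 64 Bytes" approach, here is a minimal sketch (my own construction, not code from the Intel manuals): four consecutive 128-bit nontemporal stores fill one write-combining buffer, which is then eligible to flush as a single 64-Byte transaction.

#include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_sfence */
#include <stdint.h>

/* Write one full 64-Byte line to a WC-mapped MMIO region.
 * 'dst' must be 64-Byte aligned; once all four 16-Byte stores
 * land, the WC buffer is full and eligible to flush as a single
 * 64-Byte PCIe write. */
static inline void wc_write_line(volatile void *dst, __m128i v)
{
    __m128i *p = (__m128i *)(uintptr_t)dst;   /* drop volatile for the intrinsic */
    _mm_stream_si128(p + 0, v);
    _mm_stream_si128(p + 1, v);
    _mm_stream_si128(p + 2, v);
    _mm_stream_si128(p + 3, v);
    _mm_sfence();   /* order the WC flush ahead of later stores */
}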

This distinction between "time" and "order" might help you re-formulate your question.  You may not be able to determine "when" the PIO write is completed, but it is pretty easy to ensure that some other event does not happen until after the PIO write has completed.

 

 

MChau9
Beginner

Thank you, Dr. Bandwidth, for your prompt and elaborate reply.

First, I forgot to mention that my MMIO region is of the write-combining memory type.

I was just worried about the correctness of my benchmark: my data (the PIO writes) should not still be in flight (on the PCIe bus) when I stop the timer.

Or is it okay to stop the timer right after the last PIO write and let the reliable PCIe transport eventually deliver the data to the FPGA memory?

What I understand from your reply is that it is OK if the data is 64 bytes and aligned to a cache-line address, since the WC buffers are flushed as soon as they get filled [1].

In one of your technical papers on ResearchGate, "Low level micro benchmarks of processor to FPGA MMIO", you report that the PIO write bandwidth saturates at 2.92 GB/s. However, I can only reach 2 GB/s at a 4 KB data size, after which the bandwidth decreases and saturates at 650 MB/s. May I ask at what point you stop your timer (make the second RDTSC call) after writing the data in your benchmark?

I figured out the following three options for when to take the end timestamp:

  1. Issue one PIO read after all the PIO writes, then subtract the PIO read latency from the result.

  2. Use MFENCE after all the PIO writes.

  3. Write only data sizes that are multiples of 64 bytes, because the WC buffers are flushed as soon as they are filled [1].

Which method do you think gives the most precise and reliable readings?

PS: my CPU supports constant_tsc and nonstop_tsc. The overhead of RDTSCP on my system (2.6 GHz) is 46 cycles (~17 ns).
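
That figure came from a measurement loop roughly like the following (a sketch, assuming GCC/Clang's __rdtscp intrinsic from x86intrin.h; the average includes the loop overhead itself):

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp (GCC/Clang) */

int main(void)
{
    unsigned aux;
    const int n = 1000000;
    uint64_t t0 = __rdtscp(&aux);
    for (int i = 0; i < n; i++)
        (void)__rdtscp(&aux);               /* back-to-back timer reads */
    uint64_t t1 = __rdtscp(&aux);
    printf("RDTSCP: %.1f cycles per call (loop overhead included)\n",
           (double)(t1 - t0) / n);
    return 0;
}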

 

  1. Steen Larsen and Ben Lee, "Re-evaluation of Programmed I/O with Write-Combining Buffers to Improve I/O Performance on Cluster Systems," NAS 2015: 10th IEEE International Conference on Networking, Architecture and Storage, Boston, USA. Electronic ISBN: 978-1-4673-7891-8, IEEE Xplore digital library. [Section 3.A or column 6]

 

McCalpinJohn
Honored Contributor III

My PCIe write throughput tests typically used 2 MiB payloads. Each cache line was written using 4 consecutive 128-bit nontemporal stores. The processor only has a handful of Write-Combining buffers; the actual number does not matter, since it is negligible compared to the 32768 cache lines in a 2 MiB region. So even if the ending timer is read before the write-combining buffers have flushed, the underestimate of the elapsed time could not be as large as 1/1000th of the measured time.

For shorter tests I would probably do tests with and without MFENCE, and would look at elapsed time as a function of transfer size to see if there is any indication of systematic variation from the expected timings.

If this is just a timer issue, then the errors are a minor concern and can usually be bounded by testing with/without MFENCE and testing various sizes.   If it is a correctness issue -- e.g., you need to be sure that the data has been sent to the FPGA because you are going to turn off the power or something -- then you will need some sort of round-trip handshaking.
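
A minimal sketch of that kind of test, assuming a WC-mapped, 64-Byte-aligned MMIO pointer (mapping code omitted; the use_mfence flag selects the two variants):

#include <emmintrin.h>   /* _mm_stream_si128, _mm_mfence */
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define XFER_BYTES (2u << 20)   /* 2 MiB payload = 32768 cache lines */

/* Returns elapsed seconds for streaming XFER_BYTES to a WC-mapped,
 * 64-Byte-aligned MMIO region 'bar'. */
double time_pio_writes(volatile void *bar, int use_mfence)
{
    __m128i v = _mm_set1_epi32(0x5A5A5A5A);
    __m128i *p = (__m128i *)(uintptr_t)bar;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t line = 0; line < XFER_BYTES / 64; line++, p += 4) {
        _mm_stream_si128(p + 0, v);   /* 4 x 128-bit nontemporal stores */
        _mm_stream_si128(p + 1, v);   /*   = one full 64-Byte WC buffer */
        _mm_stream_si128(p + 2, v);
        _mm_stream_si128(p + 3, v);
    }
    if (use_mfence)
        _mm_mfence();                 /* "with MFENCE" variant */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

Comparing the two variants over a range of transfer sizes bounds the timing error contributed by any un-flushed write-combining buffers.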
