Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Write Combine Performance and Out of Order

CHUL_K_
Beginner

Hi,

I am a newbie on the subject of Write Combining. I am measuring burst IO write performance via mmap on 64-bit Linux and trying to understand how WC affects IO memory writes.  I have several basic questions about using WC for this purpose.

The following is my example test setup for a burst write with Write Combine mode enabled:

1. The device driver maps the IO memory region using ioremap_wc (MTRR).  This IO memory is a non-prefetchable region. The PAT entry can be set with either the write-combine or the uncached flag.

2. The device driver provides an mmap operation so that the user-space app can access the IO memory, which resides on the PCIe device, with _mm256_stream_si256.

3. The user program keeps writing 64-byte or 24-byte data streams into the IO memory, and the data regions are aligned on 64/24-byte boundaries.

4. When the user app writes a 64-byte burst of data into the IO memory, it calls the non-temporal intrinsic _mm256_stream_si256.

5. The CPU, i5 / i7, has 10 WC buffers, each 64 bytes long.

6. If possible, I want to avoid the use of memory barriers, which would degrade the performance of the IO burst write operation.  (A sketch of the write loop I have in mind follows this list.)
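Here is that sketch (all names are illustrative; the driver is assumed to have already exported the device BAR through mmap):

    #include <immintrin.h>
    #include <stdint.h>

    /* Minimal sketch (illustrative names only): 'wc_base' is assumed to be
     * the pointer returned by mmap() on the device node, backed by the
     * driver's ioremap_wc() mapping.  Both 'src' and 'wc_base' are assumed
     * to be 32-byte aligned. */
    static void burst_write_64(volatile uint8_t *wc_base, const uint8_t *src)
    {
        __m256i lo = _mm256_load_si256((const __m256i *)src);
        __m256i hi = _mm256_load_si256((const __m256i *)(src + 32));

        /* Two 32-byte non-temporal stores fill one 64-byte WC buffer. */
        _mm256_stream_si256((__m256i *)wc_base, lo);
        _mm256_stream_si256((__m256i *)(wc_base + 32), hi);
    }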

Here are my basic questions:

1. Since _mm256_stream_si256 is a non-temporal instruction, it causes weak ordering.  If this IO memory region is write-only, can this really cause reordering of the PCIe write transactions?  When can this happen?  Can I have an example that causes reordering in this scenario?  Assume that the burst-write IO memory addresses increase in 64-byte-aligned steps when 64-byte burst writes are issued and in 24-byte-aligned steps when 24-byte burst writes are issued.

2. If two _mm256_stream_si256 instructions are used for a 64-byte burst write, what size of atomicity can be guaranteed?

3. If memcpy or pointer operations are used instead of the _mm256_stream_si256 intrinsic, is there a greater chance of reordering and a smaller chance of write combining?

4. Is a memory barrier unavoidable if I want to eliminate out-of-order issues on PCIe transactions?

5. For the PAT setup, does an IO burst write from the user app through mmap behave the same whether the region is set with the write-combine flag or the uncached flag?

Thanks!

McCalpinJohn
Honored Contributor III

These issues are discussed fairly clearly in Section 11.3.1 of Volume 3 of the Intel Architectures Software Developer's Manual (document 325384-049, February 2014).   This makes it clear that writes can appear out of order when the Write-Combining memory type is used for system memory.  Memory-mapped IO is not mentioned explicitly, but I interpret the text as implying that the same out-of-order behavior can occur any time the write-combining buffers are used.  

It is important to note that the IO device must be able to accept the data either as a full 64 Byte write or as a combination of smaller PCIe writes.  There are two issues here:

  1. Even with 32 Byte AVX stores it takes 2 stores to fill a write-combining buffer.  If an interrupt occurs between the two stores to a single 64 Byte aligned region, the write-combining buffers will be flushed while only partially filled.  (Note that the architecture reserves the right to flush the buffers whenever it wants and in whatever order it wants, so disabling interrupts is not enough to prevent partial buffer flushes.)
  2. The architecture does not specify what size transactions are used when a partially filled write-combining buffer is flushed, so the device needs to be able to handle any combination.  On a specific platform you may find that only certain kinds of "short" writes are used to flush a partially full buffer, but since this is not architecturally specified, it may be completely different with a different processor model.

The specific issue of ensuring ordering to write-combining memory regions is discussed in an Intel White Paper:
http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/pcie-burst-transfer-paper.pdf

One question that needs to be addressed is why you need the transactions to appear in order on the PCIe interface?   Write-combining is usually used in cases for which the address being written provides the disambiguation between transactions, and if ordering is required it is typically only at the end of a block transfer.   In that case, for example, an SFENCE is only required between the last of the stores for the bulk transfer and whatever store has the side effect of telling the IO device that the block transfer is complete.  (Even then it may not be needed if the "finalizing" store is to an address in a memory range marked UC, since UC accesses should always be serialized with respect to all preceding and following memory references.)
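As a sketch of that pattern (all names are illustrative; "doorbell" stands for whatever store has the side effect of telling the device that the transfer is complete):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative only: push n_blocks 64-byte blocks through a WC-mapped
     * region, then fence once before the single "go" store. */
    static void post_block_transfer(volatile uint8_t *wc_dst, const __m256i *src,
                                    size_t n_blocks, volatile uint32_t *doorbell)
    {
        for (size_t i = 0; i < 2 * n_blocks; i++)
            _mm256_stream_si256((__m256i *)(wc_dst + 32 * i), src[i]);

        /* One SFENCE between the bulk stores and the store that signals
         * completion -- not one fence per block. */
        _mm_sfence();

        *doorbell = (uint32_t)n_blocks;   /* device may now consume the data */
    }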

CHUL_K_
Beginner

John,

I searched this subject online and found several articles and discussions about Write Combining, but there isn't much mention of out-of-order behavior when the CPU accesses IO memory.  According to the Intel articles and your description, it seems to me that there is no way to eliminate the memory fence, regardless of how the user app accesses IO memory, due to the uncertain nature of the weak ordering produced by non-temporal load/store instructions.

Thanks for your response!  It is very helpful !!!

CHUL_K_
Beginner

Hi John,

"One question that needs to be addressed is why you need the transactions to appear in order on the PCIe interface?   Write-combining is usually used in cases for which the address being written provides the disambiguation between transactions, and if ordering is required it is typically only at the end of a block transfer.   In that case, for example, an SFENCE is only required between the last of the stores for the bulk transfer and whatever store has the side effect of telling the IO device that the block transfer is complete.  (Even then it may not be needed if the "finalizing" store is to an address in a memory range marked UC, since UC accesses should always be serialized with respect to all preceding and following memory references.)"

Here is one of my questions about reordering:

Let's assume that the 64 bytes of data contain two parts.  The first part is the pure data, and the latter part contains a flag that indicates whether the data is valid or not.  How a partial write to IO memory is handled over PCIe may depend on the PCIe controller design, but I want to know how the CPU handles WC in this case.  Is there any chance the data could be written to the PCIe device in reverse order?  Assume that the PCIe device isn't that intelligent in terms of ordering.  From the CPU's perspective, are the IO transactions always serialized in time, in the order the user app issued them?

Many thanks!

McCalpinJohn
Honored Contributor III

Section 11.3.1 of Vol 3 of the Intel Architectures Software Developer's Manual makes it very clear that WC writes are weakly ordered both between 64-Byte regions and within 64-Byte regions.  To quote (emphasis mine):

The WC memory type is weakly ordered by definition. Once the eviction of a WC buffer has started, the data is subject to the weak ordering semantics of its definition. Ordering is not maintained between the successive allocation/deallocation of WC buffers (for example, *writes to WC buffer 1 followed by writes to WC buffer 2 may appear as buffer 2 followed by buffer 1 on the system bus*). When a WC buffer is evicted to memory as partial writes there is no guaranteed ordering between successive partial writes (for example, *a partial write for chunk 2 may appear on the bus before the partial write for chunk 1 or vice versa*).

It appears that there are some guarantees of atomicity, but they are not documented precisely here.  Again quoting from 11.3.1:

The only elements of WC propagation to the system bus that are guaranteed are those provided by transaction atomicity. For example, with a P6 family processor, a completely full WC buffer will always be propagated as a single 32-bit burst transaction using any chunk order. In a WC buffer eviction where data will be evicted as partials, all data contained in the same chunk (0 mod 8 aligned) will be propagated simultaneously. Likewise, for more recent processors starting with those based on Intel NetBurst microarchitectures, a full WC buffer will always be propagated as a single burst transaction, using any chunk order within a transaction. For partial buffer propagations, all data contained in the same chunk will be propagated simultaneously.

The phrase "transaction atomicity" does not appear in any Intel documentation that I can find, except in essentially this same sentence.  

Fortunately there is some description of atomicity in Section 8.1.1 of Vol 3 of the SW Developer's Manual.  Since this text seems to be authoritative, I will quote it here:

8.1.1 Guaranteed Atomic Operations

The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:
• Reading or writing a byte
• Reading or writing a word aligned on a 16-bit boundary
• Reading or writing a doubleword aligned on a 32-bit boundary

The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:
• Reading or writing a quadword aligned on a 64-bit boundary
• 16-bit accesses to uncached memory locations that fit within a 32-bit data bus

The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically:
• Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line

Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel® Atom™, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors provide bus control signals that permit external memory subsystems to make split accesses atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be avoided.

An x87 instruction or an SSE instructions that accesses data larger than a quadword may be implemented using multiple memory accesses. If such an instruction stores to memory, some of the accesses may complete (writing to memory) while another causes the operation to fault for architectural reasons (e.g. due an page-table entry that is marked “not present”). In this case, the effects of the completed accesses may be visible to software even though the overall instruction caused a fault. If TLB invalidation has been delayed (see Section 4.10.4.4), such page faults may occur even if all accesses are to the same page.

So Intel is clear that no write larger than 8 Bytes is guaranteed to be atomic.  This suggests that it is possible that even a single 16-Byte or 32-Byte store to a write combining buffer can become visible (in this case by a premature buffer flush) in parts.  The text above suggests no guarantee that the parts will become visible in any particular order in this case -- e.g., a 16 Byte aligned store of a pair of double-precision values could become externally visible as either just the first element or just the second element.  Although the last section is deliberately as vague as possible, I am reasonably certain that it does not imply that the "multiple memory accesses" that might be used to implement SSE (or AVX) stores have the possibility to break atomicity of the fundamental data types.  In other words, even if the transaction is broken up, you will never get a case in which (for example) the upper 32 bits of a 64-bit double would become visible while the lower bits did not become visible (provided that the store was at least 64-bit-aligned).

Unfortunately we are not finished here.  To fully understand the issues related to trying to use write-combining to generate 64-Byte blocks that combine "data" and "flag" information, we also need to understand the ordering model used by the core.  Memory ordering is discussed in a fair degree of detail in section 8.2 of Vol 3 of the SW Developer's Manual.   In this case there is only one really important principle, and that is that writes are not reordered with respect to other writes except in three cases (quoting from section 8.2.2):

Writes to memory are not reordered with other writes, with the following exceptions:
— writes executed with the CLFLUSH instruction;
— streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and
— string operations (see Section 8.2.4.1).

Since we are considering writes within a single 64 Byte (aligned) block, none of the caveats apply and we can assume that the processor will always execute stores into the write-combining buffer in program order.  As discussed above, this does not mean that the stores will become *externally visible* in program order, but the fact that they are executed in program order might still be helpful -- or it might not -- I am still working on the details.  The idea here is that the external device expects 64 byte buffers to be written, so it has to refrain from making decisions based on the flag variable until all 64 Bytes in a buffer have been received.  It is straightforward to imagine how such a buffer would be constructed for a fully programmable device like an FPGA -- the controller would maintain a count of the bytes received for each 64 Byte block in the buffer region and would maintain a separate array of "valid" bits for each byte in the line -- setting the bits to "valid" on receipt of each byte and clearing all valid bits when the data is read from the buffer (or when a partial write appears to overwrite a portion of the buffer).  The issues are trickier if you are writing to ordinary memory on the device, since you won't have any way to track the "valid" bits in hardware.
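To make that bookkeeping concrete, here is a software model of the receiver-side tracking (purely illustrative -- a real device would implement the equivalent in hardware):

    #include <stdint.h>
    #include <string.h>

    #define BLOCK_BYTES 64

    /* One 64-byte block plus a bitmask recording which bytes have arrived. */
    struct wc_block {
        uint8_t  data[BLOCK_BYTES];
        uint64_t valid;                  /* one "valid" bit per byte received */
    };

    /* Called for each (possibly partial) PCIe write landing in this block. */
    static void block_write(struct wc_block *b, unsigned off,
                            const uint8_t *src, unsigned len)
    {
        memcpy(&b->data[off], src, len);
        for (unsigned i = 0; i < len; i++)
            b->valid |= 1ULL << (off + i);
    }

    /* The device acts on the block only when all 64 bytes are present,
     * then clears the mask so the slot can be reused. */
    static int block_complete(struct wc_block *b)
    {
        if (b->valid != ~0ULL)
            return 0;
        b->valid = 0;
        return 1;
    }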

It is still possible to guarantee correctness using a couple of approaches.  I call one of these the "invalid encodings" approach.

The idea of "invalid encodings" is that the data will be initialized to values that are internally (i.e. within each 64-bit aligned value) marked as invalid. You begin by writing these values to the buffer, taking care not to write all 64 Bytes (because you don't want the buffer to flush).  You then overwrite these values with the actual data, then you write the flag.   Since the stores are executed by the processor in order, even if a prematurely flushed buffer shows up at the device, the bytes that are not updated will show up as these "invalid encodings".  Examples (a code sketch of such sentinel values follows this list):

  • For 32-bit or 64-bit floating-point data it is often the case that many of the valid encodings for "NaN" (Not a Number) are not generated by the hardware.  You can pick one or more of these to indicate data that has not been updated.
  • For 64-bit pointers, the hardware requires that values in "canonical" form have the upper 16 bits as all 0's or all 1's.  Obviously any non-canonical encoding can be used by software to indicate a data item that has not been updated.
  • For 32-bit or 64-bit signed integer data, my preference is to reserve the most negative value as "invalid".  Although it is possible that this value will exist in data, it is rather unlikely, given that most hardware+software environments do not support hardware detection of underflow/wraparound of negative numbers -- so people don't write code that lets integral values get all the way to the limits.  Picking the most negative value as "invalid" also has the nice property of making the allowable range of positive and negative numbers symmetric.  (This latter point probably makes no difference in the real world, but it seems an aesthetic improvement.)
  • For packed integer data of smaller sizes I can't think of a general solution.
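As a sketch, such sentinel values might look like the following (the particular NaN payload and pointer bit pattern are arbitrary choices of mine, not anything Intel-defined):

    #include <stdint.h>
    #include <string.h>

    /* Illustrative "invalid encoding" sentinels for 64-bit slots. */

    /* A quiet NaN with a payload that the FP hardware will not generate. */
    static inline double invalid_double(void)
    {
        uint64_t bits = 0x7FF8DEADBEEF0001ULL;   /* arbitrary QNaN payload */
        double d;
        memcpy(&d, &bits, sizeof d);
        return d;
    }

    /* A non-canonical pointer: upper 16 bits neither all 0s nor all 1s. */
    #define INVALID_PTR_BITS 0x5A5A000000000000ULL

    /* The most negative value, reserved as "invalid" for signed integers. */
    #define INVALID_I64 INT64_MIN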

There are two approaches for dealing with the flag bit.  The easiest is to reserve multiple bytes for the flag bit, so that you can clear the flag without filling the buffer.  E.g.,

  • Clear the "valid" bit with an 8-bit write to the buffer.
  • Write "invalid encodings" to all the data locations (e.g., 7 doubles or 15 floats).
  • Write the actual data to all the data locations.
  • Set the "valid" bit with a 64-bit write (for an array of 7 doubles) or a 32-bit write (for an array of 15 floats).
    • This final write completes the writes to all 64 Bytes of the buffer, so it will cause an immediate buffer flush to the target.

Because the stores into the buffer occur in program order, there is no case in which the buffer can be prematurely flushed and contain "stale" data (i.e., valid encodings from a prior iteration of the communication) and a "valid" flag. 
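In code, that sequence might look like this for a block of 7 doubles plus an 8-byte flag word (a sketch only, reusing invalid_double() from above; wc_block is assumed to point at a 64-byte-aligned location in the WC mapping):

    #include <stdint.h>

    #define SLOTS 7   /* 7 doubles (56 bytes) + 8-byte flag = 64 bytes */

    static void send_block(volatile double *wc_block, const double *data)
    {
        volatile uint8_t  *flag8  = (volatile uint8_t  *)&wc_block[SLOTS];
        volatile uint64_t *flag64 = (volatile uint64_t *)&wc_block[SLOTS];

        *flag8 = 0;                          /* 1. clear "valid" with an 8-bit write */
        for (int i = 0; i < SLOTS; i++)      /* 2. invalid encodings: only 57 of the */
            wc_block[i] = invalid_double();  /*    64 bytes written, so no full flush */
        for (int i = 0; i < SLOTS; i++)
            wc_block[i] = data[i];           /* 3. the real data, stored in order */
        *flag64 = 1;                         /* 4. completes all 64 bytes and triggers
                                              *    the buffer flush to the target */
    }

(These are ordinary stores; the WC memory type of the mapping provides the combining.)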

The other approach to the flag bit is to assume that "valid" alternates between zero and one for alternating transactions.  This way you don't have to start by clearing it, since the previous value in the buffer is now considered to mean "invalid".  A premature buffer flush will not update the memory location holding the flag (since it has not been written to the buffer yet, those bytes cannot be written to the target), so the worst that can happen is that the target receives a full set of valid data but does not act on it because the buffer was flushed right before the flag was set to the new "valid" state.
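A sketch of that alternating-flag variant (again illustrative; the receiver treats a flag equal to the previous phase value as "not yet valid"):

    /* Alternating-phase variant of send_block() above. */
    static uint64_t phase;   /* toggles 0 -> 1 -> 0 -> ... per transaction */

    static void send_block_phased(volatile double *wc_block, const double *data)
    {
        volatile uint64_t *flag = (volatile uint64_t *)&wc_block[SLOTS];

        for (int i = 0; i < SLOTS; i++)
            wc_block[i] = data[i];   /* 56 bytes: buffer cannot flush as complete */

        phase ^= 1;                  /* the new "valid" value for this transaction */
        *flag = phase;               /* final 8 bytes fill and flush the buffer */
    }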

With the "invalid encodings" approach it may appear that you don't need a flag (and you really don't), but you might as well use one because you can't write "invalid" data to the entire buffer or it will flush itself to the target -- wasting everyone's time.

There are probably some subtle mistakes in this last discussion -- this is the kind of work that requires very careful review and testing, and it is time for lunch.

CHUL_K_
Beginner

Hi John,

Your answers and suggestions are wonderful!  They help guide me in the right direction.

If I summarize your statements and the Intel documents: it is guaranteed that access to a fundamental data type, like a 64-bit long long integer, is atomic, and that write instructions are executed and become visible in program order except in three cases.  One of the exceptions is the use of non-temporal instructions.  The reason to use a non-temporal store instruction is to bypass the cache, since the IO memory is a non-cacheable, write-only region.  However, it is known that these instructions can cause weak ordering when WC is enabled.  So, whether one uses non-temporal instructions or not, there are good and bad points either way.

I also thank you for your suggestions.  The reason I asked these questions is to understand clearly how out-of-order behavior arises in the CPU, what its impacts and issues are, and whether there is any possibility of avoiding memory barriers during burst IO writes.  Since the reordering happens in the WC buffers, I guess there may be a kind of internal timeout that flushes them.  This timeout is not a real timer but rather a lack of available internal WC buffers, because the processor allocates a new WC buffer whenever a write address does not fall within any of the pending WC buffers, including the current one.  If my assumption is correct, each WC buffer holds data for a consecutive, 64-byte-aligned address range, so a write stream may be split up if the next write address is outside the range that buffer covers.  In that case, which buffer gets flushed first may depend on the internal CPU design, and this causes out-of-order behavior.  I can then also account for the side effects and issues when a context switch occurs, but this is still my assumption.

Thank you so much!

McCalpinJohn
Honored Contributor III

IMPORTANT NOTE:  I should have been more clear that my assumption that the processor actually performs the stores in program order is an assumption, not a fact.  The architecture guarantees that stores become visible in program order, but that does not mean that an implementation has to execute them internally in order.  Executing them internally in order is the easiest way to ensure that they can only become visible in order, but it is not the only way that this feature could be implemented.

My description of a way to work around the weak ordering rules of the WC memory type is based on the assumption that data gets written into a single WC buffer in program order, but that data might get flushed to the target in pieces or in a different order.   Even if the implementation is exactly the same, this means that there will be differences between WC stores to memory and WC stores to an IO device.

First consider what happens with WC stores to memory.  The normal case is a 64-Byte write to memory.  The memory controller simply computes the appropriate ECC bits and writes the line to the DRAM -- overwriting whatever data was there before.   If the WC buffer is flushed prematurely, the processor will send one or more shorter writes to the memory controller.  The memory controller will have to read the existing data in that cache line, merge in the new bytes, compute the new ECC bits, and write the data back to DRAM.  The memory controller may buffer the line to allow multiple partial cache line updates to be completed before doing the actual write back to DRAM, but that does not change the visibility of the data.  The protocol between the core and the memory controller is not publicly documented, but is very likely to include a rich enough set of transactions to enable atomic transfers of aligned values up to 8 Bytes long --- otherwise a native variable could show up at the memory in pieces, which would break the atomicity guarantees of Section 8.1.1.   So what another processor will see is not completely arbitrary -- it is limited to some combination of the previous contents of the line and a potentially random subset of the updates to the line (where the updates are assumed to be atomic for aligned variables up to 8 bytes).

Next consider what happens with WC stores to a memory-mapped IO device.  In the normal case, the flush of a full WC buffer results in a 64-Byte aligned write on the PCIe bus.   If the WC buffer is flushed prematurely, the processor's IO controller will generate one or more PCIe write transactions for the parts of the WC buffer that have been written.  It is important to note that the processor will only write the bytes that the user has already written -- bytes that have not yet been written to the WC buffer will not be included in the PCIe writes.   If the IO device to which this data is being written has its own processor, then opportunities exist to create new protocols.  For example, the processor on the IO device could clear the buffer memory (or fill it with "invalid encodings") after reading it (but before acknowledging to the host that the message had been received).  This might allow the development of a protocol that does not require separate "valid" flags, thus allowing the host processor to use the full 64 Bytes for data, even if the IO device does not have the ability to count the number of bytes written to each 64-Byte block.  (Having a PCIe bus analyzer would make the development of such protocols much easier -- there are a fair number of uncertainties and assumptions in this discussion and it would be very helpful to see what is actually happening on the bus.)
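As a sketch of that device-side idea, written as plain C for clarity (a real implementation would live in the device's processor or firmware; everything here is made up for illustration):

    #include <stdint.h>
    #include <string.h>

    #define MSG_BYTES 64

    /* Consume one 64-byte message, then refill the slot with an "invalid"
     * pattern *before* acknowledging the host, so that a later premature
     * partial flush can never be mistaken for a complete message. */
    static void device_consume(volatile uint8_t *slot, uint8_t *out)
    {
        memcpy(out, (const uint8_t *)slot, MSG_BYTES);   /* read the message */
        memset((uint8_t *)slot, 0xA5, MSG_BYTES);        /* invalidate the slot */
        /* ... acknowledge receipt to the host here ... */
    }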

Appendix:

The PCIe specification defines the types of store transactions available, with the fundamental transfer type being one or more 4-Byte blocks on a 4-Byte boundary.  This is clearly sufficient to support atomic transfers of 4-byte and 8-byte naturally aligned values, but requires multiple transactions if the WC buffer contains multiple 4-Byte or 8-Byte fields that are not contiguous.  (This may be one reason why it is recommended that the WC buffer be written in contiguous order.)  The PCIe standard includes limited support for "byte enable" fields, which must be used when the data in the WC buffer being flushed is composed of discontiguous byte fields or fields that do not start on 4-Byte boundaries.  Again, the PCIe transaction types are sufficient, but when combined with the processor atomicity rules, the required transactions may not be the most efficient.

CHUL_K_
Beginner

Hi John,

This is a really good explanation of the WC subject, exactly what I was looking for.

I think I now have enough understanding of the WC subject and have learned more details of the WC buffer handling on Intel CPUs.

Now I can imagine that there is more complex behavior, like branch prediction and speculation, interacting with the WC buffers.

This WC knowledge will guide my project in the right direction.

Thanks again!!!

Abraham__Alan
Beginner

Hi John,

Is there any way to guarantee that the WC buffer is not prematurely flushed?  In line with what you have explained above, is it possible to do this without using any valid bits?

Additionally, I have read the following in the latest optimization manual, page 338, section 8.6.1:

You can declare small SWWC buffers (a cache line for each buffer) in your application to enable explicit write-combining operations. Instead of writing to non-temporal memory space immediately, the program writes data into SWWC buffers and combines them inside these buffers. The program only writes a SWWC buffer out using non-temporal stores when the buffer is filled up, that is, a cache line. Although the SWWC method requires explicit instructions for performing temporary writes and reads, this ensures that the transaction on the front-side bus causes line transaction rather than several partial transactions. Application performance gains considerably from implementing this technique.

Here it mentions that the SWWC technique ensures full writes (i.e., 64 bytes). Am I correct in my understanding here?

If so, how is this achieved? I did not find much information regarding SWWC buffers and how to use them to achieve this. 

Appreciate your help here.

Regards,

Alan

McCalpinJohn
Honored Contributor III

The wording of that section in the optimization reference manual is probably too strong.  Intel's hardware documentation makes it clear that the architecture does not guarantee that WC buffers will *never* be flushed while partially full.  The approach described will reduce the number of partial WC flushes to a very small fraction, but it is not guaranteed to be zero.   The most common cause of flushing incomplete WC buffers is interrupts, which cannot be completely eliminated.
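For what it is worth, the SWWC technique the manual describes can be sketched as follows (illustrative names; this assumes the caller's write sizes evenly divide the 64-byte line):

    #include <immintrin.h>
    #include <stdint.h>
    #include <string.h>

    /* Software write-combining (SWWC) sketch: accumulate small writes in a
     * cacheable, cache-line-sized staging buffer, then emit the full line
     * with non-temporal stores.  This makes partial flushes of the hardware
     * WC buffer rare -- but, as noted above, not architecturally impossible. */
    struct swwc {
        uint8_t  buf[64] __attribute__((aligned(64)));
        unsigned fill;                     /* bytes accumulated so far */
    };

    static void swwc_flush(struct swwc *s, volatile uint8_t *wc_dst)
    {
        __m256i lo = _mm256_load_si256((const __m256i *)&s->buf[0]);
        __m256i hi = _mm256_load_si256((const __m256i *)&s->buf[32]);
        _mm256_stream_si256((__m256i *)wc_dst, lo);
        _mm256_stream_si256((__m256i *)(wc_dst + 32), hi);
        s->fill = 0;
    }

    static void swwc_put(struct swwc *s, volatile uint8_t *wc_dst,
                         const void *src, unsigned len)
    {
        memcpy(&s->buf[s->fill], src, len);   /* cacheable staging buffer */
        s->fill += len;
        if (s->fill == 64)                    /* only full lines go out */
            swwc_flush(s, wc_dst);
    }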

It is *possible* that processors implementing the AVX512 instruction set will show atomic behavior if only aligned 512-bit non-temporal stores are used, but that would be an implementation detail, and not an architectural guarantee.  (I.e., Intel could change the behavior in other AVX512-compatible processors, and even change the behavior in your processor via a microcode update.)

sshai17
Beginner

Thank you for the information..!!!
