Write Combining Buffer Out of Order Writes and PCIe


Hello all,

I have a program that sends a device (an FPGA) a set of 8-byte qwords.  It does this using only qword stores into the I/O-mapped memory region, which is marked as WC.  Everything appears fine when all of the qword stores occur in order with no gaps: the device receives all PCIe Write TLPs with base addresses that are 8-byte aligned.  Below is the assembly code that writes 6 qwords contiguously into the WC region (whose base address is 64-byte aligned and stored in the rdx register) and then flushes the WCB with an sfence:

movq %rcx, (%rdx)
movq %r10, 8(%rdx)
movq %r9, 16(%rdx)
movq %r8, 24(%rdx)
movq %r11, 32(%rdx)
movq %rax, 40(%rdx)
sfence

~99% of the time the device receives these 6 qwords in a single 48-byte Write TLP.  The other ~1% of the time the WCB is flushed prematurely (presumably due to an interrupt, etc.) and the 6 qwords arrive in more than one Write TLP, with all of the base addresses being 8-byte aligned, as would be expected since there are no writes to anything but 8-byte-aligned addresses.
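For reference, the store sequence above can be sketched in C (a sketch only; the function name `send_6_qwords` is mine, and `dst` stands in for a pointer to the real WC mapping):

```c
#include <stdint.h>
#include <immintrin.h>  /* _mm_sfence */

/* Write 6 qwords contiguously into the WC window, then flush the
 * write-combining buffer.  dst stands in for the 64-byte-aligned base
 * of the I/O-mapped WC region (rdx in the assembly above).  The
 * volatile qualifier keeps the compiler from reordering or eliding
 * the individual qword stores. */
static void send_6_qwords(volatile uint64_t *dst, const uint64_t v[6])
{
    for (int i = 0; i < 6; i++)
        dst[i] = v[i];          /* qword stores at offsets 0..40 */
    _mm_sfence();               /* flush the WC buffer toward the device */
}
```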

Now consider what happens if the stores are reordered (by the compiler) so that they are no longer strictly contiguous, as in the assembly below:

movq %rcx, (%rdx)
movq %r10, 8(%rdx)
movq %r9, 16(%rdx)
movq %r11, 32(%rdx)
movq %r8, 24(%rdx)
movq %rax, 40(%rdx)
sfence

Here the store to the 4th qword (at offset 24) has been moved after the store to the 5th qword (at offset 32).  This in effect creates an 8-byte "hole" at offset 24 if an interrupt were to occur after the 5th qword write and a partial WCB eviction occurred.  In this case, I would have hoped to get two Write TLPs: one of 24 bytes covering the first 3 qwords and another of 8 bytes for the qword at offset 32 (I think this could end up in as many as 4 TLPs).  I'd then expect to get one or two more Write TLPs for the remaining two qwords.  Regardless of how it's broken up, I'd expect to always get Write TLPs with base addresses aligned to 8 bytes, containing at least one qword of data.  I am basing a lot of this on my understanding of the Intel® 64 and IA-32 Architectures Software Developer's Manual - Volume 3 - Section 11.3.1, "Buffering of Write Combining Memory Locations".
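To make that expectation concrete, here is a small model (my own, not from any Intel document) of my reading of the SDM's 8-byte "chunk" rule: given a mask of which of the 8 qword slots in the 64-byte WC buffer were written, it counts the maximally-coalesced runs of valid chunks, i.e. the fewest 8-byte-aligned Write TLPs a partial eviction should produce if chunks are only ever merged, never split:

```c
#include <stdint.h>

/* Model of the SDM Vol. 3, Section 11.3.1 chunk rule (my reading):
 * on a partial WC-buffer eviction, data goes out one 8-byte chunk at
 * a time, and adjacent valid chunks may be coalesced.  Bit i of
 * valid_qwords means the qword at offset 8*i was written.  Returns
 * the number of contiguous runs of valid chunks -- the fewest
 * 8-byte-aligned Write TLPs the eviction should need. */
static int min_write_tlps(uint8_t valid_qwords)
{
    int runs = 0, in_run = 0;
    for (int i = 0; i < 8; i++) {
        if (valid_qwords & (1u << i)) {
            if (!in_run)
                runs++;         /* a new run of valid chunks begins */
            in_run = 1;
        } else {
            in_run = 0;         /* the "hole" breaks the run */
        }
    }
    return runs;
}
```

For the reordered example, an eviction after the 5th store leaves qwords 0-2 and 4 valid (mask 0x17), giving two runs: the 24-byte TLP and the 8-byte TLP at offset 32.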

However, when I run the example that has the non-contiguous qword stores, I sometimes see a Write TLP arrive that has a base address that is not 8-byte aligned, but 4-byte aligned.  I've run the test on both a Haswell and Broadwell-E box.  Interestingly I do not see any non 8-byte aligned Write TLPs on Haswell, but I do very consistently see the problem on the Broadwell-E machine.

Do you have any idea as to why these 4-byte (non-8-byte) aligned Write TLPs would be generated when I'm only ever writing 8-byte qwords to 8-byte-aligned addresses?  I've pored over the Intel documentation and whatever else I could find online.  Some sources say to avoid holes/non-contiguous writes, but they only cite possible performance impacts (like more TLPs), nothing like this.  I appreciate any insight you may have.

Thanks,

Brandon



This is a bit weird, but in line with the PCIe standard (which requires a minimum alignment of 4 Bytes), and with Intel's cautions not to make any assumptions about the specific types of PCIe transactions that will be used when WC buffers are flushed.   

The presence of 4-byte-aligned addresses on the Broadwell-E box is probably some historical artifact of the implementation.   These might also occur on other processors with different address patterns --- it would be nearly impossible to rule out the possibility without access to the processor logic (and might be challenging even then!).

"Dr. Bandwidth"

In researching a related topic, I ran across a post I made in 2014: https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring...

In that post, I quoted from Section 11.3.1 of Volume 3 of the Intel Architectures SW Developer's Manual, and I just double-checked the most recent version to verify that the same wording applies.   Two quotes from that section:

If one or more of the WC buffer’s bytes are invalid (for example, have not been written by software), the processor will transmit the data to memory using “partial write” transactions (one chunk at a time, where a “chunk” is 8 bytes).

[...]

In a WC buffer eviction where data will be evicted as partials, all data contained in the same chunk (0 mod 8 aligned) will be propagated simultaneously.

These quotes certainly seem to say that aligned 8-Byte writes into a Write-Combining memory area should only appear in transactions with payloads that are multiples of 8 Bytes (with corresponding 8-Byte alignment).  

On the other hand, the newest architecture mentioned in this section is "NetBurst", so it is possible that the behavior was changed without updating the documentation.

"Dr. Bandwidth"

Hi John,

First of all, thanks for the quick and useful response.  I just have a couple of questions.

In regards to this statement from your first comment:

This is a bit weird, but in line with the PCIe standard (which requires a minimum alignment of 4 Bytes), and with Intel's cautions not to make any assumptions about the specific types of PCIe transactions that will be used when WC buffers are flushed.

Can you cite any Intel documentation or whitepaper that makes a statement to the effect that no assumptions can be made about the PCIe Write TLPs that will be generated (beyond the minimum requirement of 4-byte-aligned base addresses)?  I haven't been able to find anything, but then again I'm searching through thousand-page manuals and I may not be using the right keywords.

It would seem that the worst case scenario in a partial WCB eviction would be a partial write transaction for each 8-byte chunk according to this excerpt from Vol 3 - Section 11.3.1, which you cite in your comment:

If one or more of the WC buffer’s bytes are invalid (for example, have not been written by software), the processor will transmit the data to memory using “partial write” transactions (one chunk at a time, where a “chunk” is 8 bytes).

This will result in a maximum of 4 partial write transactions (for P6 family processors) or 8 partial write transactions (for the Pentium 4 and more recent processors) for one WC buffer of data sent to memory.

Section 11.3.1 then goes on in the last paragraph to say (which you also cite in your comment):

The only elements of WC propagation to the system bus that are guaranteed are those provided by transaction atomicity. For example, with a P6 family processor, a completely full WC buffer will always be propagated as a single 32-bit burst transaction using any chunk order. In a WC buffer eviction where data will be evicted as partials, all data contained in the same chunk (0 mod 8 aligned) will be propagated simultaneously. Likewise, for more recent processors starting with those based on Intel NetBurst microarchitectures, a full WC buffer will always be propagated as a single burst transactions, using any chunk order within a transaction. For partial buffer propagations, all data contained in the same chunk will be propagated simultaneously.

Honestly, I'm not sure what to make of all of this, but I think we agree that it seems to read that an 8-byte level of "chunk" atomicity can be relied upon.  So with all this talk about everything in the processor being 8-byte oriented at a minimum, the question is where it gets split up into a 4-byte PCIe Write TLP.  Would you happen to know the particular unit in the system (CPU or otherwise) that takes these "partial write" bus transactions of 8-byte "chunks" and turns them into what will become the PCIe Write TLP?  Presumably there is a difference in this unit between Haswell and Broadwell.

Thanks again,

Brandon


My statement about "Intel's cautions not to make any assumptions about the specific types of PCIe transactions that will be used when WC buffers are flushed" was my high-level summary based primarily on my experience with transaction ordering (i.e., the 2nd to last paragraph of section 11.3.1 of Volume 3 of the SWDM), and less about granularity or alignment.    I had forgotten about the 8-Byte "chunksize" comments that I quoted in my second response, and I don't know how to reconcile the 8-Byte chunksize statements with your observations.   My interpretation of the "rules" for WC transactions comes primarily from Sections 11.3.1 and 8.1.1 ("Guaranteed Atomic Operations") of Volume 3 of the SWDM.  Sometimes there are additional useful details in the instruction descriptions from Volume 2 of the SWDM, but in this case I don't see anything that would override Volume 3 of the SWDM.

I would be interested to hear more about the transaction types and ordering that you are seeing on the Broadwell-E in the case with the re-ordered stores:

  • Does this still usually send all 48 Bytes in a single transaction?
  • When it is not sending all 48 Bytes in a single transaction, what are the transactions used?
    • Does it ever coalesce the first 24 Bytes into a single transaction?
  • When it sends a 4-Byte-aligned transaction
    • is this a 4 Byte payload?
    • does it send the other 4 Bytes of that 8-Byte "chunk" in a separate 4-Byte payload (but 8-Byte-aligned)?
  • Do you ever see the "Byte enable" fields used?

When you have even numbers of 8-Byte fields, it might be helpful to pack these into SIMD registers and use SIMD stores. 

  • This reduces the number of store instructions by a factor of 2 or 4, depending on the payload size and alignment.
    • With AVX-512 it will allow a full cache line with a single store. 
      • This might be implemented in a way that provides atomicity -- even if that is not guaranteed. 
      • It could be tested now on Xeon Phi x200 (Knights Landing), but that is probably not a very interesting host.
      • Presumably this will be supported by the "Skylake Xeon" when it appears -- probably some time this year.
  • I typically use the "non-temporal" versions of the MOV (store) instructions in this case as a reminder that these are stores to a write-combining region.
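The SIMD suggestion above might look like the following, using baseline-x86-64 SSE2 (a sketch under my own assumptions: the function name is invented, and `dst` stands in for the 16-byte-aligned WC mapping):

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_loadu_si128, _mm_sfence */

/* Send 6 qwords as three 16-byte non-temporal stores instead of six
 * 8-byte MOVs.  dst must be 16-byte aligned (it is, if the WC window
 * is 64-byte aligned as in the original post).  The streaming stores
 * halve the store count and document the write-combining intent; the
 * sfence flushes the WC buffer. */
static void send_6_qwords_sse2(void *dst, const uint64_t v[6])
{
    __m128i *p = (__m128i *)dst;
    for (int i = 0; i < 3; i++)
        _mm_stream_si128(p + i, _mm_loadu_si128((const __m128i *)(v + 2 * i)));
    _mm_sfence();
}
```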

If you are operating in kernel space, do you disable interrupts for this section of code?  I don't think this changes the rules, but it may change the probabilities of the different classes of transactions....

For processors based on the "server" uncore, the conversion of store instructions in a core to external PCIe write transactions should typically occur in two steps (with alternatives discussed below).   First, there is a conversion of the core's store instruction to a store instruction in the (undocumented & proprietary) protocol used by the "ring" in the uncore of the processor.  The ring transports the instruction and payload to the "R2PCIe" box, which provides the interface between the ring and IO.  The block diagrams in the Xeon E5 Uncore Performance Monitoring Guides show that the R2PCIe box is connected to the IIO box, which is connected to both the PCIe and DMI2 (Southbridge) interfaces.  The conversion of the transaction from the ring protocol to PCIe occurs somewhere in the R2PCIe -> IIO -> PCIe path.  Since these are all internal boxes, it does not really matter where this happens -- we can't see it or do anything about it. 

The general flow discussed above has two alternatives:

  1. In a multi-socket system the core generating the store may not be in the same package that the PCIe device is attached to.  In this case the "ring" transaction must be converted to QPI, transferred to the target package, then converted back to the ring protocol. 
    1. Both of these protocols are proprietary, so we can't say anything definitive, but...
    2. It seems likely that the two protocols were designed so that most transactions will be transported via 1:1 transformations.
    3. It is possible that some ring transactions don't have perfect QPI transaction matches.
      1. This could mean that some ring transactions might need to be split into multiple QPI transactions.
        1. The multiple QPI transactions might be coalesced into a single ring transaction on the target chip, or
        2. the multiple QPI transactions might generate multiple ring transactions on the target chip.
      2. It could mean that some aspects of transaction ordering are modified by a QPI "hop".
        1. For correctness, this would have to be a modification in the direction of "more strongly ordered".
        2. If the transaction on the ring on the target uncore inherits a stronger ordering than the transaction on the original ring, this could result in the generation of different PCIe transactions.
  2. The FPGA target could be attached to a PCIe interface on the Southbridge chip, rather than to a PCIe interface on the processor chip.
    1. This introduces the possibility of protocol transformations of the same class as discussed above for tunneling across QPI.

It does not look like you specified which "Haswell" box you are working with.  The "Broadwell-E" processors appear to be based on the "Broadwell-EP" (Xeon E5 v4), but with the QPI interfaces disabled and ECC memory support disabled.  If this is the case, then I would expect the transaction flow for IO to be the same as on the server parts.   If your Haswell system is based on the "client" uncore, then there could be significant differences in the transaction flow.  There is not even a guarantee that the ring protocol is identical on "client" and "server" parts, since the client parts don't have to support QPI and multi-socket cache coherence.

"Dr. Bandwidth"


Thanks for taking the time to put together all that detail and thought.  Let me see if I can answer the questions you had first:

  • Does this still usually send all 48 Bytes in a single transaction?
    • A: Yes, at least 99% of the time.
  • When it is not sending all 48 Bytes in a single transaction, what are the transactions used?
    • Does it ever coalesce the first 24 Bytes into a single transaction?
    • A: I don't know the exact transactions, as I only have some aggregate counters and assertions being reported from the FPGA.  So I have counts of how many times a Write TLP is not 8-byte aligned and how many Write TLPs were received over the entire test; it takes around a billion messages to reproduce, say, a couple dozen of these non-8-byte-aligned packets.  From the counts, it's clear that many times the 48-byte message is split into one or more Write TLPs without issue though.
  • When it sends a 4-Byte-aligned transaction
    • is this a 4 Byte payload?
      • A: Unfortunately, I'm not sure.
    • does it send the other 4 Bytes of that 8-Byte "chunk" in a separate 4-Byte payload (but 8-Byte-aligned)?
      • A: Not sure
  • Do you ever see the "Byte enable" fields used?
    • A: And again, I'm not sure

We are working on getting better diagnostic info/counters into the FPGA to answer some of these questions and give more detail on the PCIe packets that are under-aligned.

Those are good points about using wider SIMD-based stores - i.e. SSE, AVX, etc.  The intent is to optimize the software side to use more of those.

This code runs in user space, so there is no disabling of interrupts.

Those are very interesting details about how the bus transactions move into the "uncore" parts of the socket on the way to becoming a PCIe packet.  I do think there are possibly some relevant differences between our Haswell and Broadwell machines.  Here are the model details:

  • Broadwell (E): Core i7-6950X - single socket with 10 cores
  • Haswell (Devil's Canyon): Core i7-4790K - single socket with 4 cores

On the surface, both of these are marketed as "high end desktop", but I wonder if the Haswell (Devil's Canyon) is a "client" uncore and the Broadwell (E) is actually a "server" uncore (even though it's not technically branded Xeon)?

I'm going to follow up to see if there are any differences between the two machines regarding how the FPGA is connected: directly to the CPU or through the Southbridge.

Thanks,
Brandon


It looks like the Haswell Core i7-4790K uses the "client" uncore.  From the data at https://en.wikipedia.org/wiki/List_of_Intel_Core_i7_microprocessors, it looks like this is a "Haswell-DT" model.   I don't follow the non-server parts very closely, so I don't understand the details of what the "DT" in "Haswell-DT" means in comparison to prior products, but the giveaway is the 2 DRAM channels.  Parts with the "server" uncore have more than 2 DRAM channels (usually 4, sometimes 3).

"Dr. Bandwidth"

Looks like our Haswell machine is a "client" uncore (LGA 1150 socket) and our Broadwell machine is a "server" uncore (LGA 2011 socket).  The FPGA is connected directly to the processor's PCIe interface (not over DMI to the PCH) on both our Haswell and Broadwell machines.  It appears that the key difference is likely client vs. server uncore and how it translates bus transactions into PCIe packets.  These are my observations (again, all stores are qword/8-byte wide):

  • The Haswell client uncore seems to always generate 8-byte-aligned PCIe Write TLPs for non-contiguous 8-byte-aligned stores, while the Broadwell server uncore does not (i.e. it sometimes generates 4-byte-aligned TLPs).
  • Both the Haswell client uncore and Broadwell server uncore appear to always generate 8-byte aligned PCIe Write TLPs when all the 8-byte aligned stores are contiguous from low to high address.

It would be nice if there were some documentation from Intel stating what one could minimally expect for PCIe Write TLPs generated out of a WC I/O-mapped memory region.  I think the documentation is clear about the possibility of reordering (as you pointed out), but it appears to be a bit misleading about the minimal alignment/size that can be depended on, given the statements about the smallest bus transaction being an 8-byte "chunk".  If all that can be relied on is the minimal 4-byte address alignment of PCIe, it would be quite helpful to see a statement to that effect somewhere, perhaps in a section about how memory bus transactions can be translated into PCIe packets.

Thanks,

Brandon


The 4-Byte-aligned transactions are not the worst possible case, but I think they are the worst "reasonable" case.   The PCIe specification requires 4-Byte-aligned memory addresses, and uses the Byte Enable bits to disable beginning and/or ending Bytes for accesses that need a granularity and/or alignment of less than 4 Bytes.  So it would be possible to dump the WC buffer using single-Byte writes, but that would be more work than just sticking with the minimum 4-Byte alignment and granularity.   You could probably get Byte-sized transactions if you split your writes into single-Byte writes....

"Dr. Bandwidth"

Thanks for all of the help John.  We never write single bytes into the WC buffer, but if we did, I'm sure you're right that some byte enable bits could be turned off in the PCIe packet.  Having the 8-byte writes split into 4-byte PCIe packets on the Broadwell-E box was unexpected, but it doesn't appear to happen if we write in strictly ascending address order.  We probably should change the FPGA logic to support 4-byte PCIe packets to be completely safe, but that's quite a bit of work.  For the time being, I think we're going to have to perform extensive testing on any machine architecture we need to support (which we control and which, thankfully, is limited to two at the moment).
