Anil_A_1
Beginner

Make sure certain PCIe writes are 64 bytes to improve bus performance


I have a use case where the CPU (Xeon E5-2658, Haswell) has to write 64 bytes of data to a device connected over the PCIe bus. On the CPU side, a user-space application does a memcpy from a local buffer to the memory-mapped address of the device. I believe the memcpy function may be copying 8 bytes at a time, thus generating PCIe TLPs that each carry only 8 bytes of data plus the associated control overhead.

Is there a way to ensure that the 64 bytes of data are packed into a single PCIe TLP and written on the bus as one transaction?
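
For concreteness, a minimal sketch of the current access pattern, assuming the device BAR is exposed through the PCI sysfs resource file (the device path, mapping size, and buffer contents below are placeholders):

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Map the first page of the device's BAR 0 into user space. */
    int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR);
    if (fd < 0)
        return 1;

    volatile uint8_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED)
        return 1;

    uint8_t buf[64] = { 0 };

    /* The problematic copy: memcpy is free to move the data in small
     * word-sized chunks, and each chunk becomes its own PCIe write TLP. */
    memcpy((void *)bar, buf, 64);

    munmap((void *)bar, 4096);
    close(fd);
    return 0;
}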

4 Replies
Anil_A_1
Beginner

I have an Intel IPS account, but I am not sure where to raise this question in terms of product and technology.

McCalpinJohn
Black Belt

There is no way to absolutely guarantee a single 64-Byte packet, but if you use a Write-Combining memory type and issue a small number of consecutive writes (e.g., 2 32-Byte AVX/AVX2 stores) to 64 Bytes starting at a 64-Byte-aligned address, then you will get a single 64-Byte PCIe transaction *almost* all the time. (The reasons are complex, but ultimately not relevant -- the device must be able to handle partial block transfers as well as the desired full 64-Byte transfers.)
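
As an illustration, a minimal user-space sketch of this approach, assuming the 64-Byte-aligned MMIO address has already been mapped with the Write-Combining memory type (how to set that up is discussed later in this thread); the function name and the requirement that the source buffer be 32-Byte aligned are assumptions for this example:

#include <immintrin.h>

/* Write 64 bytes to a WC-mapped, 64-byte-aligned MMIO address using two
 * consecutive 32-byte AVX stores.  'src' must be 32-byte aligned. */
static inline void mmio_write_64B(volatile void *dst, const void *src)
{
    __m256i lo = _mm256_load_si256((const __m256i *)src);
    __m256i hi = _mm256_load_si256((const __m256i *)src + 1);

    _mm256_store_si256((__m256i *)dst, lo);
    _mm256_store_si256((__m256i *)dst + 1, hi);

    /* SFENCE drains the write-combining buffer so the (hopefully single)
     * 64-byte line is pushed out to the device promptly. */
    _mm_sfence();
}

As noted above, a single 64-Byte transaction cannot be guaranteed; the device still has to tolerate the occasional split into smaller writes.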

Note that the memory type depends on both the MTRR and the PAT for the address in question.  This is described in Chapter 11 of Volume 3 of the Intel Architectures Software Developer's Manual (document 325384-059, June 2016).  Table 11-7 shows six different combinations of MTRR and PAT values that result in the Write-Combining memory type.  One or more of these may be more convenient than the others in your particular situation.

Note also that the "streaming" or "non-temporal" store instructions won't generate streaming stores if the memory type is not WC (or WB, which is not allowed for MMIO regions).  The streaming/non-temporal store instructions are not required to generate write-combining even if the memory type is WC, but they are more compact, and this should reduce the probability of taking an interrupt in the middle of a sequence of stores that fill a 64-Byte write-combining buffer.  The discussion in Section 11.3.1 of Volume 3 of the SWDM needs to be read very carefully.  Some possible caveats (a streaming-store variant of the sketch above follows this list):

  • Section 11.3.1 says that a full WC buffer will be written as a single burst, but it is not clear whether this section applies to MMIO transactions.
  • Executing 2 32-Byte stores takes a minimum of 2 cycles on your Haswell processor, so it is possible that some external event will cause the write-combining buffer to be flushed after the first 32-Byte store but before the second.
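
A hedged sketch of the streaming-store variant, under the same assumptions as the earlier sketch (WC-mapped, 64-Byte-aligned destination; placeholder names):

#include <immintrin.h>

/* Same 64-byte write, but with non-temporal (streaming) stores.  The two
 * stores are issued back-to-back to keep the window for an intervening
 * interrupt as short as possible. */
static inline void mmio_write_64B_nt(volatile void *dst, const void *src)
{
    __m256i lo = _mm256_load_si256((const __m256i *)src);
    __m256i hi = _mm256_load_si256((const __m256i *)src + 1);

    _mm256_stream_si256((__m256i *)dst, lo);
    _mm256_stream_si256((__m256i *)dst + 1, hi);

    _mm_sfence();    /* drain the WC buffer toward the device */
}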

In theory, the PCIe controller could merge multiple consecutive 64-Byte transfers into a larger PCIe transfer (e.g., 128 Byte or 256 Byte, if allowed by the PCIe maximum transfer size handshaking), but I have not been able to find any documentation on whether such a feature exists or is controllable.  I have only done performance measurements for write-combining MMIO on a small number of systems, and all were consistent with a 64 Byte payload size for processor-driven writes to MMIO.

Anil_A_1
Beginner

Thanks, John, for your inputs.

Any idea how to achieve this on a Linux system, i.e., how to map the MMIO addresses, or a region of addresses, as WC?

Regards, -Anil

McCalpinJohn
Black Belt

The usual way to do this is in a device driver (that runs in the kernel).   To set up the mappings for the kernel to use, just use the "ioremap_wc()" interface and it will make sure that the MTRRs and PATs are set up correctly.     I think that recent kernels use "remap_pfn_range()" to create a mapping for user-space access to the MMIO area, but I have a great deal of trouble following all of the changes in the kernel function names and their ever-changing locations in the kernel source trees.
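
A minimal sketch of what such a driver might contain, assuming a hypothetical PCI device using BAR 0; everything except the kernel interfaces themselves (ioremap_wc(), pgprot_writecombine(), remap_pfn_range()) is a placeholder:

#include <linux/fs.h>
#include <linux/io.h>
#include <linux/mm.h>
#include <linux/pci.h>

/* Kernel-side mapping: ioremap_wc() requests the Write-Combining memory
 * type and handles the MTRR/PAT details for the kernel's own accesses. */
static void __iomem *map_bar0_wc(struct pci_dev *pdev)
{
    return ioremap_wc(pci_resource_start(pdev, 0),
                      pci_resource_len(pdev, 0));
}

/* User-space mapping: an mmap() file operation that hands BAR 0 to the
 * application with a write-combining page protection. */
static int mydev_mmap(struct file *filp, struct vm_area_struct *vma)
{
    struct pci_dev *pdev = filp->private_data;   /* stashed at open() time */
    unsigned long pfn = pci_resource_start(pdev, 0) >> PAGE_SHIFT;
    unsigned long size = vma->vm_end - vma->vm_start;

    vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
    return remap_pfn_range(vma, vma->vm_start, pfn, size, vma->vm_page_prot);
}

Depending on the kernel version, the PCI sysfs interface may also expose a resource0_wc file next to resource0, which user space can mmap() directly to get a write-combining mapping without a custom driver.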

Aside: There are some weird comments in the Linux kernel documentation (Linux/Documentation/x86/mtrr.txt) about "phasing out" MTRRs -- this is grossly misleading and confusing.  Linux can't "phase out" MTRRs -- they are part of the hardware and they have to be programmed correctly.  I think what they are trying to say is that they are "phasing out" the explicit use of the MTRR interface.  This is perfectly reasonable -- the effective memory type is determined by the combination of the MTRR and PAT settings in a very complex way, and the kernel interfaces should be based on the desired memory type, with the MTRR and PAT handling done in a consistent manner at a lower level.
 
