Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Make sure certain PCIe writes are 64 bytes to improve bus performance

Anil_A_1
Beginner

I have a use case where the CPU (a Haswell Xeon, E5-2658) has to write 64 bytes of data to a device connected over the PCIe bus. On the CPU side, a user-space application does a memcpy from a local buffer to the memory-mapped address of the device. I believe the memcpy function may be copying 8 bytes at a time, thus generating PCIe Transaction Layer Packets (TLPs) carrying only 8 bytes of data each, plus the control overhead.

Is there a way to ensure that the 64 bytes of data are packed into a single PCIe TLP and written to the bus?
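For context, the access pattern looks roughly like this (the sysfs path, map size, and BAR offset are placeholders, not my actual setup; error handling omitted):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Illustrative path: resource0 in sysfs exposes BAR 0 of the device. */
    int fd = open("/sys/bus/pci/devices/0000:03:00.0/resource0",
                  O_RDWR | O_SYNC);
    volatile char *mmio = (volatile char *)mmap(NULL, 4096,
                                                PROT_READ | PROT_WRITE,
                                                MAP_SHARED, fd, 0);
    char buf[64] = { 0 };

    /* memcpy is free to split this into smaller stores, and on an
     * uncacheable mapping each store can become its own PCIe write TLP. */
    memcpy((void *)mmio, buf, 64);

    munmap((void *)mmio, 4096);
    close(fd);
    return 0;
}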

4 Replies
Anil_A_1
Beginner

I have an Intel IPS account, but I am not sure where to raise this question in terms of product and technology.

McCalpinJohn
Honored Contributor III

There is no way to absolutely guarantee a single 64-Byte packet, but if you use a Write-Combining memory type and issue a small number of consecutive writes (e.g., two 32-Byte AVX/AVX2 stores) to 64 Bytes starting at a 64-Byte-aligned address, then you will get a single 64-Byte PCIe transaction *almost* all the time. (The reasons are complex, but ultimately not relevant -- the device must be able to handle partial block transfers as well as the desired full 64-Byte transfers.)
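A minimal sketch of that store sequence, assuming dst points into a WC-mapped MMIO region and is 64-Byte-aligned (the function and pointer names are mine; compile with -mavx):

#include <immintrin.h>

/* Write one 64-Byte block to a WC-mapped MMIO address.
 * Assumes dst is 64-Byte-aligned and mapped write-combining; the two
 * back-to-back 32-Byte non-temporal stores fill one write-combining
 * buffer, which should be emitted as a single 64-Byte transaction. */
static inline void wc_store_64(void *dst, const void *src)
{
    __m256i lo = _mm256_loadu_si256((const __m256i *)src);
    __m256i hi = _mm256_loadu_si256((const __m256i *)((const char *)src + 32));
    _mm256_stream_si256((__m256i *)dst, lo);
    _mm256_stream_si256((__m256i *)((char *)dst + 32), hi);
    _mm_sfence();  /* flush the WC buffer so the data reaches the device,
                      and keep later stores ordered behind it */
}

Issuing the two stores back to back minimizes the window in which an external event could flush a half-filled buffer, per the caveats below.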

Note that the memory type depends on both the MTRR and the PAT for the address in question. This is described in Chapter 11 of Volume 3 of the Intel 64 and IA-32 Architectures Software Developer's Manual (document 325384-059, June 2016). Table 11-7 shows six different combinations of MTRR and PAT values that result in the Write-Combining memory type. One or more of these may be more convenient than the others in your particular situation.

Note also that the "streaming" or "non-temporal" store instructions won't generate streaming stores if the memory type is not WC (or WB, which is not allowed for MMIO regions).  The streaming/non-temporal store instructions are not required to generate write-combining if the memory type is WC, but they are more compact and this should reduce the probability of taking an interrupt in the middle of a sequence of stores that fill a 64-Byte write-combining buffer.  The discussion in Section 11.3.1 of Volume 3 of the SWDM needs to be read very carefully.  Some possible caveats:

  • Section 11.3.1 says that a full WC buffer will be written as a single burst, but it is not clear whether this section applies to MMIO transactions.
  • Executing two 32-Byte stores takes a minimum of 2 cycles on your Haswell processor, so it is possible that some external event will cause the write-combining buffer to be flushed after the first 32-Byte store but before the second.

In theory, the PCIe controller could merge multiple consecutive 64-Byte transfers into a larger PCIe transfer (e.g., 128 Byte or 256 Byte, if allowed by the PCIe maximum transfer size handshaking), but I have not been able to find any documentation on whether such a feature exists or is controllable.  I have only done performance measurements for write-combining MMIO on a small number of systems, and all were consistent with a 64 Byte payload size for processor-driven writes to MMIO.

Anil_A_1
Beginner

Thanks, John, for your inputs.

Any idea how to achieve this on a Linux system, i.e., how to map the MMIO addresses or a region of addresses as WC?

Regards, -Anil

McCalpinJohn
Honored Contributor III

The usual way to do this is in a device driver (which runs in the kernel). To set up the mapping for the kernel to use, just use the "ioremap_wc()" interface and it will make sure that the MTRR and PAT settings are correct. I think that recent kernels use "remap_pfn_range()" to create a mapping for user-space access to the MMIO area, but I have a great deal of trouble following all of the changes in kernel function names and their ever-changing locations in the kernel source tree.
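For what it's worth, a minimal sketch of the driver side, assuming the MMIO region is BAR 0 of a PCI device and a character-device mmap handler for user-space access (function names and structure are illustrative; error handling and bounds checks trimmed):

#include <linux/module.h>
#include <linux/pci.h>
#include <linux/mm.h>

static void __iomem *bar0;          /* kernel-side WC mapping */
static resource_size_t bar0_start;  /* physical base of BAR 0 */
static resource_size_t bar0_len;

static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
        bar0_start = pci_resource_start(pdev, 0);
        bar0_len   = pci_resource_len(pdev, 0);
        bar0       = ioremap_wc(bar0_start, bar0_len);  /* WC memory type */
        return bar0 ? 0 : -ENOMEM;
}

/* mmap handler, hooked up via file_operations on a char device, so a
 * user-space process gets a WC mapping of the same physical region. */
static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
        vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
        return remap_pfn_range(vma, vma->vm_start,
                               bar0_start >> PAGE_SHIFT,
                               vma->vm_end - vma->vm_start,
                               vma->vm_page_prot);
}

With the region mapped write-combining in both places, the two-store sequence from the earlier reply can then be issued from user space.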

Aside: There are some weird comments in the Linux kernel documentation (Linux/Documentation/x86/mtrr.txt) about "phasing out" MTRRs -- this is grossly misleading and confusing.  Linux can't "phase out" MTRRs -- they are part of the hardware and they have to be programmed correctly.  I think what they are trying to say is that they are "phasing out" the explicit use of the MTRR interface.  This is perfectly reasonable -- the effective memory type is determined by the combination of the MTRR and PAT settings in a very complex way, and the kernel interfaces should be based on the desired memory type, with the MTRR and PAT handling done in a consistent manner at a lower level.
 
