We have a use case where the CPU (a Haswell Xeon, E5-2658) has to write 64 bytes of data to a device connected over the PCIe bus. On the CPU side, a user-space application does a memcpy from a local buffer to the memory-mapped address of the device. I believe the memcpy function may be copying 8 bytes at a time, thus generating PCIe transaction-layer packets (TLPs) that each carry only 8 bytes of data plus the control overhead.
Is there a way to ensure that the 64 bytes of data are packed into a single PCIe TLP and written on the bus?
There is no way to absolutely guarantee a single 64-Byte packet, but if you use a Write-Combining memory type and issue a small number of consecutive writes (e.g., 2 32-Byte AVX/AVX2 stores) to 64 Bytes starting at a 64-Byte-aligned address, then you will get a single 64-Byte PCIe transaction *almost* all the time. (The reasons are complex, but ultimately not relevant -- the device must be able to handle partial block transfers as well as the desired full 64-Byte transfers.)
Note that the memory type depends on both the MTRR and the PAT for the address in question. This is described in Chapter 11 of Volume 3 of the Intel Architectures Software Developer's Manual (document 325384-059, June 2016). Table 11-7 shows six different combinations of MTRR and PAT values that result in the Write-Combining memory type. One or more of these may be more convenient than the others in your particular situation.
Note also that the "streaming" or "non-temporal" store instructions won't generate streaming stores if the memory type is not WC (or WB, which is not allowed for MMIO regions). The streaming/non-temporal store instructions are not required to generate write-combining if the memory type is WC, but they are more compact and this should reduce the probability of taking an interrupt in the middle of a sequence of stores that fill a 64-Byte write-combining buffer. The discussion in Section 11.3.1 of Volume 3 of the SWDM needs to be read very carefully. Some possible caveats:
- Section 11.3.1 says that a full WC buffer will be written as a single burst, but it is not clear whether this statement applies to MMIO transactions.
- Executing two 32-Byte stores takes a minimum of 2 cycles on your Haswell processor, so it is possible that some external event will cause the write-combining buffer to be flushed after the first 32-Byte store but before the second.
In theory, the PCIe controller could merge multiple consecutive 64-Byte transfers into a larger PCIe transfer (e.g., 128 Byte or 256 Byte, if allowed by the PCIe maximum transfer size handshaking), but I have not been able to find any documentation on whether such a feature exists or is controllable. I have only done performance measurements for write-combining MMIO on a small number of systems, and all were consistent with a 64 Byte payload size for processor-driven writes to MMIO.
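To make the store sequence described above concrete, here is a minimal user-space sketch using two 32-Byte AVX streaming stores. This is only an illustration of the instruction sequence -- obtaining the WC-mapped MMIO pointer is a separate problem, and the function/variable names are my own, not from any particular API:

```c
#include <immintrin.h>  /* AVX intrinsics */

/* Copy one 64-Byte block with two back-to-back 32-Byte non-temporal
 * (streaming) stores.  When dst points into a WC-mapped region and is
 * 64-Byte aligned, this fills the write-combining buffer in consecutive
 * cycles, maximizing the chance of a single 64-Byte PCIe transaction.
 * dst must be at least 32-Byte aligned for _mm256_stream_si256. */
__attribute__((target("avx")))
static void copy64_streaming(void *dst, const void *src)
{
    __m256i lo = _mm256_loadu_si256((const __m256i *)src);
    __m256i hi = _mm256_loadu_si256((const __m256i *)((const char *)src + 32));
    _mm256_stream_si256((__m256i *)dst, lo);
    _mm256_stream_si256((__m256i *)((char *)dst + 32), hi);
    _mm_sfence();  /* order the WC writes and force the buffer to drain */
}
```

The `sfence` at the end guarantees ordering with respect to later stores; without it, the WC buffer may be flushed at an unpredictable later time. Note that, per the caveats above, even this sequence cannot absolutely guarantee a single 64-Byte TLP.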
I have an Intel IPS account, but I am not sure where to raise this question in terms of product and technology.
Thanks, John, for your inputs.
Any idea how to achieve this on a Linux system, i.e., how to map the MMIO addresses (or a region of addresses) as WC?
Regards, -Anil
The usual way to do this is in a device driver (that runs in the kernel). To set up the mappings for the kernel to use, just use the "ioremap_wc()" interface and it will make sure that the MTRRs and PATs are set up correctly. I think that recent kernels use "remap_pfn_range()" to create a mapping for user-space access to the MMIO area, but I have a great deal of trouble following all of the changes in the kernel function names and their ever-changing locations in the kernel source trees.
Aside: There are some weird comments in the Linux kernel documentation (Linux/Documentation/x86/mtrr.txt) about "phasing out" MTRRs -- this is grossly misleading and confusing. Linux can't "phase out" MTRRs -- they are part of the hardware and they have to be programmed correctly. I think what they are trying to say is that they are "phasing out" the explicit use of the MTRR interface. This is perfectly reasonable -- the effective memory type is determined by the combination of the MTRR and PAT settings in a very complex way, and the kernel interfaces should be based on the desired memory type, with the MTRR and PAT handling done in a consistent manner at a lower level.
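For completeness: if the BAR in question is prefetchable, recent Linux kernels also expose a write-combining mapping directly to user space via sysfs -- a `resourceN_wc` file appears next to the usual `resourceN` file under /sys/bus/pci/devices/. A minimal sketch of using it (the device path in the usage note below is a hypothetical example, and availability of the `_wc` file depends on the kernel and the BAR attributes):

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a PCI BAR with the Write-Combining memory type by mmap()ing the
 * sysfs resourceN_wc file (present only for prefetchable BARs on
 * kernels that support it).  Returns MAP_FAILED on any error. */
static void *map_bar_wc(const char *path, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return MAP_FAILED;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping remains valid after close() */
    return p;
}
```

Usage would look like `map_bar_wc("/sys/bus/pci/devices/0000:03:00.0/resource0_wc", 4096)`, where the bus address is hypothetical. This avoids writing a custom driver, but gives you less control than the ioremap_wc() approach described above.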