I'm working on developing a user-space PCIe driver for a Xilinx FPGA IP core on Linux.
I map the BAR0 registers in the kernel driver like this:
mmio_start = pci_resource_start(pdev, 0);
mmio_len = pci_resource_len(pdev, 0);
iomem = ioremap_nocache(mmio_start, mmio_len);
and access it from user space via mmap:
fd = open(filename, O_RDWR | O_SYNC);
map_base = mmap(0, map_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, target_base);
clock_gettime(CLOCK_REALTIME, &start);
read_result = *((uint64_t *) virt_addr);
clock_gettime(CLOCK_REALTIME, &end);
diff = 1000000000L * (end.tv_sec - start.tv_sec) + end.tv_nsec - start.tv_nsec;
When I measure the 64-bit read latency, it is always greater than 2 us.
clock_gettime(CLOCK_REALTIME, &start);
memread = _mm_stream_load_si128((__m128i *) virt_addr);
clock_gettime(CLOCK_REALTIME, &end);
diff = 1000000000L * (end.tv_sec - start.tv_sec) + end.tv_nsec - start.tv_nsec;
When I use the _mm_stream_load_si128 intrinsic instead, it is still about 2 us.
How can I improve PIO read performance here? All suggestions are welcome.
Additional system info:
CPU -> Intel(R) Xeon(R) CPU E3-1240 v3 @ 3.40GHz
> $ lspci -vvv
05:00.0 Ethernet controller: Silicom Denmark Device 0001
Subsystem: Silicom Denmark Device 0001
Physical Slot: 4
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR+ <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 16
Region 0: Memory at f7100000 (64-bit, non-prefetchable) [size=128K]
Capabilities:  Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities:  MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities:  Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 1024 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr+ FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Capabilities: [1a0 v1] Device Serial Number a3-c5-00-b8-84-49-94-80
Capabilities: [1c0 v1] #19
Kernel driver in use: fiberblaze
If the space has to be mapped as UC, then I don't think that there is any way to improve read performance. The processor operates in an extremely serialized fashion in this mode in order to guarantee that no accesses are speculative and that all transactions are executed exactly once and in program order (and typically with no overlap between transactions). There is also very likely a limit on the payload size for each transaction -- probably 32 bits or 64 bits.
If you can relax the ordering to WC, this allows speculative reads, which allows concurrency and/or larger payloads. Sometimes it is possible to get a speedup via speculation alone (though I have had trouble reproducing the benefits), but it is better to use the MOVNTDQA instruction (which was designed precisely for this case of "streaming reads" from WC-mapped space). The best description of the details is in the MOVNTDQA instruction description in Volume 2 of the Intel Architectures Software Developer's Manual.
I wrote up some notes on related topics at https://sites.utexas.edu/jdm4372/2013/05/
Thanks, John, for your explanation. I will try WC and the MOVNTDQA instruction to see how performance changes.
By the way, let me give more details on the problem. I'm trying to handle a data stream from FPGA to CPU. Since latency matters more than bandwidth, I decided to use PIO to access the data directly instead of spending time writing DMA descriptors and handling interrupts. The CPU polls specific addresses on the mmapped BAR, trying to get the next available data in sequential order to handle the stream. Any cache-related speculative read will grab data from the (BAR) FIFO, and the CPU will need to invalidate the cache line after consuming it.
So far it is clear what to do, but maintaining coherence for memory-mapped I/O this way seems risky. Do you have any other suggestions for keeping latency low?
This is pretty much the same approach that I put together for AMD's tightly-coupled accelerator program back in 2006-2008 (with some extensions discussed in my blog entries referenced above).
Latency is certainly problematic if you don't have a commitment from all of the engineering teams to support a particular architecture for core-to-device communication and synchronization. At AMD I typically worked with HyperTransport, rather than PCIe, mostly because I only had one layer of translation (from core instructions to HyperTransport interface transactions) to deal with. Going to PCIe would require a second translation layer designed and implemented by another design team (and in a different chip at the time). With Intel processors, transactions leaving the core+L1+L2 block generate IDI transactions on the mesh or ring interface targeting an IO block. The IDI protocol (which stands for "Intra-Die Interconnect" or "In-Die Interface" depending on the reference) is not well documented in public, though there is enough information in the uncore performance monitoring manuals to reverse-engineer many of the features. Unfortunately, understanding the protocol is only part of the problem -- it is often necessary to get into the guts of the implementation to understand how to best address performance issues.
When attempting to control communication on the interface between a core and an attached "device", there are lots of combinations of transaction [size, speculation, ordering] to consider, and processors typically only directly support a few of the combinations. The "loose" end is optimized for cache-line blocks, full speculation, limited ordering. The "tight" end is pessimized for correctness in all of the remaining cases -- 8-32-bit accesses, no speculation, full (serialized) ordering. In between you have WC, WT, WP memory modes which can sometimes provide a modest benefit. Of these, only WC is widely used, and almost exclusively in device drivers. None of them do what is really required for a tightly-coupled accelerator interface....
The problem is fundamentally an architectural one. The architecture of almost all current processors is the result of decades of "hacks" on the single-core, flat-memory architecture of the first microprocessors. Despite the many cores on a die, there is no explicit architectural support for communication. Instead, the architecture specifies a set of "ordering rules" that allow communication to occur as a side-effect of ordered memory writes. If we want communication and synchronization to be fast (for both cores and devices), the architecture must contain specific communication functions that are distinct from cache-coherent memory functions, so that the hardware can provide optimized implementations for these very different use cases.
Manually-maintained cache coherence can be tricky. Accesses to anything other than UC space are allowed to be speculative (past a predicted branch, or prefetched based on an access pattern, or just randomly selected to be loaded), so side effects cannot be allowed on reads. Speculation allows concurrency, but also allows the core to drop the data and request it again. There is no architectural support for the kind of metadata that would allow a core to "tag" a read as being a repeat of a previous read (to suppress or undo a speculative side effect), or to allow a read response to contain a valid/invalid bit (to allow an immediate response of invalid data, e.g., reading from an empty queue), or any of the myriad mechanisms used by hardware designers (but not accessible from the ISA). For cached MMIO (using the WT or WP memory types), the core can always perform a "flush after use" on the cache line. For streaming loads from WC space, it may take some experimentation to find a way to ensure that stale data does not stay in the core's read buffer after the device has modified the target line. The description of the MOVNTDQA instruction gives some clues, but these are typically not definitive.
Good luck with your project. It seems like you are on the right track, but don't be surprised if something occasionally does not work for inexplicable reasons.
With PCIe Gen 2 and a x4 lane configuration, the link speed is 5 GT/s. I used MOVNTDQA (from the SSE4.1 instruction set) with the WC memory type. It allows me to read in 64 B chunks, but a single 64 B read still takes about 1 us. That improves performance by a factor of 8 compared with UC, yet the read TLPs still go out serialized, just as with UC memory. I was not able to get speculative reads to create concurrency on WC.
I tried creating multiple reader threads on different cores, but there is still no concurrency in the PCIe read TLPs. When I watched the read operations coming from the CPU, I saw that the cores were not able to issue multiple outstanding read requests to the bus: only one read TLP at a time.
I'm curious whether I'm doing something wrong, since the WC memory type is documented as allowing speculative reads. Do you have any ideas about this?
This is not really surprising -- "allowing" speculation does not mean that an implementation will actually perform the speculation. Some implementations might speculate and other implementations might not. Similarly, Intel's description of MOVNTDQA "allows" multiple 64-Byte buffers but does not mandate them.
(I recall a case in which a processor was intended to support concurrent cache line transfers from MMIO space, but for which bug-fix microcode updates reduced the maximum concurrency to a single transfer. In another case, the processor supported a maximum of two concurrent cache line transfers from MMIO space (mapped WT or WP), with no documentation on the reason for the limit.)
The limitation of read concurrency with the WC memory type is the reason I developed a double-mapped approach. Device memory is mapped into system space twice -- once for read accesses (mapped WP or WT) and once for write accesses (mapped WC). Coherence has to be handled explicitly. This is discussed in two blog posts: