Why can't we achieve PCIe write speeds above 250 MB/sec on a PCIe Gen 2 x4 interface using the Arria 10 PCIe Hard IP? We should be at 2000 MB/sec.

vtoka · ‎09-07-2018

We are using a system in which a CPU and an FPGA (Arria 10) communicate over a PCIe Gen 2.0 x4 link. On the FPGA side there is a DDR3 module. In simple write tests our speeds max out at 250 MB/sec. Given our setup we should be getting up to 2000 MB/sec. The DDR3 is not to blame, because I get the same speeds with On-Chip memory. I've played around with all sorts of settings in the PCIe Hard IP and cannot get the speeds any higher (I can make them lower, etc.). I am using the Avalon-MM with DMA interface in the IP. Is there a fundamental concept we are missing, or some connection on the IP? Is there something on the CPU side we are not doing? Any suggestions on why we are only at around 10% of capacity? Any suggestions or pointers will help tremendously, thank you!

SengKok_L_Intel · ‎09-12-2018

Hi,

The theoretical throughput for PCIe Gen2 x4 is 2 GB/s.
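For reference, the 2 GB/s figure follows from the Gen2 line rate (5 GT/s per lane) and its 8b/10b encoding. The sketch below works through that arithmetic and also estimates how much of the link is left for payload once per-TLP overhead is counted. The function names are made up for illustration, and the 24-byte overhead is an assumption (4 DW header, sequence number, LCRC and framing, no ECRC); DLLP traffic such as ACKs and flow-control updates is not modelled.

```python
def raw_link_bandwidth_mb_s(gt_per_s: float, lanes: int, encoding: float = 8 / 10) -> float:
    """One-direction bandwidth in MB/s (1 MB = 1e6 bytes) after line encoding."""
    bits_per_s = gt_per_s * 1e9 * lanes * encoding
    return bits_per_s / 8 / 1e6


def payload_efficiency(payload_bytes: int, overhead_bytes: int = 24) -> float:
    """Fraction of each TLP that is payload.

    overhead_bytes is an assumed value: 16 (4 DW header) + 2 (sequence number)
    + 4 (LCRC) + 2 (start/end framing). Use 20 for a 3 DW header.
    """
    return payload_bytes / (payload_bytes + overhead_bytes)


gen2_x4 = raw_link_bandwidth_mb_s(5.0, 4)  # about 2000 MB/s
for mps in (128, 256, 512):
    usable = gen2_x4 * payload_efficiency(mps)
    print(f"max payload {mps:3d} B: ~{usable:.0f} MB/s of payload")
```

Under these assumptions a small maximum payload size costs a noticeable share of the link, but nowhere near enough to explain a drop to 250 MB/s on its own.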

According to AN829, the Cyclone 10 PCIe Gen2 x4 achieves 1.66 GB/s. The measured numbers are lower than the theoretical ones because of DMA performance limitations and the way the throughput is measured.

https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/an/an829.pdf

Typical factors affecting throughput:

Application logic - does not write data to the HIP (Hard IP) fast enough, or cannot sink data from the HIP fast enough
PCIe link stability - the link has a high BER, which sends it into Recovery frequently and reduces the bandwidth of the link
Host - does not return credits to the FPGA fast enough, or takes a long time to respond to the FPGA

General debug flow to understand link performance:

Determine the direction of data - Data moves from host to the FPGA or vice versa
Determine the initiator of the transfer - Host or the FPGA initiates the transfer
Consider how the performance is measured - measured by hardware or software
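On the last point, a software measurement usually includes driver setup, descriptor programming and completion latency, so small transfers can look much slower than the link really is. A minimal sketch of a software-side measurement is shown below. The `transfer` callable is hypothetical and stands in for whatever blocking DMA write call your driver provides; `fake_transfer` is only a memory copy so that the example runs on its own.

```python
import time


def measure_throughput_mb_s(transfer, nbytes: int, repeats: int = 5) -> float:
    """Time transfer(nbytes) several times and return the best rate in MB/s.

    transfer must block until all nbytes have been moved, otherwise the
    result only measures how fast the request was queued.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        transfer(nbytes)
        best = min(best, time.perf_counter() - start)
    if best <= 0:
        raise ValueError("transfer finished too quickly to time; use a larger size")
    return nbytes / best / 1e6


def fake_transfer(nbytes: int) -> None:
    """Stand-in for a real driver call (assumption): copies host memory only."""
    src = bytes(nbytes)
    dst = bytearray(nbytes)
    dst[:] = src


# Sweep the transfer size: fixed per-transfer overhead dominates the small sizes.
for size in (4 * 1024, 64 * 1024, 1024 * 1024, 16 * 1024 * 1024):
    rate = measure_throughput_mb_s(fake_transfer, size)
    print(f"{size:>9d} bytes: {rate:10.1f} MB/s")
```

If the rate keeps climbing as the transfer size grows, the measurement is dominated by per-transfer overhead rather than by the link, and a hardware counter in the FPGA would give a cleaner number.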

For example:

Symptom -> Host writes data to the FPGA too slowly

Root cause -> The Rx buffer for posted TLPs in the HIP is too small

Debug -> Use an external PCIe analyzer to check whether the host has to wait for credits from the HIP on each transfer.

Potential Solution -> Change the RX buffer allocation in the Qsys GUI to High or Max
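To see why a small posted Rx buffer can cap write throughput, here is a rough model of PCIe flow control. Each posted write consumes one posted-header credit plus one posted-data credit per 16 bytes of payload, and the sender stalls once the advertised credits run out. The credit counts and the credit-return time used below are hypothetical, not values read from the Arria 10 HIP, and the model assumes credits come back one fixed interval after they are consumed.

```python
import math

DATA_CREDIT_BYTES = 16  # one PCIe flow-control data credit covers 4 DW


def credits_per_write(payload_bytes: int) -> tuple[int, int]:
    """(header credits, data credits) consumed by one posted write TLP."""
    return 1, math.ceil(payload_bytes / DATA_CREDIT_BYTES)


def max_writes_in_flight(ph_credits: int, pd_credits: int, payload_bytes: int) -> int:
    """How many writes the sender can issue before it must wait for credits."""
    hdr, data = credits_per_write(payload_bytes)
    return min(ph_credits // hdr, pd_credits // data)


def credit_limited_mb_s(ph_credits: int, pd_credits: int,
                        payload_bytes: int, credit_return_us: float) -> float:
    """Upper bound on write throughput when credits limit the sender.

    Bytes per microsecond equals MB/s (1 MB = 1e6 bytes).
    """
    in_flight = max_writes_in_flight(ph_credits, pd_credits, payload_bytes)
    return in_flight * payload_bytes / credit_return_us


# Hypothetical numbers: 256-byte writes, credits returned after 1 microsecond.
for pd in (16, 64, 256):
    bound = credit_limited_mb_s(ph_credits=16, pd_credits=pd,
                                payload_bytes=256, credit_return_us=1.0)
    print(f"{pd:3d} posted data credits -> at most {bound:.0f} MB/s")
```

With these assumed numbers, a buffer that holds only one maximum-size write limits the link to a fraction of its rate, which is the pattern an analyzer trace would show as the host waiting on credit updates between writes.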

Regards -SK Lim (Intel)