FPGA Intellectual Property
PCI Express*, Networking and Connectivity, Memory Interfaces, DSP IP, and Video IP

Slow PCIe cycles

Honored Contributor II

I'm seeing very slow PCIe cycles from a PPC master to Altera FPGA Avalon MM slave peripherals. Typical times are 4us for a read and 7us for a write - which seem extremely pedestrian to me! 


Does anyone know if these values are typical, or whether there is something very awry in our system? 


Since these are all single 32-bit transfers (reads/writes of uncached locations), are we likely to see a significant difference if we use the newer 32-bit slave-only PCIe block? 


If I try hard enough I can map the memory cached and arrange to do cache-line reads and writes (32 bytes) for buffer transfers, but I was hoping to get moderate throughput without having to do that, and it doesn't help with accesses to control words.
6 Replies
Honored Contributor II

Hello DSL, Did you get your PCIe problem resolved? 


I have inherited a project that includes a Cyclone IV GX part acting as a bridge between PCIe and an 8-bit A/D bus. 


A single byte read cycle takes less than 200ns on the A/D bus, but 3.5 us on the PCIe end. 


Any tips on where to begin looking to improve the PCIe throughput would be appreciated. 


The Altera part is connected via PCIe to an Atom-based system running Windows 7.
Honored Contributor II

Only by getting the host to request more bytes in each PCIe request. 

In my case the host is a small PPC running Linux, and I was able to write a device driver for the DMA engine embedded in the PPC's PCIe block. 


Not sure what you can do with an atom and windows 7. 

You won't be able to affect the latency, but you can improve the throughput: reading into a wider register will probably still be a single PCIe request. You might be able to use one of the XMM (or whatever they are called) SIMD integer registers to read more bytes in one transfer. 

(Do save/restore the register though.)
Honored Contributor II

I have measured RC read -> Endpoint internal memory,  

Endpoint read -> RC ( system ) memory 

RC read -> Endpoint Configuration register. 


All show long latencies, and the last one is a reference, since the Configuration register is local and no fabric or endpoint memory read is involved. 

Depending on the PCIe core clock, I'm thinking the turn-around number of clocks at the Endpoint may be 40 clocks. 


So what we have in the case of the Configuration register reads is, say, 483 ns, and the PCIe core is running at 100 MHz, so that translates to approx. 483/10 = 48 clocks. I measure this from when the Configuration Read is seen on the downstream link (X1, Gen1) to when the Read Completion is first seen on the upstream link. 


Now for my problem, I figured it was time to run this in ModelSim but have not been successful. 


1. I get errors in Qsys Generate if I select "Testbench Simulation Model" = Verilog per the PCIe user guide example (with None, Generate runs error-free). 

2. I am able to run ModelSim after "do msim_setup.tcl", "ld_debug", "run 140000 ns"; however, the simulation apparently completes immediately. 

I guess the environment is set up, but an actual testbench is missing that would go through training, config reads, and memory reads/writes. I was thinking such an example would exist. 


Any ideas on 1 or 2 ? 


Thanks, Bob.
Honored Contributor II



I am experiencing the same kind of slow reads (large latency, >2.0 us), though writes show no such latency. 

Did you eventually explain your issue and find a way of reducing this latency? I am no expert on PCIe, but I see some QoS capabilities mentioned in the documentation. 




Honored Contributor II

Hi Yann, 


I did three measurements ... RC -> Endpoint IMEM read, Endpoint -> RC DDR3 read and finally RC -> Endpoint config register read. 


The final measurement was the reality check where I was reading a local register to the endpoint PCIe IP. 

This takes out any fabric and actual memory read time. 


From what I have been told, the latency can be from 100 ns to 1 us; that is probably for a high-speed link. 

I was running Gen1 X1 and measured about 1 us, with full observability of when the Read request was received and when the Read Completion started to be sent, so I was able to decouple the actual link physical transport times. See attached image. 


Back to the third case of RC -> Endpoint config register read: if the Hard IP is running at, say, 50 MHz and the processing time is about 1 us, then 1000 ns / 20 ns = approx. 50 clocks. That would be 50 clocks to parse/check the TLP read command, fetch the register to be read, put it in an output queue, form the TLP for the Read Completion, send it to the PIPE I/F, and have the PHY start to send it out. 50 clocks may be reasonable. If the PCIe IP is running at 100 MHz, then it would be 100 clocks, which sounds like a lot for processing a read request. 


You can make some similar measurements to see. 2 us is on the high side. Is it a Read being processed by the RC? In that case, try to make sure there is no contending "fabric" traffic. 


From a performance perspective, read latency at the RC is a factor; one improvement is in the number of outstanding reads you can support. 


Best Regards, Bob.
Honored Contributor II

Possible ways to improve latency would be: 

1) Increase the application clock frequency to 125 MHz (if the existing setting is only 62.5 MHz). 

2) Configure the core to higher speed. 

3) Configure the core to wider lanes. 


Memory reads have longer latency because a Completion packet must be returned.