We have an FPGA (Intel Stratix V) connected to a Broadwell-D (1559) over a PCIe Gen3 x8 interface. The FPGA has 8 input channels of approx. 250MB/s each (~2GB/s aggregate input data), and we are trying to "push" this data to CPU RAM using a DMA engine inside the FPGA. The theoretical PCIe bandwidth (about 8GB/s) is therefore roughly 4 times the data rate.
For some reason, the bandwidth we actually achieve is much lower than expected, about 500MB/s: only 2 of the 8 channels can run at full bandwidth.
Our implementation consists of 16 DMA channels (2 per input channel). Each physical channel uses 2 separate DMAs, one for the header and one for the data itself; this is required by the application.
Each DMA asks the application for a 24MB memory window and pushes data continuously, raising an interrupt for every 4MB of data sent. This allows the CPU to fetch and clear the data buffer before it is overwritten in the next cycle. This means that for each channel the CPU gets around 60 interrupts per second, and should be getting ~480 interrupts per second when all 8 channels are operating.
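As a quick sanity check, the figures above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes "4MB" and "24MB" mean 4 MiB and 24 MiB, that 250MB/s is 250e6 bytes/s, and it counts only the Gen3 128b/130b line coding, not TLP or flow-control overhead. All variable names are illustrative.

```python
# Back-of-the-envelope figures for the setup described above.
LANES = 8
GEN3_TRANSFERS_PER_S = 8e9      # PCIe Gen3 raw rate per lane (8 GT/s)
ENCODING = 128 / 130            # Gen3 128b/130b line coding

CHANNELS = 8
CHANNEL_RATE = 250e6            # bytes/s per input channel (from the post)
SEGMENT = 4 * 2**20             # bytes per interrupt; assumes 4MB = 4 MiB
WINDOW = 24 * 2**20             # bytes per DMA window; assumes 24MB = 24 MiB

# Link bandwidth after line coding only (packet overhead is ignored).
link_bytes_per_s = LANES * GEN3_TRANSFERS_PER_S * ENCODING / 8
input_bytes_per_s = CHANNELS * CHANNEL_RATE
headroom = link_bytes_per_s / input_bytes_per_s

# One interrupt per 4MB segment on each channel.
irq_per_channel = CHANNEL_RATE / SEGMENT
irq_total = CHANNELS * irq_per_channel

# Time between interrupts, and time until a window is overwritten.
segment_period_ms = 1000 / irq_per_channel
window_wrap_ms = 1000 * WINDOW / CHANNEL_RATE

print(f"link: {link_bytes_per_s / 1e9:.2f} GB/s, headroom: {headroom:.2f}x")
print(f"interrupts/s: {irq_per_channel:.1f} per channel, {irq_total:.0f} total")
print(f"segment every {segment_period_ms:.1f} ms, window wraps in {window_wrap_ms:.1f} ms")
```

Under these assumptions the link has close to 4x headroom over the aggregate input, and the CPU has several segment periods to drain a window before it wraps.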
So, our basic questions are:
- Looking at this implementation method, is there something in it that can explain the low performance?
- Is there profiling software that can help us better understand the pipeline, so that we may be able to solve this?
1. I assume the multi-channel DMA is your own design. I don't know its details, so I can't offer specific suggestions about the design itself.
2. I suggest trying a polling scheme instead of an interrupt scheme. I suspect that this many interrupts may not be handled by the system in a timely manner, which could be the reason for the low performance.
3. You can also use SignalTap to capture some signals and check their timing relationships to help analyze the issue.
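To illustrate what a polling scheme could look like, here is a minimal software-only sketch in Python. It is not your driver or DMA design: `FakeDma`, its `completed` counter, and `Poller` are hypothetical names, and it assumes the DMA engine exposes a "segments written" counter that the host can read instead of waiting for an interrupt. The ring has 6 slots, matching a 24MB window split into 4MB segments.

```python
SEGMENTS = 6  # 24MB window / 4MB per segment, as described in the question


class FakeDma:
    """Stand-in for the FPGA DMA engine: it only counts completed segments."""

    def __init__(self):
        self.completed = 0  # hypothetical "segments written" counter

    def push_segment(self):
        # In hardware this would happen after 4MB has been written to RAM.
        self.completed += 1


class Poller:
    """Drains finished segments by reading the counter, with no interrupts."""

    def __init__(self, dma):
        self.dma = dma
        self.consumed = 0  # segments the CPU has already processed

    def poll_once(self):
        """Return the ring slots that became ready since the last poll."""
        pending = self.dma.completed - self.consumed
        if pending >= SEGMENTS:
            # The engine is already writing into the oldest unread slot.
            raise RuntimeError("overrun: unread data is being overwritten")
        ready = [(self.consumed + i) % SEGMENTS for i in range(pending)]
        self.consumed += pending
        return ready


if __name__ == "__main__":
    dma = FakeDma()
    poller = Poller(dma)
    for _ in range(3):
        dma.push_segment()
    print(poller.poll_once())  # slots completed so far
    print(poller.poll_once())  # nothing new since the last poll
```

One poll can pick up several finished segments at once, so the host does a fixed amount of work per poll rather than taking one interrupt per segment. The overrun check also shows how the host could detect that it is falling behind the DMA engine.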