Hi everybody,

We have a piece of software that uses a double-buffering pipeline to overlap FPGA execution, PCI transfers, and CPU-side computation. The software is quite throughput-intensive: in each cycle of the pipeline we perform many PCI transfers and execute many kernels on the FPGA. Ideally, we would keep both the PCI bus and the FPGA as busy as possible. We know that our FPGA has only one DMA engine, so transfers to and from the FPGA will be serialized, but we can live with that. However, we see a few issues:

1. Sometimes, while a PCI transfer is in flight, no kernel execution is scheduled on the FPGA until the transfer completes. The two tasks are independent, so the completion of one should not stall the execution of the other. (See the red rectangles in the attached first.png file.)
2. The reverse also happens: no PCI transfer is executed until the kernels currently running on the FPGA have completed. (See the orange rectangle in the attached first.png file.)
3. [The major problem] After a few hundred cycles (thousands of kernels and PCI transfers), time gaps start to appear between one task and the next. The gaps get worse and worse over time and eventually become unacceptable, making a cycle last around 30 seconds instead of the usual 3.5 to 4 seconds. The software does run for around one hour, performing hundreds of thousands of PCI transfers and kernel executions, but we do not believe this should happen, and it does not happen when we use a GPU. We have no idea whether this is a problem with the driver or with the FPGA itself, but we suspect something like "the driver keeps a list of events/objects and has to traverse the whole list every time, so it gets slower and slower the more tasks you submit." (See the purple rectangle in the attached second.png file.)

We would like to know if any of you have experienced similar problems.
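The claim in issues 1 and 2, that a transfer and a kernel in the same cycle are independent, can be made concrete with a minimal sketch of the buffer assignment. It assumes a two-buffer ("ping-pong") scheme in which, in cycle i, the PCI transfer fills buffer i % 2 while the FPGA kernels work on buffer (i - 1) % 2, the data transferred in the previous cycle. The function name `schedule` and this exact assignment are illustrative assumptions, not our actual code.

```python
# Hypothetical two-buffer ("ping-pong") schedule: in cycle i the PCI transfer
# fills buffer i % 2 while the FPGA kernels consume buffer (i - 1) % 2,
# i.e. the data that was transferred during the previous cycle.
NUM_BUFFERS = 2


def schedule(num_cycles):
    """Return (cycle, transfer_buffer, kernel_buffer) tuples.

    kernel_buffer is None in cycle 0 because nothing has been transferred yet.
    """
    plan = []
    for i in range(num_cycles):
        transfer_buf = i % NUM_BUFFERS
        kernel_buf = (i - 1) % NUM_BUFFERS if i > 0 else None
        plan.append((i, transfer_buf, kernel_buf))
    return plan


for cycle, xfer, kern in schedule(4):
    print(f"cycle {cycle}: PCI transfer -> buffer {xfer}, kernels on buffer {kern}")
```

In every cycle after the first, the two operations touch different buffers. From a data-dependency point of view, neither has to wait for the other. Whether the runtime actually overlaps them still depends on how they are enqueued, for example on separate command queues.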
Also, any ideas, comments, and suggestions are welcome!

OS: CentOS 7.4
FPGA board: Nallatech 510T
Software environment: Intel FPGA SDK v17.1 build 270 and BSP R001.005.0004
Following our intuition about problem #3, we found an easy workaround: if we release and recreate all the command queues (clReleaseCommandQueue() and clCreateCommandQueue()) at every cycle of our pipeline, the problem disappears. We have not yet been able to check whether this also affects other boards; we will probably do so later this week.
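The hypothesis in problem #3 and the effect of the workaround can be illustrated with a toy Python cost model. It is not based on any knowledge of the Intel FPGA runtime or the BSP. The idea that each enqueue walks an ever-growing per-queue history list, the `ToyQueue` class, `run`, and all the constants are hypothetical.

Under that assumption, cycle time keeps growing with the number of commands ever submitted to the same queue. Recreating the queue every cycle keeps it flat, which is qualitatively what we see.

```python
# Toy cost model of the SUSPECTED behaviour, NOT of the actual Intel FPGA
# runtime: every enqueued command is appended to a per-queue history list,
# and each new enqueue pays a cost proportional to the list's length.
BASE_COST = 1.0            # hypothetical fixed cost per command (arbitrary units)
PER_ENTRY_COST = 0.001     # hypothetical cost per history entry traversed
COMMANDS_PER_CYCLE = 100   # hypothetical transfers + kernels per pipeline cycle


class ToyQueue:
    def __init__(self):
        self.history = []  # in this model, entries are never released

    def enqueue(self, command):
        cost = BASE_COST + PER_ENTRY_COST * len(self.history)
        self.history.append(command)
        return cost


def run(num_cycles, recreate_each_cycle):
    """Return the modeled duration of each pipeline cycle."""
    queue = ToyQueue()
    durations = []
    for cycle in range(num_cycles):
        if recreate_each_cycle:
            # Analogous to clReleaseCommandQueue() + clCreateCommandQueue():
            # the fresh queue starts with an empty history.
            queue = ToyQueue()
        durations.append(
            sum(queue.enqueue((cycle, n)) for n in range(COMMANDS_PER_CYCLE))
        )
    return durations


no_reset = run(50, recreate_each_cycle=False)
with_reset = run(50, recreate_each_cycle=True)
print(f"no reset:   first cycle {no_reset[0]:.2f}, last cycle {no_reset[-1]:.2f}")
print(f"with reset: first cycle {with_reset[0]:.2f}, last cycle {with_reset[-1]:.2f}")
```

If the real slowdown had this shape, any per-queue bookkeeping that is only freed when the queue is released would explain why the workaround helps. That remains a guess until confirmed by the vendor.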