Hello, I am using the Terasic DE5-Net FPGA board with a Stratix V FPGA. It is connected to the host via PCIe.
To obtain a high throughput, I want to read and write buffers simultaneously, but unfortunately OpenCL executes the commands consecutively. Is it possible to read and write simultaneously at all and, if so, how can I do it?
I have one device and create one context for it. Furthermore, I create one queue for the read command (clEnqueueReadBuffer) and another for the write command (clEnqueueWriteBuffer). Both are called as non-blocking (CL_FALSE).
I am currently reviewing the forum for open questions and found this thread. I apologize that no one has answered your question. Since it has been a while since you posted it, I'm wondering whether you have found the answer. If not, please let me know and I will try to find someone to assist you. Please expect some delay in response, as most of our agents are out of office for the year-end holidays. Thank you.
Indeed, you can read from and write to two different buffers simultaneously. You should create two queues, one for each operation, and set the third parameter in clEnqueueRead/WriteBuffer to CL_FALSE so that the operations are performed in a non-blocking fashion. You can then use OpenCL events to synchronize the operations. Also, this topic should probably be in the “FPGA Design Tools” section.
Thank you very much for your answer.
Unfortunately, I have not found an answer to my problem. I already use two different queues and perform the read/write operations in a non-blocking fashion, as HRZ recommends. But when I analyze the execution times, I can see that all commands were executed consecutively.
I also asked Terasic support, but they haven't answered yet.
So I would really appreciate it if you could help me.
I am looking at this thread now. May I know which version of the compiler you are using?
Also, I understand you are looking at the execution by analyzing the OpenCL profiler result. Is this correct?
Here is my host program. It calls a kernel that does nothing.
The program creates two buffers for writing and two for reading. Then it performs the following tasks:
The tasks in each line should be performed simultaneously. But the following times show that the tasks were performed consecutively:
            Start [ns]    End [ns]  Duration [ns]  Transfer rate [GB/s]
1. Write:            0   391938194      391938194              1.27571
1. Kernel:   392105739  2001510415     1609404676              0.310674
1. Read:    2474213273  2768932996      294719723              1.69653
2. Write:   2001533350  2289509772      287976422              1.73625
2. Kernel:  2768969091  3323809512      554840421              0.90116
2. Read:    4051312696  4233617466      182304770              2.74266
3. Write:   2289595551  2474174001      184578450              2.70888
3. Kernel:  3323830854  4051289376      727458522              0.687324
3. Read:    4233661025  4415740370      182079345              2.74606
Do you know how to overlap the tasks?
Thank you very much.
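The claim that the commands ran one after another can be checked mechanically from the profiler timestamps. The following is a minimal C++ sketch, not part of the original host program: the names Command, profiled_commands, count_overlapping_pairs and transfer_rate_gb_per_s are hypothetical helpers introduced here for illustration, and the timestamps are copied from the table in the post above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One profiled command: a label plus start and end timestamps in nanoseconds.
// (Hypothetical helper type, not taken from the original host program.)
struct Command {
    std::string label;
    std::uint64_t start_ns;
    std::uint64_t end_ns;
};

// Timestamps copied from the table posted in the thread.
std::vector<Command> profiled_commands() {
    return {
        {"1. Write",  0ULL,          391938194ULL},
        {"1. Kernel", 392105739ULL,  2001510415ULL},
        {"1. Read",   2474213273ULL, 2768932996ULL},
        {"2. Write",  2001533350ULL, 2289509772ULL},
        {"2. Kernel", 2768969091ULL, 3323809512ULL},
        {"2. Read",   4051312696ULL, 4233617466ULL},
        {"3. Write",  2289595551ULL, 2474174001ULL},
        {"3. Kernel", 3323830854ULL, 4051289376ULL},
        {"3. Read",   4233661025ULL, 4415740370ULL},
    };
}

// Count pairs of commands whose [start, end) time ranges overlap.
// A result of 0 means the commands were fully serialized.
std::size_t count_overlapping_pairs(std::vector<Command> cmds) {
    for (const Command& c : cmds) {
        assert(c.end_ns >= c.start_ns);
    }
    std::sort(cmds.begin(), cmds.end(),
              [](const Command& a, const Command& b) {
                  return a.start_ns < b.start_ns;
              });
    std::size_t pairs = 0;
    for (std::size_t i = 0; i < cmds.size(); ++i) {
        // After sorting by start time, later commands overlap command i
        // exactly while they start before command i ends.
        for (std::size_t j = i + 1;
             j < cmds.size() && cmds[j].start_ns < cmds[i].end_ns; ++j) {
            ++pairs;
        }
    }
    return pairs;
}

// Bytes per nanosecond is numerically equal to GB/s (with 1 GB = 1e9 bytes).
double transfer_rate_gb_per_s(std::uint64_t bytes, std::uint64_t duration_ns) {
    assert(duration_ns > 0);
    return static_cast<double>(bytes) / static_cast<double>(duration_ns);
}
```

For the posted timestamps, count_overlapping_pairs(profiled_commands()) finds no overlapping pair, which matches the observation that everything was serialized. The buffer size is not stated in the thread; multiplying each duration by its reported rate suggests roughly 500,000,000 bytes per transfer, and transfer_rate_gb_per_s(500000000, 391938194) reproduces the reported rate of about 1.2757 GB/s for the first write under that assumption.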
Indeed, it seems everything is serialized, even though your code looks correct to me. I am not sure about the implementation of the "checkError" function since it is in a header. Try commenting out the error checking; maybe it has some serializing effect. I have personally managed to parallelize execution of two different kernels on one FPGA using separate queues; I would assume it should also work for two buffer operations unless there is some limitation in Altera/Intel's run-time (e.g. not allowing two simultaneous DMA operations).
Intel(R) FPGA SDK for OpenCL(TM), 64-Bit Offline Compiler
Version 18.1.0 Build 625 Standard Edition
Copyright (C) 2018 Intel Corporation
g++ (GCC) 5.4.0
on CentOS 6.10.
Yes, I'm using the OpenCL profiler to analyze the result.
While working with my FPGA board (DE5-Net from Terasic with a Stratix V FPGA from 2012), a few more questions came up:
I want to implement an algorithm for bioinformatics. According to our theoretical analysis, the transfer rate from the host to the board via PCIe will be the bottleneck, so I need to stream into the FPGA at the peak performance of PCIe. Furthermore, I have to execute several kernels in parallel. Is it possible that my board is simply too old and the available OpenCL implementation does not satisfy this demand? If my board is too old, could you recommend one that is more appropriate for me?
Furthermore, I'm wondering whether you have a redbook for programming this board with OpenCL and full documentation of the exact OpenCL implementation.
Because the OpenCL implementation for my board only supports PCIe Gen2, I also took a look at Intel HLS because PCIe Gen3 is available there. Do you have full documentation of it? I didn't find any.
If you are sure your bottleneck is going to be the PCI-E transfer, there is no point in accelerating your application on a PCI-E-attached accelerator, be it FPGA, GPU or anything else. Running it on a CPU could be the best solution since the PCI-E transfer will be avoided. Furthermore, all OpenCL-capable Stratix V and Arria 10 boards that I know of are limited to 8x PCI-E while you can at least get 16x PCI-E on nearly all GPUs from the past few years which means they will be a better option for you.
The reason why you cannot run your kernels simultaneously likely has very little to do with the board you are using. There is either something in your host code preventing your kernels from running in parallel or there is some limitation in Altera/Intel's OpenCL run-time which is board-independent. As I mentioned in my previous reply, I have personally run kernels in parallel successfully on the same board. You can find the design here (v8 kernel):
Are you sure the OpenCL implementation of your board only supports PCI-E Gen 2.0? My Stratix V board is installed on a machine that only supports Gen 2.0 and hence, it has to run at Gen 2.0, but my Arria 10 board runs at Gen 3.0 on a newer motherboard without any issue. Maybe your motherboard doesn't support Gen 3.0?
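To put the lane count and generation discussed above into numbers, the following C++ sketch computes the theoretical one-direction PCIe bandwidth from the standard line rates and encodings (Gen1: 2.5 GT/s with 8b/10b, Gen2: 5 GT/s with 8b/10b, Gen3: 8 GT/s with 128b/130b). The function name pcie_peak_gb_per_s is a hypothetical helper introduced here, and the result ignores packet and protocol overhead, so real transfers will be slower.

```cpp
#include <cassert>
#include <stdexcept>

// Theoretical one-direction PCIe bandwidth in GB/s (1 GB = 1e9 bytes),
// accounting only for line encoding, not for packet or protocol overhead.
// (Hypothetical helper for illustration.)
double pcie_peak_gb_per_s(int generation, int lanes) {
    assert(lanes > 0);
    double gigatransfers_per_s = 0.0;
    double encoding_efficiency = 0.0;
    switch (generation) {
        case 1:  // 2.5 GT/s per lane, 8b/10b encoding
            gigatransfers_per_s = 2.5;
            encoding_efficiency = 8.0 / 10.0;
            break;
        case 2:  // 5 GT/s per lane, 8b/10b encoding
            gigatransfers_per_s = 5.0;
            encoding_efficiency = 8.0 / 10.0;
            break;
        case 3:  // 8 GT/s per lane, 128b/130b encoding
            gigatransfers_per_s = 8.0;
            encoding_efficiency = 128.0 / 130.0;
            break;
        default:
            throw std::invalid_argument("unsupported PCIe generation");
    }
    // Each transfer carries one bit per lane; divide by 8 for bytes.
    return gigatransfers_per_s * encoding_efficiency * lanes / 8.0;
}
```

Under these assumptions, Gen2 x8 gives 4.0 GB/s, Gen3 x8 about 7.88 GB/s, and Gen3 x16 about 15.75 GB/s. If the link in question really runs at Gen2 x8, which the thread does not confirm, the best measured rate of about 2.75 GB/s would be roughly 69% of the theoretical figure.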
Terasic's documentation for the DE5-Net board is here:
Intel FPGA SDK for OpenCL's documents are here:
Intel HLS documents are here:
These links include all the official documents available.
Looking at the Stratix 10 Devkit initialization guide provided with the latest compiler version, 18.1, I can see that the "aocl diagnose" result shows Gen3x8.
We don't have information on when PCIe Gen3 x16 will be supported.