Can I read and write buffers simultaneously with OpenCL?

MN_ · ‎12-17-2018

Hello, I am using the Terasic DE5-Net FPGA board with a Stratix V FPGA. It is connected to the host via PCIe.

To obtain high throughput I want to read and write buffers simultaneously, but unfortunately OpenCL executes the commands consecutively. Is it possible to read and write simultaneously at all, and if so, how can I do it?

I have one device and create one context for it. Furthermore I create one queue for the read command (clEnqueueReadBuffer) and another for the write command (clEnqueueWriteBuffer). Both are called as non-blocking (CL_FALSE).

Nooraini_Y_Intel · ‎12-27-2018

Hi,

I am currently reviewing the forum for open questions and found this thread. I apologize that no one has answered your question yet. Since it has been a while since you posted it, I'm wondering whether you have found the answer. If not, please let me know and I will try to find someone to assist you. Please expect some delay in the response, as most of our agents are out of office for the year-end holidays. Thank you.

Regards,

Nooraini

MN_ · ‎01-07-2019

Hi,

thank you very much for your answer.

Unfortunately I have not found an answer to my problem. I already use two different queues and perform the read/write operations in a non-blocking fashion, as HRZ recommends. But when I analyze the execution times, I can see that all commands were executed consecutively.

I also asked Terasic support, but they haven't answered yet.

So I really would appreciate it if you could help me.

Regards

MN.

Nooraini_Y_Intel · ‎01-08-2019

Hi MN.,

Noted. I will need time to assign/find someone to assist you. Please do expect some delay in response from the assigned agent.

Thank you.

Regards,

Nooraini

HRZ · ‎01-12-2019

@MN. Can you post pseudocode, or better yet, all of your host code so that we can take a look at it and give you a more concrete solution?

MN_ · ‎01-15-2019

Hi,

here is my host program. It calls a kernel which does nothing.

The program creates two buffers for writing and two for reading. Then it performs the following tasks:

1. Write the first buffer
2. a) Write the second buffer, b) execute the kernel on the first buffer
3. a) Read the result of the first kernel execution, b) execute the kernel on the second buffer, c) write new data into the first buffer
4. a) Read the result of the second kernel execution, b) execute the kernel on the first buffer
5. a) Read the result of the third kernel execution

The tasks in each step should be performed simultaneously. But from the following times you can see that the tasks were performed consecutively:

Task        Start [ns]   End [ns]     Duration [ns]  Transfer rate [GB/s]
1. Write    0            391938194    391938194      1.27571
1. Kernel   392105739    2001510415   1609404676     0.310674
1. Read     2474213273   2768932996   294719723      1.69653
2. Write    2001533350   2289509772   287976422      1.73625
2. Kernel   2768969091   3323809512   554840421      0.90116
2. Read     4051312696   4233617466   182304770      2.74266
3. Write    2289595551   2474174001   184578450      2.70888
3. Kernel   3323830854   4051289376   727458522      0.687324
3. Read     4233661025   4415740370   182079345      2.74606
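To put a rough number on what overlapping would buy, the measured durations can be fed into the five-step schedule described above. This is only a back-of-the-envelope sketch in Python: it assumes each task would take the same time when run concurrently as it did when run alone, which may not hold if the transfers and kernels compete for PCIe or memory bandwidth. The variable names are illustrative.

```python
# Durations in nanoseconds, copied from the profiler table in this post.
durations_ns = {
    "write1": 391_938_194, "kernel1": 1_609_404_676, "read1": 294_719_723,
    "write2": 287_976_422, "kernel2": 554_840_421, "read2": 182_304_770,
    "write3": 184_578_450, "kernel3": 727_458_522, "read3": 182_079_345,
}

# The five steps of the intended schedule; tasks within a step should overlap.
steps = [
    ["write1"],
    ["write2", "kernel1"],
    ["read1", "kernel2", "write3"],
    ["read2", "kernel3"],
    ["read3"],
]

# Measured wall time: end timestamp of the last task (3. Read) in the table.
measured_ns = 4_415_740_370

# Fully serialized run without gaps: every duration simply adds up.
serial_ns = sum(durations_ns.values())

# Ideal overlap (assumption: durations unchanged under concurrency):
# each step takes as long as its slowest task.
ideal_ns = sum(max(durations_ns[task] for task in step) for step in steps)

print(f"measured wall time : {measured_ns / 1e9:.3f} s")
print(f"sum of durations   : {serial_ns / 1e9:.3f} s")
print(f"ideal with overlap : {ideal_ns / 1e9:.3f} s")
```

Under that assumption the schedule would finish in about 3.47 s instead of the measured 4.42 s. The sum of the individual durations is almost exactly the measured wall time, which is what a fully serialized run looks like.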

Do you know how to overlap the tasks?

Thank you very much.

HRZ · ‎01-17-2019

Indeed, everything seems to be serialized, even though your code looks correct to me. I am not sure about the implementation of the "checkError" function since it is in a header. Try commenting out the error checking; maybe it has some serializing effect. I have personally managed to parallelize execution of two different kernels on one FPGA using separate queues; I would assume it should also work for two buffer operations unless there is some limitation in Altera/Intel's run-time (e.g. not allowing two simultaneous DMA operations).
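The serialization can be checked mechanically from the start and end timestamps posted earlier in the thread. The following Python sketch tests every pair of tasks for intersecting time intervals and prints the order in which the tasks actually started; the function name `overlapping_pairs` is just an illustrative helper.

```python
# (name, start_ns, end_ns) copied from the profiler table earlier in the thread.
tasks = [
    ("1. Write", 0, 391_938_194),
    ("1. Kernel", 392_105_739, 2_001_510_415),
    ("1. Read", 2_474_213_273, 2_768_932_996),
    ("2. Write", 2_001_533_350, 2_289_509_772),
    ("2. Kernel", 2_768_969_091, 3_323_809_512),
    ("2. Read", 4_051_312_696, 4_233_617_466),
    ("3. Write", 2_289_595_551, 2_474_174_001),
    ("3. Kernel", 3_323_830_854, 4_051_289_376),
    ("3. Read", 4_233_661_025, 4_415_740_370),
]


def overlapping_pairs(intervals):
    """Return every pair of tasks whose [start, end) intervals intersect."""
    pairs = []
    for i, (name_a, start_a, end_a) in enumerate(intervals):
        for name_b, start_b, end_b in intervals[i + 1:]:
            if start_a < end_b and start_b < end_a:
                pairs.append((name_a, name_b))
    return pairs


overlaps = overlapping_pairs(tasks)
order = [name for name, start, end in sorted(tasks, key=lambda t: t[1])]

print("overlapping pairs:", overlaps)
print("start order:", " -> ".join(order))
```

With the posted numbers no pair overlaps: each task starts only after the previously started one has finished, regardless of which queue it was submitted to.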

HRZ · ‎12-28-2018

Indeed, you can read from and write to two different buffers simultaneously. You should create two queues, one for each operation, and set the third parameter in clEnqueueRead/WriteBuffer to CL_FALSE so that the operations are performed in a non-blocking fashion. You can then use OpenCL events to synchronize the operations. Also, this topic should probably be in the “FPGA Design Tools” section.

MuhammadAr_U_Intel · ‎01-09-2019

Hi @MN.

I am looking at this thread now. May I know which version of the compiler you are using?

Also, I understand you are examining the execution by analyzing the OpenCL profiler results. Is this correct?

Thanks,

Arslan

MN_ · ‎01-17-2019

Hello,

I'm using

Intel(R) FPGA SDK for OpenCL(TM), 64-Bit Offline Compiler

Version 18.1.0 Build 625 Standard Edition

and

g++ (GCC) 5.4.0

on CentOS 6.10.

Yes, I'm using the OpenCL profiler to analyze the result.

Thanks MN

MN_ · ‎01-17-2019

While working with my FPGA board (Terasic DE5-Net with a Stratix V FPGA from 2012), a few more questions have come up:

I want to implement an algorithm for bioinformatics. According to our theoretical analysis, the transfer rate from the host to the board via PCIe will be the bottleneck, so I need to stream into the FPGA at the peak performance of PCIe. Furthermore, I have to execute several kernels in parallel. Is it possible that my board is simply too old and the available OpenCL implementation cannot satisfy this demand? If my board is too old, could you recommend one that is more appropriate for me?

Furthermore, I'm wondering whether you have a redbook for programming this board with OpenCL and full documentation of the exact OpenCL implementation.

Because the OpenCL implementation for my board only supports PCIe Gen2, I also took a look at Intel HLS, because PCIe Gen3 is available there. Do you have full documentation for it? I didn't find any.

JBorr6 · ‎03-03-2021

Hello MN,

I am having the same problem as you. Were you able to solve it?

JBorr6

HRZ · ‎01-18-2019

If you are sure your bottleneck is going to be the PCI-E transfer, there is no point in accelerating your application on a PCI-E-attached accelerator, be it FPGA, GPU or anything else. Running it on a CPU could be the best solution since the PCI-E transfer will be avoided. Furthermore, all OpenCL-capable Stratix V and Arria 10 boards that I know of are limited to 8x PCI-E while you can at least get 16x PCI-E on nearly all GPUs from the past few years which means they will be a better option for you.

The reason why you cannot run your kernels simultaneously likely has very little to do with the board you are using. There is either something in your host code preventing your kernels from running in parallel or there is some limitation in Altera/Intel's OpenCL run-time which is board-independent. As I mentioned in my previous reply, I have personally run kernels in parallel successfully on the same board. You can find the design here (v8 kernel):

https://github.com/fpga-opencl-benchmarks/rodinia_fpga/tree/35b061f6b9c976dc44f86d6c2bd007c756c64349/opencl/lud/ocl

Are you sure the OpenCL implementation of your board only supports PCI-E Gen 2.0? My Stratix V board is installed on a machine that only supports Gen 2.0 and hence, it has to run at Gen 2.0, but my Arria 10 board runs at Gen 3.0 on a newer motherboard without any issue. Maybe your motherboard doesn't support Gen 3.0?
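For reference, the theoretical peak bandwidth of the link configurations mentioned in this thread can be worked out from the per-lane signalling rate and line encoding of each PCI-E generation. The Python sketch below gives raw one-direction link bandwidth only; it ignores packet and protocol overhead, so achievable throughput will be lower. The helper name `peak_gb_per_s` is illustrative.

```python
# Per-lane signalling rate (GT/s) and line-encoding efficiency per generation.
GENERATIONS = {
    "Gen 2.0": (5.0, 8 / 10),     # 5 GT/s with 8b/10b encoding
    "Gen 3.0": (8.0, 128 / 130),  # 8 GT/s with 128b/130b encoding
}


def peak_gb_per_s(generation, lanes):
    """Theoretical one-direction link bandwidth in GB/s (1 GB = 1e9 bytes)."""
    gt_per_s, efficiency = GENERATIONS[generation]
    return gt_per_s * efficiency * lanes / 8  # 8 bits per byte


for generation, lanes in [("Gen 2.0", 8), ("Gen 3.0", 8), ("Gen 3.0", 16)]:
    rate = peak_gb_per_s(generation, lanes)
    print(f"{generation} x{lanes}: {rate:.2f} GB/s per direction")
```

This works out to roughly 4 GB/s for Gen 2.0 x8, 7.9 GB/s for Gen 3.0 x8 and 15.8 GB/s for Gen 3.0 x16. The 1.3 to 2.7 GB/s transfer rates reported earlier in the thread are below even the Gen 2.0 x8 figure.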

Terasic's documentation for the DE5-Net board is here:

https://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&CategoryNo=158&No=526&PartNo=4

Intel FPGA SDK for OpenCL's documents are here:

https://www.intel.com/content/www/us/en/programmable/products/design-software/embedded-software-developers/opencl/support.html

Intel HLS documents are here:

https://www.intel.com/content/www/us/en/programmable/products/design-software/high-level-design/intel-hls-compiler/support.html

These links include all the official documents available.

MN_ · ‎01-21-2019

@NYusof

Do you know if Intel plans to support PCIe Gen3 x16 with OpenCL in the future, and if so, when do you think it will be available?

Thanks

MN

HRZ · ‎01-21-2019

The currently-available Stratix 10 boards physically support 16x PCI-E Gen 3.0. Not sure what the OpenCL driver supports, though.

Nooraini_Y_Intel · ‎01-22-2019

Hi @MUsman

Can you help to address this question from @MN. ?

"Do you know if Intel plans to support PCIe Gen3 x16 with OpenCL in the future, and if so, when do you think it will be available?"

Thank you,

Regards,

Nooraini

MuhammadAr_U_Intel · ‎01-28-2019

Looking at the Stratix 10 devkit initialization guide provided with the latest version of the compiler (18.1), I can see that the "aocl diagnose" result shows Gen3 x8.

We don't have information on when PCIe Gen3 x16 will be supported.

Thanks,

Arslan