
Can I read and write buffers simultaneously with OpenCL?

MN_
Beginner

Hello, I am using the Terasic DE5-Net FPGA board with a Stratix V FPGA. It is connected to the host via PCIe.

To obtain high throughput, I want to read and write buffers simultaneously, but unfortunately OpenCL executes the commands consecutively. Is it possible to read and write simultaneously at all, and if so, how can I do it?

 

I have one device and create one context for it. Furthermore, I create one queue for the read command (clEnqueueReadBuffer) and another for the write command (clEnqueueWriteBuffer). Both calls are non-blocking (blocking flag set to CL_FALSE).

16 Replies
Nooraini_Y_Intel
Employee

​Hi,

 

I am currently reviewing the forum for open questions and found this thread. I apologize that no one seems to have answered your question. Since it has been a while since you posted it, I'm wondering whether you have found the answer. If not, please let me know and I will try to find someone to assist you. Please expect some delay in the response, as most of our agents are out of the office for the year-end holidays. Thank you.

 

Regards,

Nooraini

MN_
Beginner

Hi,

 

Thank you very much for your answer.

Unfortunately, I have not found an answer to my problem. I already use two different queues and perform the read/write operations in a non-blocking fashion, as HRZ recommends. But when I analyze the execution times, I can see that all commands were executed consecutively.

 

I also asked Terasic support, but they haven't answered yet.

 

So I would really appreciate it if you could help me.

 

Regards

MN.

Nooraini_Y_Intel
Employee

Hi MN.,

 

Noted. I will need some time to find someone to assist you. Please expect some delay in the response from the assigned agent.

Thank you.

 

Regards,

Nooraini

HRZ
Valued Contributor III

@MN. Can you post pseudocode or, better yet, all of your host code so that we can take a look at it and give you a more concrete solution?

MN_
Beginner

Hi,

 

Here is my host program. It calls a kernel which does nothing.

The program creates two buffers for writing and two for reading. Then it performs the following tasks:

  1. Write the first buffer
  2. a) Write the second buffer, b) execute the kernel on the first buffer
  3. a) Read the result of the first kernel execution, b) execute the kernel on the second buffer, c) write new data into the first buffer.
  4. a) Read the result of the second kernel execution, b) execute the kernel on the first buffer
  5. a) Read the result of the third kernel execution

The tasks in each line should be performed simultaneously. But the following times show that the tasks were performed consecutively:

 

Task            Start [ns]      End [ns]  Duration [ns]  Transfer rate [GB/s]
1. Write:                0     391938194      391938194               1.27571
1. Kernel:       392105739    2001510415     1609404676              0.310674
1. Read:        2474213273    2768932996      294719723               1.69653
2. Write:       2001533350    2289509772      287976422               1.73625
2. Kernel:      2768969091    3323809512      554840421               0.90116
2. Read:        4051312696    4233617466      182304770               2.74266
3. Write:       2289595551    2474174001      184578450               2.70888
3. Kernel:      3323830854    4051289376      727458522              0.687324
3. Read:        4233661025    4415740370      182079345               2.74606
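
To make the serialization explicit, here is a small Python sketch that checks the timestamps above for overlapping intervals. It also estimates how long the run would take if the tasks in each of the five steps overlapped perfectly. The variable and function names are only illustrative, and the "ideal" figure assumes that each step lasts exactly as long as its slowest task, which is a simplification.

```python
from itertools import combinations

# (label, start_ns, end_ns), copied from the profiler times in this post.
events = [
    ("1. Write", 0, 391938194),
    ("1. Kernel", 392105739, 2001510415),
    ("1. Read", 2474213273, 2768932996),
    ("2. Write", 2001533350, 2289509772),
    ("2. Kernel", 2768969091, 3323809512),
    ("2. Read", 4051312696, 4233617466),
    ("3. Write", 2289595551, 2474174001),
    ("3. Kernel", 3323830854, 4051289376),
    ("3. Read", 4233661025, 4415740370),
]

def overlaps(a, b):
    """True if the intervals [start, end) of events a and b intersect."""
    return a[1] < b[2] and b[1] < a[2]

# Every pair of commands whose execution intervals intersect.
overlapping_pairs = [
    (a[0], b[0]) for a, b in combinations(events, 2) if overlaps(a, b)
]

# Total time spent in commands versus wall-clock span of the whole run.
busy_ns = sum(end - start for _, start, end in events)
span_ns = max(e[2] for e in events) - min(e[1] for e in events)

# Assumed ideal schedule: the tasks of one step run concurrently, so each
# step costs as much as its slowest task.
duration = {label: end - start for label, start, end in events}
steps = [
    ["1. Write"],
    ["2. Write", "1. Kernel"],
    ["1. Read", "2. Kernel", "3. Write"],
    ["2. Read", "3. Kernel"],
    ["3. Read"],
]
ideal_ns = sum(max(duration[task] for task in step) for step in steps)

print("overlapping pairs:", overlapping_pairs)
print(f"busy/span = {busy_ns / span_ns:.4f}, ideal/span = {ideal_ns / span_ns:.2f}")
```

With these timestamps the list of overlapping pairs is empty, and the commands fill almost the entire span back to back. Under the stated assumption, a perfectly overlapped schedule would need roughly 78% of the measured time.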

 

Do you know how to overlap the tasks?

Thank you very much.

 

 

 

 

HRZ
Valued Contributor III

Indeed, it seems everything is serialized, even though your code looks correct to me. I am not sure about the implementation of the "checkError" function since it is in a header. Try commenting out the error checking; it may have some serializing effect. I have personally managed to parallelize execution of two different kernels on one FPGA using separate queues; I would assume it should also work for two buffer operations unless there is some limitation in Altera/Intel's run-time (e.g. not allowing two simultaneous DMA operations).

HRZ
Valued Contributor III

Indeed, you can read from and write to two different buffers simultaneously. You should create two queues, one for each operation, and set the third parameter of clEnqueueReadBuffer/clEnqueueWriteBuffer to CL_FALSE so that the operations are performed in a non-blocking fashion. You can then use OpenCL events to synchronize the operations. Also, this topic should probably be in the “FPGA Design Tools” section.
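
As a rough illustration of the pattern only, here is a plain-Python analogy. It is not OpenCL code and says nothing about what the Altera/Intel run-time actually does. Each single-worker executor stands in for one in-order command queue, `submit()` stands in for a non-blocking enqueue, and the returned future stands in for an event. All names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins: one single-worker executor per "command queue".
write_queue = ThreadPoolExecutor(max_workers=1)
read_queue = ThreadPoolExecutor(max_workers=1)

# Both transfers must be in flight at the same time to pass this barrier.
# If the two "queues" were serialized, wait() would time out and raise.
both_in_flight = threading.Barrier(2, timeout=5.0)

def transfer(src, dst):
    """Copy src into dst once the other transfer is running as well."""
    both_in_flight.wait()
    dst[:] = src
    return len(src)

host_in = bytearray(b"new input")     # data to send to the "device"
dev_in = bytearray(len(host_in))      # "device" input buffer
dev_out = bytearray(b"old result")    # "device" output buffer
host_out = bytearray(len(dev_out))    # host buffer for the result

# Non-blocking "enqueue": submit() returns immediately with a future.
write_event = write_queue.submit(transfer, host_in, dev_in)
read_event = read_queue.submit(transfer, dev_out, host_out)

# Synchronize on both "events" before touching the buffers again.
bytes_written = write_event.result()
bytes_read = read_event.result()

write_queue.shutdown()
read_queue.shutdown()
```

In real host code the same structure would use two command queues, a blocking flag of CL_FALSE, and an event wait before the buffers are reused. Whether the two transfers then overlap on the PCIe link depends on the run-time and the board.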

MuhammadAr_U_Intel

Hi @MN.​ 

 

I am looking at this thread now. May I know which version of the compiler you are using?

 

Also, I understand that you are analyzing the execution with the OpenCL profiler. Is this correct?

 

Thanks,

Arslan

MN_
Beginner

Hello,

 

I'm using

 

Intel(R) FPGA SDK for OpenCL(TM), 64-Bit Offline Compiler

Version 18.1.0 Build 625 Standard Edition

Copyright (C) 2018 Intel Corporation

 

and

 

g++ (GCC) 5.4.0

 

on CentOS 6.10.

 

Yes, I'm using the OpenCL profiler to analyze the result.

 

Thanks MN

 

MN_
Beginner

While working with my FPGA board (the Terasic DE5-Net with a Stratix V FPGA, from 2012), a few more questions came up:

 

I want to implement an algorithm for bioinformatics. According to our theoretical analysis, the transfer rate from the host to the board via PCIe will be the bottleneck, so I need to stream data into the FPGA at the peak performance of PCIe. Furthermore, I have to execute several kernels in parallel. Is it possible that my board is simply too old and the available OpenCL implementation cannot meet these demands? If my board is too old, could you recommend one that is more appropriate for me?

 

Furthermore, I'm wondering whether you have a redbook for programming this board with OpenCL and full documentation of the exact OpenCL implementation.

 

Because the OpenCL implementation for my board only supports PCIe Gen 2, I also took a look at Intel HLS, because PCIe Gen 3 is available there. Do you have full documentation for it? I didn't find any.

 

 

JBorr6
Novice

Hello MN,

I have the same concern as you. Were you able to solve this problem?

JBorr6

 

HRZ
Valued Contributor III

If you are sure your bottleneck is going to be the PCI-E transfer, there is no point in accelerating your application on a PCI-E-attached accelerator, be it FPGA, GPU or anything else. Running it on a CPU could be the best solution since the PCI-E transfer will be avoided. Furthermore, all OpenCL-capable Stratix V and Arria 10 boards that I know of are limited to 8x PCI-E while you can at least get 16x PCI-E on nearly all GPUs from the past few years which means they will be a better option for you.

 

The reason why you cannot run your kernels simultaneously likely has very little to do with the board you are using. There is either something in your host code preventing your kernels from running in parallel or there is some limitation in Altera/Intel's OpenCL run-time which is board-independent. As I mentioned in my previous reply, I have personally run kernels in parallel successfully on the same board. You can find the design here (v8 kernel):

 

https://github.com/fpga-opencl-benchmarks/rodinia_fpga/tree/35b061f6b9c976dc44f86d6c2bd007c756c64349/opencl/lud/ocl

 

Are you sure the OpenCL implementation of your board only supports PCI-E Gen 2.0? My Stratix V board is installed on a machine that only supports Gen 2.0 and hence, it has to run at Gen 2.0, but my Arria 10 board runs at Gen 3.0 on a newer motherboard without any issue. Maybe your motherboard doesn't support Gen 3.0?
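
For reference, the theoretical per-direction bandwidth of a PCI-E link can be estimated from the per-lane signalling rate and the line encoding. This is only a back-of-the-envelope sketch; the function name is made up for illustration, and real transfers will be slower because of packet and DMA overhead.

```python
# Per-lane signalling rate (GT/s) and line-code efficiency per generation.
PCIE_GENERATIONS = {
    2: (5.0, 8 / 10),      # Gen 2.0: 5 GT/s, 8b/10b encoding
    3: (8.0, 128 / 130),   # Gen 3.0: 8 GT/s, 128b/130b encoding
}

def pcie_peak_gb_per_s(generation, lanes):
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    rate_gt_s, efficiency = PCIE_GENERATIONS[generation]
    return rate_gt_s * efficiency * lanes / 8  # 8 bits per byte

for generation, lanes in [(2, 8), (3, 8), (3, 16)]:
    peak = pcie_peak_gb_per_s(generation, lanes)
    print(f"Gen {generation}.0 x{lanes}: {peak:.2f} GB/s")
```

This gives about 4 GB/s for Gen 2.0 x8, 7.88 GB/s for Gen 3.0 x8 and 15.75 GB/s for Gen 3.0 x16, so the 1.3 to 2.7 GB/s you measured is below even the Gen 2.0 x8 limit.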

 

Terasic's documentation for the DE5-Net board is here:

https://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&CategoryNo=158&No=526&PartNo=4

 

Intel FPGA SDK for OpenCL's documents are here:

https://www.intel.com/content/www/us/en/programmable/products/design-software/embedded-software-developers/opencl/support.html

 

Intel HLS documents are here:

https://www.intel.com/content/www/us/en/programmable/products/design-software/high-level-design/intel-hls-compiler/support.html

 

These links include all the official documents available.

MN_
Beginner

@NYusof

Do you know if Intel plans to support PCIe Gen 3 x16 with OpenCL in the future, and if so, when do you think it will be available?

 

Thanks

MN

 

HRZ
Valued Contributor III

The currently-available Stratix 10 boards physically support 16x PCI-E Gen 3.0. Not sure what the OpenCL driver supports, though.

Nooraini_Y_Intel
Employee

Hi @MUsman​ 

 

Can you help address this question from @MN.?

"Do you know if Intel plans to support PCIe Gen 3 x16 with OpenCL in the future, and if so, when do you think it will be available?"

 

Thank you,

 

Regards,

Nooraini

MuhammadAr_U_Intel

Looking at the Stratix 10 devkit initialization guide provided with the latest compiler version, 18.1, I can see that the "aocl diagnose" result shows Gen3x8.

We don't have information on when PCIe Gen3 x16 will be supported.

 

Thanks,

Arslan
