Porting GPU optimized OpenCL code to Altera FPGAs

Altera_Forum · 10-31-2015

Hi,

I'll soon be porting stereo-vision OpenCL code that was written and optimized for AMD GPUs. I don't have the board yet, but it will arrive in a few days. Does anyone have experience with porting GPU-based code? I have the following basic questions:

- Does generic OpenCL code "just work" on these FPGAs or are minor changes necessary, ignoring performance?

- Are major changes necessary in order to efficiently use the FPGA?

- What issues can I expect with delays because of real-time FPGA reconfiguration?

I know these questions are fairly generic. I have zero prior experience with FPGAs, but lots of experience with OpenCL on GPUs and even CPUs. I would very much appreciate any help.

Altera_Forum · 11-02-2015

For a start, it depends on whose FPGA you will use. With an Altera board there is a good chance that the code will work "as is" without changes. With Xilinx this is highly unlikely, because SDAccel needs directives added to the code. You are at least starting with the right hardware.

The configuration delays are not normally excessive, but it depends on how many kernels you have. The recommendation for FPGAs is to structure the kernel as a single work-item. This is somewhat different from the code you are used to writing, which is likely to be vector-based to get the best out of the GPU. With an FPGA the resources are flexible, so using pipes will help you to create a single work-item kernel (if possible), or at least to minimise the transfers from host to kernel and back.
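To make the "single work-item" structure concrete, here is a minimal sketch in plain C rather than OpenCL C, so it can be compiled anywhere. The function names and the scaling operation are made up for illustration. The first pair of functions mimics the GPU-style NDRange pattern, where the body runs once per work-item, and the last function mimics a single work-item kernel, where one invocation owns the whole loop. As far as I understand it, the loop form is what gives the FPGA compiler a chance to pipeline iterations, but treat that as an assumption to check against the Altera documentation.

```c
#include <stddef.h>

/* NDRange style: the body runs once per work-item.
 * `gid` stands in for get_global_id(0) in an OpenCL kernel. */
static void scale_ndrange_item(const float *in, float *out, float k, size_t gid)
{
    out[gid] = k * in[gid];
}

/* Host-side stand-in for launching the NDRange over n work-items.
 * On a GPU these invocations would run in parallel. */
void scale_ndrange(const float *in, float *out, float k, size_t n)
{
    for (size_t gid = 0; gid < n; ++gid)
        scale_ndrange_item(in, out, k, gid);
}

/* Single work-item style: one kernel invocation contains the loop
 * over all elements, instead of one invocation per element. */
void scale_single_work_item(const float *in, float *out, float k, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = k * in[i];
}
```

Both versions compute the same result; only the place where the iteration lives changes, which is why porting often means moving the indexing from `get_global_id` into an explicit loop.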

Altera_Forum · 11-02-2015

Thanks a lot for the reply. I was promised an Altera board with OpenCL support, although I don't know which one. I'll report back when I know more. If the code just works immediately, even if it's not that fast, I'd be really impressed.

Thanks for the tip about using a single work group. Makes sense. I imagine you can use just about any work group size on an FPGA as long as there is space.

About the configuration delays: will the FPGA be reconfigured with every enqueue kernel call? If not, how can the driver know how large the needed work group is? About the "not normally excessive" delays: can you reconfigure something like 10 times a second? Once a second? I'd just like to know an order of magnitude here, as I'm sure it depends a lot on the kernel complexity. Anyway: kudos to Altera for saving me from VHDL.

Altera_Forum · 11-02-2015

The way it works is that you specify one or more kernels and compile them into an aocx file, and you can have one or more aocx files. A single aocx file may contain several kernels; these exist in parallel on the device and can be used in parallel. If you only have one aocx file, the FPGA is configured just once. You should try to put all your kernels into one aocx file where possible.

If you have multiple aocx files, the delay occurs when you use a kernel from one aocx file and then one from another, because this requires reconfiguration. This is likely to happen if you have large kernels which won't fit into one device simultaneously, so that you have to decide manually which kernels go into each .cl file in order to create the aocx files. The worst-case scenario is having to swap constantly between kernels from different aocx files.
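The effect of launch order on reconfiguration can be sketched with a small counting model in C. This is not the Altera runtime API: the function name, the kernel-to-image table, and the launch list are all hypothetical, and the model simply assumes the device is reloaded whenever the next kernel lives in a different aocx file than the one currently configured.

```c
#include <assert.h>
#include <stddef.h>

/* image_of_kernel[k]: index of the aocx file holding kernel k (hypothetical table).
 * launches[i]:        index of the kernel launched at step i.
 * Returns how many times the FPGA has to be configured, counting the
 * initial load, under the assumption that a reload happens whenever the
 * required image differs from the one currently on the device. */
size_t count_reconfigurations(const int *image_of_kernel,
                              const int *launches, size_t n_launches)
{
    size_t count = 0;
    int loaded = -1; /* -1: nothing configured yet */

    for (size_t i = 0; i < n_launches; ++i) {
        int needed = image_of_kernel[launches[i]];
        assert(needed >= 0); /* image indices must be non-negative */
        if (needed != loaded) {
            loaded = needed;
            ++count;
        }
    }
    return count;
}
```

With kernels 0 and 1 in one image and kernel 2 in another, the launch order 0, 1, 2, 0, 1, 2 needs four configurations in this model, while 0, 1, 0, 1, 2, 2 needs only two. Batching launches per aocx file is therefore worth considering when everything cannot fit into a single image.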

I don't know what the reconfiguration delay is, and it will depend on the size of the FPGA. I would hazard a guess at a few tens of milliseconds.
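To put that guess next to the "10 times a second" question, here is a back-of-the-envelope helper in C. The function name is made up, and the delay you pass in is a placeholder, not a measured value for any board.

```c
/* Fraction of each second spent reconfiguring, given an assumed
 * per-reconfiguration delay in milliseconds and the number of
 * image swaps per second. Both inputs are guesses, not measurements. */
double reconfig_time_fraction(double reconfig_ms, double swaps_per_second)
{
    double busy_ms_per_second = reconfig_ms * swaps_per_second;
    return busy_ms_per_second / 1000.0;
}
```

For example, if a swap took 30 ms (an assumed figure within the guessed range), ten swaps per second would consume about 30% of the available time, and one swap per second about 3%. The real number has to be measured on the actual board.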