How OpenCL synthesizes hardware on FPGA

Altera_Forum · 01-09-2018

Hi,

I have some questions about how OpenCL kernels are synthesized into hardware on an FPGA. Taking the "vector_add" example (available at https://www.altera.com/support/support-resources/design-examples/design-software/opencl/vector-addition.html), how is the hardware realized on the FPGA for single work-item and for NDRange kernels? In that example the kernel (NDRange mode) performs one million additions, and I would like to know what hardware is actually built for it. If I use a single work-item kernel instead of an NDRange kernel, how does the hardware change with respect to the NDRange case? Thanks for your help.

Marco Montini

Altera_Forum · 01-10-2018

Please take a look at "Intel FPGA SDK for OpenCL Best Practices Guide, Section 1.3" and the existing discussions on the forum:

http://www.alteraforum.com/forum/showthread.php?t=56121

http://www.alteraforum.com/forum/showthread.php?t=57296

http://www.alteraforum.com/forum/showthread.php?t=57273

If you still have questions, I will try to help.

Altera_Forum · 01-10-2018

I read all the threads you linked, and I also read Section 1.3 of the Intel FPGA SDK for OpenCL Best Practices Guide, but I still have some questions. If I use a single work-item kernel to do a vector add, as in the example, I know that the loop iterations are pipelined, but how can I find out what hardware is synthesized inside the FPGA? Judging from the figures in the Best Practices Guide, there seems to be just one adder, two registers for loading, and one for storing. Is that the real hardware created inside the FPGA? If so, I assume the data for the operations is fetched by accessing DDR N times (for global variables). Thanks

Altera_Forum · 01-10-2018

In the specific case of vector-add, whether the kernel is NDRange or single work-item, the compiler will create one adder and three ports to global memory (two reads and one write), plus some buffers between global memory and the kernel to absorb possible stalls, and some registers to allow pipelining. In this case, 2N values will be read from global memory and N values will be written, with three values being transferred per clock (two read and one written). This will obviously result in poor performance, since only one addition is performed per clock. Hence, SIMD (for NDRange kernels) and loop unrolling (for single work-item kernels) can be used to increase the number of adders that are synthesized and to widen the ports to memory, so that more data is loaded and added per clock.
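
To make this concrete, here is a minimal sketch of the two kernel styles for vector-add. It is an illustration, not the code shipped with the design example: the kernel names, the unroll factor of 4, and the work-group size of 64 are arbitrary assumptions, and everything inside the `#ifndef __OPENCL_VERSION__` blocks (the empty `__kernel`/`__global` macros, `emulated_gid`, the stub `get_global_id`, and `run_ndrange_on_host`) is a hypothetical shim that only exists so the same file can be compiled and checked as plain C on a host.

```c
/* Sketch only. The kernels are OpenCL C; the shims in the #ifndef blocks are
   hypothetical helpers so the same file also builds as plain C on a host. */
#ifndef __OPENCL_VERSION__
#include <assert.h>
#include <stddef.h>
#define __kernel
#define __global
static size_t emulated_gid; /* stand-in for the current work-item index */
static size_t get_global_id(unsigned dim) { (void)dim; return emulated_gid; }
#endif

/* NDRange form: one work-item per element, launched with a global size of N.
   As written this gives one adder, two read ports and one write port.
   To ask the offline compiler for 4 SIMD lanes, the kernel would be prefixed
   with (work-group size 64 is an arbitrary choice here):
     __attribute__((reqd_work_group_size(64, 1, 1)))
     __attribute__((num_simd_work_items(4)))                                */
__kernel void vector_add_ndrange(__global const float *restrict x,
                                 __global const float *restrict y,
                                 __global float *restrict z)
{
    size_t i = get_global_id(0);
    z[i] = x[i] + y[i];
}

/* Single work-item form: one pipelined loop over all N elements.
   Without the pragma this is again one adder and three memory ports.
   "#pragma unroll 4" asks for four copies of the loop body, i.e. four
   adders and wider accesses to global memory. */
__kernel void vector_add_swi(__global const float *restrict x,
                             __global const float *restrict y,
                             __global float *restrict z,
                             int n)
{
    #pragma unroll 4
    for (int i = 0; i < n; i++) {
        z[i] = x[i] + y[i];
    }
}

#ifndef __OPENCL_VERSION__
/* Host-only emulation of an NDRange launch: runs the kernel body once per
   work-item, sequentially. Real hardware pipelines the work-items instead. */
static void run_ndrange_on_host(size_t n, const float *x, const float *y,
                                float *z)
{
    assert(x && y && z);
    for (emulated_gid = 0; emulated_gid < n; emulated_gid++) {
        vector_add_ndrange(x, y, z);
    }
}
#endif
```

Both forms compute the same result; what changes with SIMD or unrolling is how many additions and memory transfers are issued per clock. The area report generated by the offline compiler is the place to confirm what was actually synthesized for a given kernel.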

Altera_Forum · 01-10-2018

Thanks for your help