Altera_Forum
Honored Contributor I
1,234 Views

How OpenCL synthesizes hardware on FPGA

Hi, 

I have some questions about how the OpenCL compiler synthesizes hardware on the FPGA. For both single work-item and NDRange kernels, how is the hardware for the "vector_add" example (available at https://www.altera.com/support/support-resources/design-examples/design-software/opencl/vector-addit...) realized on the FPGA? In that example, the NDRange kernel performs one million additions, and I would like to know what hardware is generated for it. If I use a single work-item kernel instead, how does the hardware change compared with the NDRange case? Thanks for your help.

 

Marco Montini
4 Replies
Altera_Forum
Honored Contributor I
36 Views

Please take a look at "Intel FPGA SDK for OpenCL Best Practices Guide, Section 1.3" and the existing discussions on the forum: 

 

http://www.alteraforum.com/forum/showthread.php?t=56121 

http://www.alteraforum.com/forum/showthread.php?t=57296 

http://www.alteraforum.com/forum/showthread.php?t=57273 

 

If you still have questions, I will try to help.
Altera_Forum
Honored Contributor I
36 Views

I read all the threads you linked, as well as Section 1.3 of the Intel FPGA SDK for OpenCL Best Practices Guide, but I still have some doubts. If I use a single work-item kernel to do a vector add, as in the example, I know that the loop iterations are pipelined, but how can I find out what hardware is actually synthesized inside the FPGA? From the figures in the Best Practices Guide, it seems there is just one adder, two registers for loading, and one for storing. Is that the real hardware created inside the FPGA? If so, does the data for the operations have to be fetched by accessing DDR N times (for global variables)? Thanks

Altera_Forum
Honored Contributor I
36 Views

In the specific case of vector-add, whether the kernel is NDRange or single work-item, the compiler will create one adder and three ports to global memory (two reads and one write), plus some buffers between global memory and the kernel to absorb possible stalls and some registers to allow pipelining. In this case, 2N values will be read from global memory and N values will be written, with three values moving per clock (two read, one written). With only one addition completing per clock, performance will be poor; hence, SIMD (for NDRange kernels) and unrolling (for single work-item kernels) can be used to increase the number of adders that are synthesized and to widen the ports to memory, so that more data is loaded and added per clock.
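
To make the effect of widening concrete, here is a minimal sketch in plain C (not OpenCL kernel code, and not the code from the design example). It treats each trip of the outer loop as one idealized clock and ignores pipeline latency, stalls, and memory bandwidth limits. The function names and the factor WIDTH are hypothetical, chosen only for illustration. In the real SDK, the widening would be requested with the num_simd_work_items kernel attribute (NDRange) or #pragma unroll (single work-item) rather than written by hand.

```c
#include <stddef.h>

#define WIDTH 4 /* hypothetical SIMD / unroll factor, for illustration only */

/* Baseline: one addition per loop trip, like the single-adder pipeline.
   Returns the number of trips, an idealized stand-in for clock cycles. */
size_t vector_add_scalar(const float *x, const float *y, float *z, size_t n)
{
    size_t trips = 0;
    for (size_t i = 0; i < n; ++i) {
        z[i] = x[i] + y[i]; /* two reads, one add, one write */
        ++trips;
    }
    return trips;
}

/* Widened: each trip performs WIDTH independent additions, modelling
   WIDTH adders fed by memory ports that are WIDTH times wider. */
size_t vector_add_unrolled(const float *x, const float *y, float *z, size_t n)
{
    size_t trips = 0;
    size_t i = 0;
    for (; i + WIDTH <= n; i += WIDTH) {
        for (size_t k = 0; k < WIDTH; ++k)
            z[i + k] = x[i + k] + y[i + k];
        ++trips;
    }
    /* Leftover elements when n is not a multiple of WIDTH. */
    for (; i < n; ++i) {
        z[i] = x[i] + y[i];
        ++trips;
    }
    return trips;
}
```

Under this idealized model, N = 1,000,000 elements take about 1,000,000 trips in the baseline and 250,000 with WIDTH = 4. The total memory traffic (2N reads, N writes) is unchanged; only the amount moved per clock grows.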

Altera_Forum
Honored Contributor I
36 Views

Thanks for your help
