Hi, I found that when I enqueue-write very large data (millions of values) to the kernel, it takes around 200 ms, which significantly degrades the performance of my application. So I plan to apply either shared memory or fixed point to improve this. I wonder which is the better approach? Or do you have any better suggestion regarding this issue? Thank you!!
Are you using an SoC or a PCI-E-attached FPGA board? If you are using an SoC, shared memory is definitely the way to go. However, if you are using a PCI-E-attached board and a mere 200-ms host-to-device transfer time is significant compared to the actual computation time on your dataset, your application is communication-bound and there is probably no point in accelerating it on an FPGA (or any other PCI-E-attached device); in that case you would be better off avoiding the communication altogether and just computing on the host CPU.
Thank you for your explanation! Indeed, I am using an SoC. The documentation has only a very brief explanation of shared memory, so I wonder how shared memory is actually implemented on the FPGA and how fast it is compared to clEnqueueWriteBuffer? Thank you
On SoCs there is only one physical instance of DDR memory, which is shared between the ARM processor and the FPGA. With shared memory on SoCs, you just malloc the data and then create the OpenCL buffer using that host pointer, whereas with clEnqueueWriteBuffer the data is copied one more time within the DDR memory. Using shared memory avoids that extra copy. "Intel FPGA SDK for OpenCL Programming Guide, Section 6.7" explains how to allocate shared memory on SoC boards.