Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)

EnqueueWrite Very Large Data

Altera_Forum
Honored Contributor II

Hi, I found that when I use clEnqueueWriteBuffer to send a very large data set (millions of values) to the kernel, the write takes around 200 ms, which significantly degrades the performance of my application.

So I plan to use either shared memory or fixed-point data to improve this.

But I wonder which one is the better approach? Or do you have a better suggestion for this issue? Thank you!
Altera_Forum
Honored Contributor II

Are you using an SoC or a PCI-E-attached FPGA board? If you are using an SoC, shared memory is definitely the way to go. However, if you are using a PCI-E-attached board and a mere 200-ms host-to-device transfer time is significant compared to the actual computation time of your dataset, your application is communication-bound and there is probably no point in accelerating it on an FPGA (or any other PCI-E-attached device); you would be better off avoiding the communication altogether and just computing on the host CPU in this case.
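To make the "communication-bound" check concrete, here is a rough sketch in C. It assumes the transfer and the kernel do not overlap, and the 50 ms compute time in the usage note is a made-up number, not something from this thread; the helper names are hypothetical as well.

```c
/* Fraction of the total run time spent on the host-to-device transfer.
   Both arguments are in milliseconds; assumes transfer and compute
   run back to back with no overlap. Returns 0 for an empty run. */
static double transfer_fraction(double transfer_ms, double compute_ms)
{
    double total = transfer_ms + compute_ms;
    return total > 0.0 ? transfer_ms / total : 0.0;
}

/* Upper bound on the overall speedup if only the compute part gets
   faster and the transfer stays as it is (Amdahl's law with the
   compute time going to zero). Requires transfer_ms > 0. */
static double max_speedup(double transfer_ms, double compute_ms)
{
    return (transfer_ms + compute_ms) / transfer_ms;
}
```

For example, with the 200 ms transfer from the question and an assumed 50 ms of computation, transfer_fraction(200.0, 50.0) is 0.8 and max_speedup(200.0, 50.0) is 1.25, so no amount of kernel tuning could give more than a 1.25x overall gain unless the transfer itself is removed or shortened.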

Altera_Forum
Honored Contributor II

Thank you for your explanation! Indeed, I am using an SoC, but the documentation explains shared memory only very briefly.

So I wonder how shared memory is actually implemented on the FPGA, and how fast it is compared to clEnqueueWriteBuffer?

Thank you
Altera_Forum
Honored Contributor II

On SoCs there is only one physical instance of DDR memory, which is shared between the ARM processor and the FPGA. With shared memory on SoCs, you allocate the OpenCL buffer so that it is host-accessible (as far as I recall, the guide does this through the OpenCL runtime rather than a plain malloc, since the FPGA needs physically contiguous memory) and fill it in place through a host pointer, while with clEnqueueWriteBuffer, the data is copied one more time to the DDR memory. Using shared memory avoids the extra copy. "Intel FPGA SDK for OpenCL Programming Guide, Section 6.7" explains how to allocate shared memory on SoC boards.
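A minimal host-side sketch of the shared-memory approach, using only standard OpenCL calls. The helper name map_shared_input is hypothetical, and the choice of CL_MEM_ALLOC_HOST_PTR plus clEnqueueMapBuffer is my recollection of what the Programming Guide shows for SoC boards, so please check it against the guide for your SDK version. It needs the OpenCL headers and runtime of your board to build.

```c
#include <stddef.h>
#include <CL/cl.h>

/* Hypothetical helper: asks the OpenCL runtime for an n-float input
   buffer that the host can write in place, maps it, and returns the
   host pointer. Returns NULL on failure. On success *buf_out holds
   the buffer to pass to clSetKernelArg. */
static float *map_shared_input(cl_context ctx, cl_command_queue queue,
                               size_t n, cl_mem *buf_out)
{
    cl_int err = CL_SUCCESS;
    size_t bytes = n * sizeof(float);

    /* Let the runtime allocate host-accessible memory for the buffer
       instead of wrapping a pointer obtained from malloc. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                bytes, NULL, &err);
    if (err != CL_SUCCESS)
        return NULL;

    /* Blocking map: when this returns, the pointer is valid for writing. */
    float *ptr = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                             0, bytes, 0, NULL, NULL, &err);
    if (err != CL_SUCCESS) {
        clReleaseMemObject(buf);
        return NULL;
    }

    *buf_out = buf;
    return ptr;
}
```

The host then writes its values straight into the returned pointer, with no clEnqueueWriteBuffer call, and later releases the mapping with clEnqueueUnmapMemObject and the buffer with clReleaseMemObject. When exactly to unmap relative to the kernel launch is something I would take from the guide's example rather than from this sketch.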
