
Bad Performance of Host to OpenCL Memory Transfers

Altera_Forum

I'm working with the Socrates II SoC Cyclone V board and am just getting started. After trying a few basic examples I noticed very low performance for an integer division test. It was almost as slow as the ARM CPU. Taking a closer look, it turns out that the kernel actually executes very quickly, and it's the transfer from host memory buffers to OpenCL buffers that takes almost all the time. This is especially confusing, since the ARM cores and the FPGA share the same physical memory on the Socrates II. Here's some data: 

 

Host buffer to OpenCL buffer transfer rate: ~30MB/s 

Memory usage/throughput of the division kernel (may not be memory bound): ~1.5GB/s 

Nominal memory speed: 3.2GB/s 

 

Has anyone else encountered this problem? Secondly, is it possible to set up shared memory between the FPGA and the ARM cores? They share the same physical memory, but I don't know of any OpenCL feature for doing this. Maybe some ALTERA extension?

 

Thanks in advance for any help.
Altera_Forum

I did some more tests, here are the results: 

 

Memory read test kernel (Read array of integers and compute XOR): ~2.1GB/s 

Memory write test kernel (Write consecutive integers into an array): ~2.0GB/s 

Host memory test (memcpy): ~1.5GB/s 
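
For anyone who wants to reproduce a host-side memcpy figure like the one above, here is a minimal C sketch of how such a test could look. It is not the code behind the numbers in this post: the function name `memcpy_bandwidth_mb_s`, the best-of-N approach, and the use of POSIX `clock_gettime` are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Monotonic wall-clock time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical helper: copies `bytes` bytes `passes` times and returns the
 * best observed rate in MB/s (1 MB = 1e6 bytes), or -1.0 if allocation fails. */
double memcpy_bandwidth_mb_s(size_t bytes, int passes)
{
    assert(bytes > 0 && passes > 0);

    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch every page first so page faults are not part of the timing. */
    memset(src, 0xA5, bytes);
    memset(dst, 0, bytes);

    volatile unsigned char sink = 0;
    double best = 0.0;
    for (int i = 0; i < passes; ++i) {
        double t0 = now_seconds();
        memcpy(dst, src, bytes);
        double dt = now_seconds() - t0;

        /* Read the destination so the copy cannot be discarded. */
        sink ^= dst[bytes - 1];

        if (dt > 0.0) {
            double rate = (double)bytes / dt / 1e6;
            if (rate > best)
                best = rate;
        }
    }
    (void)sink;

    free(src);
    free(dst);
    return best;
}
```

Calling `memcpy_bandwidth_mb_s((size_t)16 << 20, 5)` copies 16 MiB five times and returns the best rate. The result depends strongly on buffer size relative to the caches, so small buffers will overstate the DDR throughput.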

 

To clarify: 

The extremely low performance is caused by the clEnqueueReadBuffer and clEnqueueWriteBuffer calls. These calls block even though I pass CL_FALSE to the blocking_write/blocking_read parameter. CPU usage is also high during these operations. Is this simply an ALTERA OpenCL bug?
Altera_Forum

You can try allocating shared memory between the FPGA and ARM CPU. For example, see page 1-63 of the Altera SDK for OpenCL Programming Guide: https://www.altera.com/content/dam/altera-www/global/en_us/pdfs/literature/hb/opencl-sdk/aocl_programming_guide.pdf

Altera_Forum

 

--- Quote Start ---  

I did some more tests, here are the results: 

 

Memory read test kernel (Read array of integers and compute XOR): ~2.1GB/s 

Memory write test kernel (Write consecutive integers into an array): ~2.0GB/s 

Host memory test (memcpy): ~1.5GB/s 

 

To clarify: 

The extremely low performance is caused by the clEnqueueReadBuffer and clEnqueueWriteBuffer calls. These calls block even though I pass CL_FALSE to the blocking_write/blocking_read parameter. CPU usage is also high during these operations. Is this simply an ALTERA OpenCL bug? 

--- Quote End ---  

 

 

How did you determine that the problem is there? Did you read it from a report?
Altera_Forum

 

--- Quote Start ---  

You can try allocating shared memory between the FPGA and ARM CPU. For example, see page 1-63 of the Altera SDK for OpenCL Programming Guide: https://www.altera.com/content/dam/altera-www/global/en_us/pdfs/literature/hb/opencl-sdk/aocl_programming_guide.pdf

--- Quote End ---  

 

 

Thanks for the tip. It works, but unfortunately it doesn't quite solve my problem, since writing into the shared memory buffer (with memcpy) is just as slow as the clEnqueueWriteBuffer call. It is quite interesting, though, that memcpy is slower for some memory buffers than for others when there is supposedly only one physical DDR3 memory on the board. The document you linked mentions an HPS DDR and an FPGA DDR. Maybe it is not one physical memory after all. I see no other explanation.

 

 

--- Quote Start ---  

How did you determine that the problem is there? Did you read it from a report?

--- Quote End ---  

 

 

I simply measured the time the clEnqueueWriteBuffer and clEnqueueReadBuffer calls take.
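
A rough sketch of that kind of measurement in plain C, in case it helps. Everything named here (`throughput_mb_s`, `timed_call_mb_s`, `example_transfer`) is hypothetical and only for illustration; `example_transfer` just sleeps for 10 ms as a stand-in for the real transfer. In actual host code the callback would enqueue the transfer and then wait for it (for example with `clFinish`), since a non-blocking enqueue is allowed to return before the copy has finished.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Signature of the operation being timed (hypothetical). */
typedef void (*transfer_fn)(void *ctx);

/* Convert a byte count and a duration into MB/s (1 MB = 1e6 bytes). */
double throughput_mb_s(size_t bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)bytes / seconds / 1e6;
}

/* Time one call to `fn` with a monotonic clock. Stores the elapsed time in
 * *elapsed_s (if non-NULL) and returns the throughput in MB/s, or 0.0 if the
 * clock did not advance. */
double timed_call_mb_s(transfer_fn fn, void *ctx, size_t bytes,
                       double *elapsed_s)
{
    assert(fn != NULL && bytes > 0);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn(ctx);  /* real code: enqueue the transfer, then wait for completion */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double dt = (double)(t1.tv_sec - t0.tv_sec)
              + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
    if (elapsed_s != NULL)
        *elapsed_s = dt;
    return dt > 0.0 ? throughput_mb_s(bytes, dt) : 0.0;
}

/* Stand-in for a transfer: sleeps for 10 ms instead of moving any data. */
void example_transfer(void *ctx)
{
    (void)ctx;
    struct timespec delay = { 0, 10 * 1000 * 1000 };
    nanosleep(&delay, NULL);
}
```

As a sanity check on the arithmetic, `throughput_mb_s(30000000, 1.0)` gives 30.0, i.e. 30 MB moved in one second corresponds to the ~30MB/s rate reported at the top of the thread.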