Altera_Forum
Honored Contributor I
2,447 Views

Speed up data-transfer between Host and Device

Is there a way to speed up the data transfer between the host (ARM) and the device (Cyclone V)? 

 

I need to load a uchar array of size 672 x 2993 to the FPGA -> compute on this array -> get the results from the FPGA, 

which takes 0.056 seconds in total. However, 0.042 seconds are spent in the following function calls: 

clEnqueueWriteBuffer,  

3x clSetKernelArg,  

clEnqueueReadBuffer  

(the clCreateBuffer functions are called once in an init method) 

 

Thanks :)
9 Replies
Altera_Forum
Honored Contributor I
109 Views

Well, this trade-off always exists that if your host to device transfer takes longer than your compute, you just compute on the host... 

 

There is a note in Altera's documents that your OpenCL buffers must be 64-bit aligned to get full performance of DMA through PCI-E for host to device transfers, but I don't think this applies to the Cyclone SoCs since there is no PCI-E. On the other hand, I was under the impression that you can have shared memory between the ARM and the FPGA on these SoCs so that everything you malloc on the ARM is directly accessible to the FPGA. You can try passing the host pointer to the FPGA (CL_MEM_USE_HOST_PTR), instead of copying the data, and see what happens.
Altera_Forum
Honored Contributor I
109 Views

The SoCs have shared memory between the HPS and FPGA. Try declaring the buffer with the CL_MEM_ALLOC_HOST_PTR flag. More info is in the OpenCL programming guide (https://www.altera.com/content/dam/altera-www/global/en_us/pdfs/literature/hb/opencl-sdk/aocl_progra...)

Altera_Forum
Honored Contributor I
109 Views

I followed the example in 1.7.7 Allocating Shared Memory for OpenCL Kernels Targeting SoCs, but my code now outputs wrong values (without the CL_MEM_ALLOC_HOST_PTR part it works). 

 

Do you have a full hello-world example for this task? Or could you look through my implementation to see if I'm doing something wrong: 

 

// context, queue, program already created
// input: Mat img (OpenCV matrix holding grayscale values)
int size_cols = COLS + 6;
int size_rows = ROWS + 6;
copyMakeBorder(img, img, 3, 3, 3, 3, BORDER_REFLECT_101);
Mat output(img.rows, img.cols, CV_8UC1);
int pixels = img.rows * img.cols;

// Buffers
buffer_img_GAUSS = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR,
                                  sizeof(uchar) * size_cols * size_rows, NULL, &status);
checkError(status, "clCreateBuffer");
uchar *src_ptr = (uchar *)clEnqueueMapBuffer(cmdQueue, buffer_img_GAUSS, CL_TRUE, CL_MAP_READ,
                                             0, sizeof(uchar) * size_cols * size_rows,
                                             0, NULL, NULL, &status);
checkError(status, "clEnqueueMapBuffer");
*src_ptr = *img.data; // data = pointer to the data of img
buffer_outputimg_GAUSS = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                        sizeof(uchar) * size_cols * size_rows, NULL, &status);

// Kernel
kernel_gaussneu = clCreateKernel(program, "gaussneu", &status);

// Set arguments
status = clSetKernelArg(kernel_gaussneu, 0, sizeof(cl_mem), &buffer_img_GAUSS);
checkError(status, "clSetKernelArg");
status = clSetKernelArg(kernel_gaussneu, 1, sizeof(cl_mem), &buffer_outputimg_GAUSS);
status = clSetKernelArg(kernel_gaussneu, 2, sizeof(int), &pixels);
size_t sobelSize = 1;

// Run kernel
clEnqueueUnmapMemObject(cmdQueue, buffer_img_GAUSS, src_ptr, 0, NULL, NULL);
status = clEnqueueNDRangeKernel(cmdQueue, kernel_gaussneu, 1, NULL, &sobelSize, &sobelSize, 0, NULL, NULL);
checkError(status, "clEnqueueNDRangeKernel");
clFinish(cmdQueue);

// Get output
status = clEnqueueReadBuffer(cmdQueue, buffer_outputimg_GAUSS, CL_FALSE, 0,
                             sizeof(uchar) * output.cols * output.rows, output.data,
                             0, NULL, NULL);
checkError(status, "clEnqueueReadBuffer");

Thanks for the help
Altera_Forum
Honored Contributor I
109 Views

Try moving your `clEnqueueUnmapMemObject` call before launching your kernel. From the OpenCL documentation (https://www.khronos.org/registry/opencl/sdk/1.0/docs/man/xhtml/clenqueuemapbuffer.html):  

 

"The contents of the regions of a memory object mapped for writing (i.e. CL_MAP_WRITE is set in map_flags argument to clEnqueueMapBuffer or clEnqueueMapImage) are considered to be undefined until this region is unmapped. Reads and writes by a kernel executing on a device to a memory region(s) mapped for writing are undefined."
Altera_Forum
Honored Contributor I
109 Views

Thanks for the suggestion, I changed it but it didn't change the output. I'll update my code to the edited one. Do you have any other ideas?

Altera_Forum
Honored Contributor I
109 Views

Looking at this again, actually, I would say there is likely an issue where you set your data:  

 

*src_ptr = *img.data; 

 

You are dereferencing your pointers (the asterisks), which means you are not actually setting the pointer but rather accessing the value at that pointer. (I don't remember the behaviour of this exactly... but I believe this will just copy the first value of the array.) Try it without the asterisks.  

 

If that doesn't work, try using `memcpy` to copy the values from img.data to src_ptr directly... I believe this is only shared physical memory at this point, and I'm not sure how well the implementation handles the virtual memory address space on the host side; maybe there is a problem unmapping `src_ptr` from a different location? (This last bit is speculation, though, and I don't know enough to say for sure.)
Altera_Forum
Honored Contributor I
109 Views

Replacing  

*src_ptr = *img.data; 

with  

memcpy(src_ptr, img.data, sizeof(uchar) * size_cols * size_rows); 

worked, but now I have a copy in my code again (no improvement in runtime). Using 

src_ptr = img.data; 

gives an empty output (as always). Do you have an idea how to handle it without the memcpy function? Thanks
Altera_Forum
Honored Contributor I
109 Views

 

--- Quote Start ---  

Well, this trade-off always exists that if your host to device transfer takes longer than your compute, you just compute on the host... 

 

There is a note in Altera's documents that your OpenCL buffers must be 64-bit aligned to get full performance of DMA through PCI-E for host to device transfers, but I don't think this applies to the Cyclone SoCs since there is no PCI-E. On the other hand, I was under the impression that you can have shared memory between the ARM and the FPGA on these SoCs so that everything you malloc on the ARM is directly accessible to the FPGA. You can try passing the host pointer to the FPGA (CL_MEM_USE_HOST_PTR), instead of copying the data, and see what happens. 

--- Quote End ---  

 

 

I found the following phrase in the Intel® FPGA SDK for OpenCL™ Programming Guide: 

 

"You cannot use the library function malloc or the operator new to allocate physically shared memory. Also, the CL_MEM_USE_HOST_PTR flag does not work with shared memory." 

 

So I think it's not possible to use this with the Cyclone V SoC, because it has shared memory by default? 

 

I also have a question about the part of the Intel® FPGA SDK for OpenCL™ Programming Guide on p. 95 that explains the use of CL_MEM_ALLOC_HOST_PTR. 

With this flag I only allocate the memory in the shared memory between CPU and FPGA, but since I'm using the Cyclone V SoC, isn't the buffer placed in shared memory by default (using clCreateBuffer)? 

 

So how does this improve the data transfer between host and device? Isn't there a way for the FPGA to use the data from the shared memory directly? 

 

Maybe someone can explain what this part of the guide means. Is this the way to go to transfer the data efficiently? I don't quite understand the point of it.  

 

To transfer data from shared hard processor system (HPS) DDR to FPGA DDR efficiently, include a kernel that performs the memcpy function, as shown below.

__attribute__((num_simd_work_items(8)))
__kernel void mem_stream(__global uint *src, __global uint *dst)
{
    size_t gid = get_global_id(0);
    dst[gid] = src[gid];
}
Altera_Forum
Honored Contributor I
109 Views

Thank you guys! On the CV kit the BSP is very slow at uploading data to the ARM; this direct memory method with CL_MEM_ALLOC_HOST_PTR makes it work up to 3 times faster.
