Re: Memory leak using emulator

Altera_Forum · ‎07-12-2017

Hi

I'm porting a CUDA program to OpenCL to run on an FPGA. Right now I'm using the emulator since I don't have the device yet.

I wrote an OpenCL kernel that does some simple computation on the image passed from the GPU. For some reason the memory usage increases dramatically with each pixel it computes, and it overflows at the third frame.

The error messages are:

Context callback: Could not allocate a buffer in host memory

Context callback: Could not map host buffers to device

ERROR: CL_OUT_OF_HOST_MEMORY

I did release the buffer after each frame and free the host memory as well, but the memory still accumulates.

Launching kernel part (runs for each frame):

////////////////////////////////////////////////////////////////////////////////////////////////////////
    cl_int status;
    cufftComplex* h_afPadScnOut;
    h_afPadScnOut = (cufftComplex *)malloc(giScnMemSzCmplx);
    CUDA_SAFE_CALL(cudaMemcpy(h_afPadScnOut, gd_afPadScnOut, giScnMemSzCmplx, cudaMemcpyDeviceToHost));// copy memory to host
    cl_mem cl_d_afPadScnOut = clCreateBuffer(context, CL_MEM_USE_HOST_PTR, giScnMemSzCmplx, h_afPadScnOut, NULL);
    cl_event* write_event = (cl_event *)malloc(sizeof(cl_event));
    status = clEnqueueWriteBuffer(queue, cl_d_afPadScnOut, CL_TRUE, 0, giScnMemSzCmplx, h_afPadScnOut, 0, NULL, write_event);// write into CL buffer
    checkError(status, "Failed to write buffer cl_gd_afPadScnOut");
    // Set the kernel arguments 
    status = clSetKernelArg(kthLaw_kernel, 0, sizeof(cl_mem), (void*)&cl_d_afPadScnOut);
    checkError(status, "Failed to set kernel arg 0");
    status = clSetKernelArg(kthLaw_kernel, 1, sizeof(cl_int), (void*)&giScnSz);
    checkError(status, "Failed to set kernel arg 1");
    printf("\nKernel initialization is complete.\n");
    printf("Launching the kernel...\n\n");
    // Configure work set over which the kernel will execute
    size_t wgSize[3] = { 256, 1, 1 };
    size_t gSize[3] = { 307200, 1, 1 };
    // Launch the kernel
    status = clEnqueueNDRangeKernel(queue, kthLaw_kernel, 1, NULL, gSize, wgSize, 1, write_event, NULL);
    checkError(status, "Failed to launch kernel");
    clReleaseEvent(*write_event);
    //Read back data 
    status = clEnqueueReadBuffer(queue, cl_d_afPadScnOut, CL_TRUE, 0, giScnMemSzCmplx, h_afPadScnOut, 0, NULL, NULL);
    checkError(status, "Failed to read buffer cl_gd_afPadScnOut");
    //Free CL buffer
    status = clReleaseMemObject(cl_d_afPadScnOut);
    checkError(status, "Failed to release buffer");
    // Wait for command queue to complete pending events
    status = clFinish(queue);
    checkError(status, "Failed to finish");
    printf("\nKernel execution is complete.\n");
    // Free the resources allocated
    //AOCLcleanup();
    CUDA_SAFE_CALL(cudaMemcpy(gd_afPadScnOut, h_afPadScnOut, giScnMemSzCmplx, cudaMemcpyHostToDevice));
    free(h_afPadScnOut);
    ////////////////////////////////////////////////////////////////////////////////////////////////////////

Kernel:


__kernel void kthLaw(__global float2* d_afPadScn, int dataN)
{
    int iIndx = get_global_id(0);
    if (iIndx < dataN)
    {
        //afVals(:) = (abs(afVals(:)).^k) .* (cos(angle(afVals(:))) + sin(angle(afVals(:)))*i);
        float2 cDat = d_afPadScn[iIndx];
        float fNewAbsDat = pow(sqrtf(pow(cDat.x, 2) + pow(cDat.y, 2)), 0.1);
        float fAngDat = atan2(cDat.y, cDat.x);
        cDat.x = fNewAbsDat*cosf(fAngDat);
        cDat.y = fNewAbsDat*sinf(fAngDat);
        d_afPadScn[iIndx] = cDat;
    }
}

Also, I saw the memory increasing in the Task Manager. Is there a way to print out the memory usage from the kernel?

Any advice will be appreciated.

-------------------------update-------------------------

Well, I read some material from Altera, and they said executing a large number of parallel kernels is not feasible on an FPGA; instead we should use a pipelined design.

So I rewrote the kernel as a serial loop, and the memory problem went away and the emulator runs faster!

I guess I hadn't thought it through properly: the emulator emulates the behavior of an FPGA, where the kernels are actual hardware, so of course they can't be freed at runtime...

Altera_Forum · ‎07-13-2017

Why do you have "cudaMemcpy" in an OpenCL code??!! How do you even compile this code?

Altera_Forum · ‎07-13-2017

--- Quote Start ---

Why do you have "cudaMemcpy" in an OpenCL code??!! How do you even compile this code?

--- Quote End ---

Because the original code was written in CUDA and runs on CPU/GPU, and I want to port the code to OpenCL one part at a time.

So this code is running on the CPU/GPU and the emulator (also CPU).

I compile the way the Altera design examples do: Visual Studio (including the CUDA headers) for the host program and aoc for the kernel, and I execute it from the command prompt.

The CUDA part shouldn't be a problem since the output frames are fine.

Altera_Forum · ‎07-13-2017

Are you trying to perform part of your computation on a GPU using CUDA, and then pass the output to an FPGA using OpenCL? If this is the case, I wouldn't expect it to work at all since you are mixing libraries with completely different characteristics. Since OpenCL works just fine on GPUs, I recommend porting everything to OpenCL first on a GPU, and then trying to port it for FPGAs.

Also, if you are using "clEnqueueWriteBuffer" to write your host buffer to the device, you shouldn't use "CL_MEM_USE_HOST_PTR" when creating the device buffer; the latter is for when you do not want to explicitly copy the buffer from host to device, and instead let the OpenCL runtime decide when and how to do the transfer. This is mostly useful for targeting CPUs, to avoid allocating two copies of the same buffer in host memory (which is the same as device memory in this case).

Altera_Forum · ‎07-13-2017

The goal is to port everything to OpenCL/FPGA. The whole program is not trivial so I'm trying to port one kernel at a time.

Thank you for the suggestion, I will try to run OpenCL on the GPU as well.

Do you mean CL_MEM_COPY_HOST_PTR is the proper way? I did try it but the result was the same.

The problem I'm facing is that whenever I launch a kernel thread on the emulator it takes up more memory, and the memory isn't released after the thread is done.

-----------------------------

Well, I guess I found the problem :p

Altera_Forum · ‎07-14-2017

--- Quote Start ---

Do you mean CL_MEM_COPY_HOST_PTR is the proper way? I did try it but the result was the same.

--- Quote End ---

No, if you are going to manually copy the host buffer to the device using clEnqueueWriteBuffer, you should create your device buffer like this:

cl_mem cl_d_afPadScnOut = clCreateBuffer(context, CL_MEM_READ_WRITE, giScnMemSzCmplx, NULL, NULL);

Altera_Forum · ‎07-14-2017

Thanks HRZ

What's still bothering me is why the memory keeps accumulating when I launch the kernel in parallel for each frame.

When I use a serial for loop, the memory doesn't increase as more frames are computed.

The same kernel launched with different input data should use the same hardware in the FPGA, right?

Altera_Forum · ‎07-15-2017

I am not really sure; I guess that for parallel operations the emulator keeps all data in memory until execution has finished. Since the emulator is extremely slow anyway, when I do have to use it I try to debug my code with very small inputs.
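When shrinking the input for emulator runs, the hard-coded global size of 307200 has to shrink with it. In OpenCL 1.x the global size must be a multiple of the work-group size whenever a work-group size is passed, so one option is to round the element count up and rely on the kernel's `if (iIndx < dataN)` guard for the padding work-items. A minimal sketch; `round_up_global_size` is a hypothetical helper, not part of the OpenCL API:

```c
#include <stddef.h>

/* Smallest multiple of wg_size that is >= n. wg_size must be non-zero. */
size_t round_up_global_size(size_t n, size_t wg_size)
{
    return ((n + wg_size - 1) / wg_size) * wg_size;
}
```

For example, a 1000-element test input with a work-group size of 256 would give a global size of 1024, and the last 24 work-items would do nothing because of the bounds check in the kernel.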