Intel® Quartus® Prime Software

Memory leak using emulator

Altera_Forum
Honored Contributor II
2,248 Views

Hi 

 

I'm porting a CUDA program to OpenCL so it can run on an FPGA. For now I'm using the emulator, since I don't have the device yet. 

I wrote an OpenCL kernel that does some simple computation on the image passed from the GPU. For some reason, memory usage increases dramatically with each pixel it computes, and the host runs out of memory at the third frame. 

The error messages are:  

Context callback: Could not allocate a buffer in host memory 

Context callback: Could not map host buffers to device 

ERROR: CL_OUT_OF_HOST_MEMORY 

 

 

I do release the buffer after each frame and free the host memory as well, but the memory usage still accumulates.  

 

Kernel launch code (runs for each frame): 

////////////////////////////////////////////////////////////////////////////////////////////////////////
cl_int status;
cufftComplex* h_afPadScnOut;
h_afPadScnOut = (cufftComplex *)malloc(giScnMemSzCmplx);
CUDA_SAFE_CALL(cudaMemcpy(h_afPadScnOut, gd_afPadScnOut, giScnMemSzCmplx, cudaMemcpyDeviceToHost)); // copy memory to host

cl_mem cl_d_afPadScnOut = clCreateBuffer(context, CL_MEM_USE_HOST_PTR, giScnMemSzCmplx, h_afPadScnOut, NULL);
cl_event* write_event = (cl_event *)malloc(sizeof(cl_event));
status = clEnqueueWriteBuffer(queue, cl_d_afPadScnOut, CL_TRUE, 0, giScnMemSzCmplx, h_afPadScnOut, 0, NULL, write_event); // write into CL buffer
checkError(status, "Failed to write buffer cl_gd_afPadScnOut");

// Set the kernel arguments
status = clSetKernelArg(kthLaw_kernel, 0, sizeof(cl_mem), (void*)&cl_d_afPadScnOut);
checkError(status, "Failed to set kernel arg 0");
status = clSetKernelArg(kthLaw_kernel, 1, sizeof(cl_int), (void*)&giScnSz);
checkError(status, "Failed to set kernel arg 1");
printf("\nKernel initialization is complete.\n");
printf("Launching the kernel...\n\n");

// Configure work set over which the kernel will execute
size_t wgSize[3] = { 256, 1, 1 };
size_t gSize[3] = { 307200, 1, 1 };

// Launch the kernel
status = clEnqueueNDRangeKernel(queue, kthLaw_kernel, 1, NULL, gSize, wgSize, 1, write_event, NULL);
checkError(status, "Failed to launch kernel");
clReleaseEvent(*write_event);

// Read back data
status = clEnqueueReadBuffer(queue, cl_d_afPadScnOut, CL_TRUE, 0, giScnMemSzCmplx, h_afPadScnOut, 0, NULL, NULL);
checkError(status, "Failed to read buffer cl_gd_afPadScnOut");

// Free CL buffer
status = clReleaseMemObject(cl_d_afPadScnOut);
checkError(status, "Failed to release buffer");

// Wait for command queue to complete pending events
status = clFinish(queue);
checkError(status, "Failed to finish");
printf("\nKernel execution is complete.\n");

// Free the resources allocated
//AOCLcleanup();
CUDA_SAFE_CALL(cudaMemcpy(gd_afPadScnOut, h_afPadScnOut, giScnMemSzCmplx, cudaMemcpyHostToDevice));
free(h_afPadScnOut);
////////////////////////////////////////////////////////////////////////////////////////////////////////

 

Kernel:

__kernel void kthLaw(__global float2* d_afPadScn, int dataN)
{
    int iIndx = get_global_id(0);
    if (iIndx < dataN)
    {
        //afVals(:) = (abs(afVals(:)).^k) .* (cos(angle(afVals(:))) + sin(angle(afVals(:)))*i);
        float2 cDat = d_afPadScn[iIndx];
        float fNewAbsDat = pow(sqrtf(pow(cDat.x, 2) + pow(cDat.y, 2)), 0.1);
        float fAngDat = atan2(cDat.y, cDat.x);
        cDat.x = fNewAbsDat*cosf(fAngDat);
        cDat.y = fNewAbsDat*sinf(fAngDat);
        d_afPadScn[iIndx] = cDat;
    }
}

 

I also saw the memory increasing in Task Manager. Is there a way to print the memory usage from the kernel? 
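
As far as I know, OpenCL C has no built-in for querying host memory from inside a kernel. Since the emulator runs in the host process, one option is to log the process's memory from the host program after each frame. The sketch below is an assumption about how that could look, not something from the original code: host_memory_kib and log_frame_memory are hypothetical helper names. On Windows it uses GetProcessMemoryInfo (link with psapi.lib); elsewhere it falls back to getrusage.

```c
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#include <psapi.h>          /* link with psapi.lib */
#else
#include <sys/resource.h>
#endif

/* Returns a memory figure for the current process, or 0 on failure.
   Windows: current working set in KiB.
   Elsewhere: peak resident set size as reported by getrusage
   (KiB on Linux, bytes on macOS). */
size_t host_memory_kib(void)
{
#ifdef _WIN32
    PROCESS_MEMORY_COUNTERS pmc;
    if (!GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc)))
        return 0;
    return (size_t)(pmc.WorkingSetSize / 1024);
#else
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return 0;
    return (size_t)ru.ru_maxrss;
#endif
}

/* Hypothetical helper: print one line per processed frame. */
void log_frame_memory(int frame)
{
    printf("frame %d: host memory ~%lu KiB\n",
           frame, (unsigned long)host_memory_kib());
}
```

Calling log_frame_memory(frame) right after clFinish(queue) in the per-frame code would show whether the growth happens during the kernel launch or elsewhere.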

 

Any advice will be appreciated. 

 

-------------------------update------------------------- 

Well, I read some material from Altera, and it says that executing a large number of parallel kernels is not feasible on an FPGA; we should use a pipelined design instead.  

So I rewrote the kernel in serial form, and the memory problem went away and the emulator runs faster!  

I guess I hadn't thought it through properly: the emulator emulates the behavior of an FPGA, where the kernels are actual hardware, so of course they can't be freed at runtime...
7 Replies
Altera_Forum
Honored Contributor II
752 Views

Why do you have "cudaMemcpy" in an OpenCL code??!! How do you even compile this code?

Altera_Forum
Honored Contributor II
752 Views

 

--- Quote Start ---  

Why do you have "cudaMemcpy" in an OpenCL code??!! How do you even compile this code? 

--- Quote End ---  

 

 

Because the original code was written in CUDA and runs on the CPU/GPU, and I want to port it to OpenCL one part at a time.  

So this code runs on the CPU/GPU and on the emulator (also CPU).  

I compile it the way the Altera design examples do: Visual Studio (with the CUDA headers included) for the host program and aoc for the kernel, then I run it from the command prompt. 

The CUDA part shouldn't be a problem, since the output frames are fine.
Altera_Forum
Honored Contributor II
752 Views

Are you trying to perform part of your computation on a GPU using CUDA, and then pass the output to an FPGA using OpenCL? If this is the case, I wouldn't expect it to work at all since you are mixing libraries with completely different characteristics. Since OpenCL works just fine on GPUs, I recommend porting everything to OpenCL first on a GPU, and then trying to port it for FPGAs. 

 

Also, if you are using "clEnqueueWriteBuffer" to write your host buffer to the device, you shouldn't use "CL_MEM_USE_HOST_PTR" when creating the device buffer. That flag is for when you do not want to copy the buffer from host to device explicitly and would rather let the OpenCL runtime decide when and how to do the transfer. This is mostly useful when targeting CPUs, to avoid allocating two copies of the same buffer in host memory (which is the same as device memory in this case).
Altera_Forum
Honored Contributor II
752 Views

The goal is to port everything to OpenCL/FPGA. The whole program is not trivial, so I'm trying to port one kernel at a time.  

Thank you for the suggestion; I will try running OpenCL on a GPU as well. 

 

Do you mean CL_MEM_COPY_HOST_PTR is the proper way? I did try it but the result was the same. 

 

The problem I'm facing is that whenever I launch a kernel thread on the emulator, it takes more memory, and it doesn't release that memory when it's done. 

----------------------------- 

Well, I guess I found the problem :p
Altera_Forum
Honored Contributor II
752 Views

 

--- Quote Start ---  

Do you mean CL_MEM_COPY_HOST_PTR is the proper way? I did try it but the result was the same. 

--- Quote End ---  

 

 

No, if you are going to manually copy the host buffer to the device using clEnqueueWriteBuffer, you should create your device buffer like this: 

 

cl_mem cl_d_afPadScnOut = clCreateBuffer(context, CL_MEM_READ_WRITE, giScnMemSzCmplx, NULL, NULL);
Altera_Forum
Honored Contributor II
752 Views

Thanks HRZ 

 

What's still bothering me is this: why does the memory keep accumulating when I launch the kernel in parallel for each frame? 

When I use a serial for loop, the memory doesn't increase as more frames are computed. 

The same kernel launched with different input data should use the same hardware in the FPGA, right?
Altera_Forum
Honored Contributor II
752 Views

I am not really sure; I guess that for parallel operations the emulator keeps all data in memory until execution has finished. Since the emulator is extremely slow anyway, if I do have to use it, I try to debug my code with very small inputs.
