Intel® Quartus® Prime Software

Calling clEnqueueTask multiple times is not working.

Altera_Forum

I am running a program on a server with 32 GB of RAM. We need to multiply a 64x64x256 image with a filter of the same size (64x64x256). The problem is that we have 2614600 such filters, and both the image and the filter pixel values are floats.
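To put the sizes above in perspective, here is a minimal sketch of the arithmetic. The helper names `volume_bytes` and `total_bytes` are hypothetical and not part of the posted code, and the result assumes a 4-byte `float`.

```c
#include <stdint.h>

/* Bytes needed for one w x h x c volume of floats.
   Assumes sizeof(float) == 4, as on common platforms. */
static uint64_t volume_bytes(uint64_t w, uint64_t h, uint64_t c)
{
    return w * h * c * (uint64_t)sizeof(float);
}

/* Bytes needed to hold `count` volumes of `per_volume` bytes at once. */
static uint64_t total_bytes(uint64_t count, uint64_t per_volume)
{
    return count * per_volume;
}
```

With the figures from the post, `volume_bytes(64, 64, 256)` is 4194304 bytes (4 MiB), and `total_bytes(2614600, 4194304)` is 10966427238400 bytes, a little under 10 TiB. The full filter set therefore cannot sit in 32 GB of RAM at once, which is consistent with streaming one 4 MiB filter per iteration as the posted loop does.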

 

We implemented this as shown in the code below, but the program gets killed without any error or warning, not even a segmentation fault. Watching the process in top, we noticed that its virtual memory keeps increasing, and at some threshold the program gets killed.

 

Note that this is pseudocode to cross-check our requirement. The original code will not iterate up to 69206016; the iteration count will depend on some parameters.

 

Note: the filter count of 2614600 does not matter, as we may get more filters in the future. Our main concern is to be able to run this loop indefinitely. Currently our code terminates after around 32620 iterations, while in theory it should run forever. We just don't know what we are doing wrong. Can anyone help us with this?

 

One more thing: if we remove the wait on the kernel event, the program keeps running, but waiting for the kernel to finish is also one of our requirements.

 

// Command queue
fc_queue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &status);
checkError(status, "Failed to create command queue");

FC_input_buf = clCreateBuffer(context, CL_MEM_READ_ONLY, 64x64x256 * sizeof(float), NULL, &status);
checkError(status, "Failed to create buffer for input image");

FC_output_buf = clCreateBuffer(context, CL_MEM_READ_ONLY, 1024 * sizeof(float), NULL, &status);
checkError(status, "Failed to create buffer for input image");

status = clEnqueueWriteBuffer(fc_queue, FC_input_buf, CL_TRUE, 0, 64x64x256 * sizeof(float), input_image, 0, NULL, NULL);
checkError(status, "Failed to transfer input.");

for (unsigned int ff = 0; ff < 2614600; ff++) {  // Number of filters
    // Buffer
    FC_weight_buf = clCreateBuffer(context, CL_MEM_READ_ONLY, 64x64x256 * sizeof(float), NULL, &status);
    checkError(status, "Failed to create buffer for FC_weights");

    // NOTE: this pseudocode uses the same FC_weights for every iteration.
    // In the original code it is an offset into fc_weights based on the
    // iteration number, but the size is the same for each iteration.
    status = clEnqueueWriteBuffer(fc_queue, FC_weight_buf, CL_TRUE, 0, 64x64x256 * sizeof(float), FC_weights, 0, NULL, NULL);
    checkError(status, "Failed to transfer weights");

    unsigned argi = 0;
    status = clSetKernelArg(fc_kernal, argi++, sizeof(cl_mem), &FC_input_buf);
    checkError(status, "Failed to set argument %d", argi - 1);
    status = clSetKernelArg(fc_kernal, argi++, sizeof(cl_mem), &FC_weight_buf);
    checkError(status, "Failed to set argument %d", argi - 1);
    status = clSetKernelArg(fc_kernal, argi++, sizeof(cl_mem), &FC_output_buf);
    checkError(status, "Failed to set argument %d", argi - 1);

    cl_event kernel_event = NULL;
    status = clEnqueueTask(fc_queue, fc_kernal, 0, NULL, &kernel_event);
    checkError(status, "Failed to launch kernel");

    // NOTE: we have to wait until this kernel finishes its execution.
    // We can also use an NDRange kernel here; I get the same problem
    // with an NDRange kernel.
    clWaitForEvents(1, &kernel_event);

    clReleaseMemObject(FC_weight_buf);
    FC_weight_buf = NULL;
}

clReleaseMemObject(FC_input_buf);
clReleaseMemObject(FC_output_buf);
clReleaseCommandQueue(fc_queue);
3 Replies
Altera_Forum

The code looks like it should work. 

I'm wondering if the FC_weight_buf[0] isn't getting released somehow. 

 

Could you try moving the clCreateBuffer and clReleaseMemObject calls outside the for loop and see if you get the same error?

Calling clEnqueueWriteBuffer inside the loop should still work as expected; it will just overwrite the data from the previous iteration, which isn't being used anymore.
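The suspicion above, that something created in every iteration is never released, can be illustrated with a toy model. The sketch below is plain C, not OpenCL: `handle`, `toy_enqueue`, `toy_release`, and `run_loop` are hypothetical names that only mimic a reference-counted object handed out once per loop iteration. In the posted loop, the per-iteration objects are FC_weight_buf, which is released with clReleaseMemObject, and kernel_event, for which no clReleaseEvent call appears. Whether that accounts for the emulator failure is not established in this thread.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for a runtime-owned, reference-counted object.
   Hypothetical: it models lifetime rules only, not a real OpenCL type. */
typedef struct handle { int refcount; } handle;

static size_t live_handles = 0;  /* objects the toy "runtime" still holds */

/* Models an enqueue call that hands back an object with one reference. */
static handle *toy_enqueue(void)
{
    handle *h = malloc(sizeof *h);
    if (h == NULL)
        return NULL;
    h->refcount = 1;
    live_handles++;
    return h;
}

/* Models a release call: drop one reference and free the object at zero. */
static void toy_release(handle *h)
{
    if (h != NULL && --h->refcount == 0) {
        free(h);
        live_handles--;
    }
}

/* Runs `iterations` loop bodies and returns how many handles remain live.
   With release_each_iteration == 0 the handles are leaked on purpose. */
static size_t run_loop(size_t iterations, int release_each_iteration)
{
    live_handles = 0;
    for (size_t i = 0; i < iterations; i++) {
        handle *ev = toy_enqueue();
        /* ... the real loop would wait for completion here ... */
        if (release_each_iteration)
            toy_release(ev);
    }
    return live_handles;
}
```

In this model, `run_loop(1000, 0)` leaves 1000 live handles while `run_loop(1000, 1)` leaves none, so the footprint grows linearly with the iteration count only when the per-iteration release is missing.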
Altera_Forum

 


 

 

We already tried this. We are getting the same error. 

 

We are now running the attached standalone program and kernel, and we get the output shown below. If this code works, it can be used directly in our real application.

This standalone code always terminates at iteration 32732. The virtual memory shown by top increases continuously, and after it reaches about 320 GB the program fails with the following error.

 

 

--- Quote Start ---
......
Kernal 32729 -> 1.374 ms
Kernal 32730 -> 1.309 ms
Kernal 32731 -> 1.275 ms
Context callback: Emulator: Could not start the kernel compute units.
Kernal 32732 -> 1.152 ms
host: acl_offline_hal.c:476: acl_emulator_hal_launch_kernel: Assertion `*thread == 0' failed.
Aborted (core dumped)
--- Quote End ---

Altera_Forum

I found that this is a memory leak issue in the emulator, because the same code works perfectly on an Arria 10 FPGA and a Cyclone V SoC.

I have discussed this with Altera support and am still waiting for confirmation.