OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.

Memory leak

Michael_Northwind
Hello.
This simple project, based on the OpenCL C++ wrapper, shows huge memory leaks on Windows with an Intel processor, even with an empty kernel, but there are no leaks on Mac OS or when running on graphics cards. Is there something wrong with the code, or is this a problem in the wrapper/driver?

Project source:
https://gist.github.com/3175992

Used with Intel OpenCL SDK:
OpenCL CPU Runtime version - 2.0.0.31360
SDK Tools version - 2.0.0.31360
SDK Documentation version - 2.0.0.33655
SDK Installer version - 2.0.0.33746

Processor:
Intel Core i7-3517U
Raghupathi_M_Intel
Hi Michael,

Thanks for providing the test case. I have reproduced the leak (of course I had to reduce the number of iterations). I will investigate further and let you know what the issue is.

Thanks,
Raghu
Ilya_Kulakov
Beginner
Any results? I've encountered the same problem.
Raghupathi_M_Intel
Sorry, it took a while to respond.

It appears that there is a bug in your app. The way it is written, there is no way of ensuring that the command queue has completed.


You need to replace the call to clFlush with clFinish. Currently I cannot test the change, but I recommend you make the change and post here if you are still running into issues. Somehow I overlooked this minor issue in the source you submitted.
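For illustration, the change would look roughly like this (a sketch only, not tested against your gist; the queue variable follows the style of your example code):

// Before: the queue was only flushed, so enqueued commands could still be in
// flight when the process started tearing down.
//     _queues[0].flush();     // clFlush(): submits the commands, does not wait
// After: block until every enqueued command has completed.
_queues[0].finish();           // clFinish(): returns only once the queue is drained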

Thanks,
Raghu

Ilya_Kulakov
Beginner
Does it mean that buffers are cleared only after the whole queue has completed?
Our app has to manipulate multiple queues for different devices. It schedules commands (non-blocking) and adds a completion callback to the last command. In that callback we check whether the result has been found. If it hasn't, we schedule another batch of commands.
As you can see, it's vital for us to avoid blocking operations. I think it's the job of the OpenCL implementation to clear all temporary buffers it creates.
Right now it looks like Intel's OpenCL implementation cannot execute code while the main CPU is busy. Could you confirm that?
We are going to try the following workaround: call clFinish instead of clFlush every 1000 iterations, so the library has time to clear its buffers.
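Roughly along these lines (a sketch; the function and queue names are borrowed from Michael's example as placeholders):

// Sketch: issue a blocking finish() every 1000 iterations so the runtime
// gets a chance to release the resources of already-completed commands.
for (size_t i = 0; i < numIters; ++i)
{
    startSubOperation(data);        // enqueues non-blocking work
    if ((i + 1) % 1000 == 0)
        _queues[0].finish();        // periodic blocking sync point
}
_queues[0].finish();                // final drain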
Raghupathi_M_Intel
Hi Ilya,

clFlush does not guarantee completion of commands, only submission of the queued commands in the command queue to the device. OpenCL specifies that for non-blocking commands the contents of the buffer (for example, in the case of clEnqueueReadBuffer) cannot be used until the read command has completed. So it's the app's responsibility to make sure the command has completed (e.g. by specifying and using an event).
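For example, with a non-blocking read the usual pattern looks roughly like this (an illustrative sketch; the buffer and size names are made up):

cl_event readDone = NULL;
cl_int err = clEnqueueReadBuffer(queue, buffer,
                                 CL_FALSE,            /* non-blocking read */
                                 0, sizeInBytes, hostPtr,
                                 0, NULL, &readDone);
/* ... other work can be enqueued or done on the host here ... */
clWaitForEvents(1, &readDone);   /* hostPtr contents are valid only after this */
clReleaseEvent(readDone);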

Raghu
Ilya_Kulakov
Beginner
Buffers are never shared, nor used after a command is submitted. Also, buffers are not shared between queues, and all queues are in-order. (You can see all of that in Michael's code.)
So I think it's pretty safe (according to the OpenCL standard) to submit as many non-blocking commands as possible.
Anyway, why is it necessary to wait until _all_ commands have executed before clearing temporary buffers that are managed solely by the OpenCL library? As I understand it, they can (and should) be cleared as execution proceeds. That's what we see when using Apple's implementation (OS X Lion) or NVIDIA's (Win 7).
Raghupathi_M_Intel
In the code submitted by Michael there were no blocking calls. Yes, the command queue is in-order, but what happens if the application shuts down while the last command is still executing? I think it results in undefined behavior.

I was wrong to suggest replacing _all_ clFlush calls with clFinish, but there does need to be at least one blocking call to make sure all commands have executed before the app shuts down.

So adding a single clFinish() after the loop is sufficient, instead of replacing all clFlush() calls with clFinish():

for (size_t i = 0; i < numIters; ++i)
{
    startSubOperation(data);    // enqueues non-blocking commands on _queues[0]
}

_queues[0].finish();            // single blocking call: wait for everything to complete

It appeared to be a memory leak because startSubOperation() was enqueuing commands rapidly without waiting for their completion. There was actually no memory leak.

Raghu

Ilya_Kulakov
Beginner
Yes, but it's still a bug, because the library should reserve some time to clear its internal buffers no matter how fast you enqueue commands.
Doron_S_Intel
Employee
Hello Michael and Ilya,

No resources related to a memory object can be released while commands referencing it are in flight. The same goes for command queues, the contexts they exist in, etc. So, when using the asynchronous API (correctly), even clRelease()ing everything won't guarantee that any actual memory release occurs until the in-flight commands (which reference their kernel object, memory objects and command queue) have completed.

If I'm reading your post correctly, you're worried that instead of clearing resources as commands complete, the Intel OpenCL implementation for some reason waits for a later time. I can assure you that's not the case. However, it's possible your host code submits work faster than we can process it, which would naturally lead to increased memory consumption to the point of running out of resources.
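One way to keep memory consumption bounded without fully serializing is to cap the number of in-flight commands using events, for example (a sketch with invented names and limits, not code from the submitted project):

#include <deque>
#include <CL/cl.hpp>

const size_t kMaxInFlight = 1024;          // invented limit, tune as needed
std::deque<cl::Event> inFlight;

for (size_t i = 0; i < numIters; ++i)
{
    if (inFlight.size() >= kMaxInFlight)
    {
        inFlight.front().wait();           // wait only for the oldest command
        inFlight.pop_front();
    }
    cl::Event done;
    queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                               cl::NDRange(globalSize), cl::NullRange,
                               NULL, &done);
    inFlight.push_back(done);
}
queue.finish();                            // final drain before shutdown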

As for your other question, "Right now it looks like Intel's OpenCL implementation cannot execute code while the main CPU is busy. Could you confirm that?", our CPU implementation runs threads on the CPU, so any application threads you're using would contend with them for HW resources. However, the expected behaviour is neither complete starvation of the OpenCL implementation nor starvation of the host code. Could you clarify your question?

Thanks,
Doron Singer
Michael_Northwind
Thanks for the response.

OK, I've modified the code by inserting a finish() call at the end, but memory is still leaking somewhere. For me the process takes about 37 MB of memory after all initializations, but before any kernels are enqueued. Peak memory usage is about 120 MB while working and before finishing the queue. After calling finish() it drops to 60 MB, and the more iterations we run, the bigger the difference between the starting and ending values. Where does this memory go, and how can we clear it? For our project it's really necessary to launch billions of short-running threads.

In the real project we are enqueuing these subOperations through a kernel event callback, so calling finish() in that callback is not possible (it locks the OpenCL thread; I've checked to be sure). Such a structure should not flood the queue, because it waits for the previous block of work to complete. I'm thinking of creating another thread that controls the population of OpenCL work and calls finish() from time to time, but there's still the leakage mentioned above.
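For reference, the callback structure is roughly like this (a simplified sketch with invented names, not the actual project code):

/* Sketch of the re-enqueue-from-callback pattern. Per the OpenCL spec, the
   callback must not make blocking calls such as clFinish(), which matches
   what I observed. */
void CL_CALLBACK onKernelComplete(cl_event evt, cl_int status, void *userData)
{
    WorkState *state = static_cast<WorkState *>(userData);
    if (!state->resultFound())
        state->scheduleNextBatch();        /* enqueues the next non-blocking batch */
}

/* ...when enqueuing a batch: */
cl_event done = NULL;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, &done);
clSetEventCallback(done, CL_COMPLETE, &onKernelComplete, state);
clReleaseEvent(done);                      /* the runtime keeps its own reference */
clFlush(queue);                            /* submit without blocking */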
Doron_S_Intel
Employee
The callback approach is fine.

Just to make sure I understand: the issue is that you take a snapshot of the memory consumption of the process just before enqueuing kernels, then enqueue all the kernels, wait for their completion, and then take a second snapshot and find that memory consumption has changed in a way that can't be explained by lazy reclamation of pages by the OS, yes?

Doron Singer
Michael_Northwind
Yes, you're right. I suppose part of this memory goes to profiling info, but it accumulates with more enqueuing iterations and is never cleared at run time.
Evgeny_F_Intel
Employee
Hi Michael,

Could you please update the Git repository with your latest sources?

Thanks,
Evgeny