This simple project, based on the OpenCL C++ wrapper, shows huge memory leaks on the Windows platform with an Intel processor even with an empty kernel, but there are no leaks on Mac OS or when running on graphics cards. Is there something wrong with the code, or is this a problem in the wrapper/driver?
Used with Intel OpenCL SDK:
OpenCL CPU Runtime version - 18.104.22.168360
SDK Tools version - 22.214.171.124360
SDK Documentation version - 126.96.36.199655
SDK Installer version - 188.8.131.52746
Intel Core i7-3517U
Thanks for providing the test case. I have reproduced the leak (of course I had to reduce the number of iterations). I will investigate further and let you know what the issue is.
It appears that there is a bug in your app. The way it is written, there is no way of making sure the command queue has completed.
You need to replace the call to clFlush with clFinish. Currently I cannot test the change, but I recommend you make the change and post here if you are still running into issues. Somehow I overlooked this minor issue in the source you submitted.
clFlush does not guarantee completion of commands, only submission of queued commands in the command queue to the device. OpenCL specifies that for non-blocking commands the contents of the buffer (for example, in the case of clEnqueueReadBuffer) cannot be used until the read command has completed. So it's the app's responsibility to make sure the command has completed (e.g. by specifying and using an event).
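The event-based pattern described above looks roughly like this in the C API (a sketch only: error handling is elided, `queue`, `devBuf`, `hostBuf` and `nbytes` are assumed to exist already, and it needs an actual OpenCL device to run):

```c
cl_event readDone;
/* Non-blocking read: hostBuf must not be touched yet. */
clEnqueueReadBuffer(queue, devBuf, CL_FALSE, 0, nbytes, hostBuf,
                    0, NULL, &readDone);
clFlush(queue);                 /* submit, but do not wait */
/* ... other host work ... */
clWaitForEvents(1, &readDone);  /* now hostBuf is safe to use */
clReleaseEvent(readDone);
```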
I was wrong in suggesting to replace _all_ clFlushes with clFinish, but there needs to be at least one blocking call to make sure all commands are executed before the app shuts down.
So adding a clFinish() at the end of the loop is sufficient, instead of replacing all clFlush() calls with clFinish().
for (size_t i = 0; i < numIters; ++i)
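Concretely, the fix amounts to keeping the per-iteration clFlush and adding one clFinish after the loop. A sketch (kernel setup and error checks omitted; it assumes a live `queue` and `kernel`):

```c
for (size_t i = 0; i < numIters; ++i) {
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &globalSize, NULL, 0, NULL, NULL);
    clFlush(queue);   /* submit this batch, don't block */
}
clFinish(queue);      /* one blocking call before shutdown */
```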
It appeared like a memory leak because startSubOperation() was enqueuing commands rapidly without waiting for their completion. There was no memory leak, actually.
No resources related to a memory object can be released while commands referencing it are in flight. The same goes for command queues, the contexts they exist in, and so on. So, when using the asynchronous API (correctly), even clRelease()ing everything won't guarantee that any actual memory release occurs until the commands in flight (which reference their kernel object, memory objects and command queue) have completed.
If I'm reading your post correctly, you're worried that instead of clearing resources as commands complete, the Intel OpenCL implementation for some reason waits for a later time. I can assure you that's not the case. However, it's possible your host code submits work faster than we can process it, which would naturally lead to increased memory consumption to the point of running out of resources.
As for your other question, "Right now it looks like Intel's OpenCL implementation cannot execute code while the main CPU is busy. Could you confirm that?", our CPU implementation runs threads on the CPU, so any application threads you're using would contend with them for HW resources. However, the expected behaviour is not complete starvation of the OpenCL implementation, nor starvation of the host code. Could you clarify your question?
OK, I've modified the code by inserting a finish() call at the end, but memory is still leaking somewhere. For me it takes about 37 MB of memory after all initializations, but before any kernel enqueueing. Peak memory usage is about 120 MB while working and before finishing the queue. After calling finish() it takes 60 MB, and the more iterations we have, the bigger the difference between the starting and ending values. Where does this memory go, and how can we clear it? For our project it's really necessary to launch billions of not-so-long threads.
In the real project we are enqueuing these subOperations through a kernel event callback, so calling finish() in that callback is not possible (it blocks the OpenCL thread; I've checked to be sure). Such a structure should not spam the queue, because it waits for the previous function block to complete. I'm thinking of creating another thread that will control the population of OpenCL work items and call finish() from time to time, but there's still the leak mentioned above.
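One non-blocking way to chain work from a callback, without calling finish() there, is to release each event as its work completes and enqueue the next block from the callback. A sketch using the C API (error checks omitted; `enqueueNext` and `AppState` are hypothetical names for the app's own continuation logic, and this needs a real device to run):

```c
void CL_CALLBACK onComplete(cl_event ev, cl_int status, void *userData) {
    /* Runs when the previous block of work has finished.  Do NOT
       call clFinish here -- enqueue the next block and return. */
    clReleaseEvent(ev);                 /* let the runtime reclaim it */
    enqueueNext((AppState *)userData);  /* hypothetical helper */
}

/* ...per block of work: */
cl_event done;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize,
                       NULL, 0, NULL, &done);
clSetEventCallback(done, CL_COMPLETE, onComplete, state);
clFlush(queue);
```

Forgetting the clReleaseEvent in such a chain is itself a common source of steadily growing memory, since every un-released event keeps its resources alive.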
Just to make sure I understand: the issue is that you take a snapshot of the memory consumption of the process just before enqueuing kernels, then enqueue all the kernels, wait for their completion, and then take a second snapshot and find that memory consumption has changed in a way that can't be explained by lazy reclamation of pages by the OS, yes?