Hi,
I am not very experienced as a C++ programmer, so forgive me if I have missed something obvious.
Please direct me somewhere else if this is the wrong forum.
I am on Windows 7 with Visual Studio 2013, using C:\Program Files (x86)\Intel\OpenCL SDK\4.4\include\CL\cl.hpp as the C++ interface to OpenCL.
Looking at cl::CommandQueue::enqueueMapBuffer, it seems the optional cl::Event* argument is cast directly to (cl_event *) and passed to the C API, unlike in most other similar functions in cl.hpp. This bypasses the assignment operator of class cl::Event, so clReleaseEvent is not called for the contained cl_event value before it is overwritten.
To me it seems the code should be changed as below (enqueueMapBuffer is the current version, enqueueMapBufferFix is my proposed fix):
```cpp
void* enqueueMapBuffer(
    const Buffer& buffer,
    cl_bool blocking,
    cl_map_flags flags,
    ::size_t offset,
    ::size_t size,
    const VECTOR_CLASS<Event>* events = NULL,
    Event* event = NULL,
    cl_int* err = NULL) const
{
    cl_int error;
    void* result = ::clEnqueueMapBuffer(
        object_, buffer(), blocking, flags, offset, size,
        (events != NULL) ? (cl_uint) events->size() : 0,
        (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
        (cl_event*) event,  // BUG: bypasses cl::Event::operator=, leaks the old cl_event
        &error);

    detail::errHandler(error, __ENQUEUE_MAP_BUFFER_ERR);
    if (err != NULL) {
        *err = error;
    }
    return result;
}

void* enqueueMapBufferFix(
    const Buffer& buffer,
    cl_bool blocking,
    cl_map_flags flags,
    ::size_t offset,
    ::size_t size,
    const VECTOR_CLASS<Event>* events = NULL,
    Event* event = NULL,
    cl_int* err = NULL) const
{
    cl_event tmp;
    cl_int error;
    void* result = ::clEnqueueMapBuffer(
        object_, buffer(), blocking, flags, offset, size,
        (events != NULL) ? (cl_uint) events->size() : 0,
        (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
        (event != NULL) ? &tmp : NULL,  // FIX: receive into a temporary ...
        &error);

    detail::errHandler(error, __ENQUEUE_MAP_BUFFER_ERR);
    if (err != NULL) {
        *err = error;
    }
    if (event != NULL && error == CL_SUCCESS)
        *event = tmp;  // ... so the assignment operator releases the old cl_event
    return result;
}
```
The problem I observe from this bug is a large increase in thread count when using the NVIDIA platform.
Output from my test program:
```
$ ./ReproduceBugInClHppEnqueueMapBuffer.exe

AMD Accelerated Parallel Processing OpenCL 1.2 AMD-APP (1445.5)
bug did not trigger for Tahiti
bug did not trigger for Intel(R) Xeon(R) CPU E5-1620 0 @ 3.60GHz

NVIDIA CUDA OpenCL 1.1 CUDA 6.5.12
BUG TRIGGERED for GeForce GTX 660

Intel(R) OpenCL OpenCL 1.2
bug did not trigger for Intel(R) Xeon(R) CPU E5-1620 0 @ 3.60GHz
```
The test program:
```cpp
#include <iostream>
#include <CL/cl.hpp>
#include <memory>
#include <tuple>
#include <vector>

using namespace std;
using namespace cl;

#if defined(_WIN32)
#include <windows.h>
#include <tlhelp32.h>

/** This function is originally copied from
 *  http://stackoverflow.com/questions/3749668/how-to-query-the-thread-count-of-a-process-using-the-regular-windows-c-c-apis/3750019#3750019
 */
int GetCurrentThreadCount()
{
    DWORD const id = GetCurrentProcessId();
    HANDLE const snapshot = CreateToolhelp32Snapshot(TH32CS_SNAPALL, 0);
    PROCESSENTRY32 entry = { 0 };
    entry.dwSize = sizeof(entry);
    BOOL ret = Process32First(snapshot, &entry);
    while (ret && entry.th32ProcessID != id) {
        ret = Process32Next(snapshot, &entry);
    }
    CloseHandle(snapshot);
    if (!ret)
        throw std::exception("GetCurrentThreadCount() Failed");
    return entry.cntThreads;
}
#else
static_assert(1 == 0, "Sorry, I only have a windows implementation of GetCurrentThreadCount()");
#endif // _WIN32

void perform_mapping_test(Device& device, int n, bool use_workaround = false)
{
    Context context(device);
    CommandQueue queue(context, device, 0, nullptr);
    Buffer buffer(context, CL_MEM_ALLOC_HOST_PTR, n * sizeof(cl_uint), nullptr, nullptr);
    Event e;
    std::vector<Event> events;

    // This statement does not trigger the bug: no cl_event has been assigned
    // to _e_ yet, so there is nothing to clReleaseEvent.
    cl_uint* p = (cl_uint*)queue.enqueueMapBuffer(buffer, false, CL_MAP_WRITE, 0, n * sizeof(cl_uint), nullptr, &e);
    events = { e };
    Event::waitForEvents(events);

    for (int i = 0; i < n; i++) {
        p[i] = i;
    }

    queue.enqueueUnmapMemObject(buffer, p, nullptr, &e);
    events = { e };
    Event::waitForEvents(events);

    if (use_workaround)
        e = nullptr; // This calls clReleaseEvent for the cl_event contained in
                     // _e_ as part of the assignment operator implementation.

    // Unless use_workaround was specified, this triggers the bug: _e_ contains
    // a cl_event that is forgotten/overwritten with no call to clReleaseEvent.
    p = (cl_uint*)queue.enqueueMapBuffer(buffer, false, CL_MAP_READ, 0, n * sizeof(cl_uint), nullptr, &e);
    events = { e };
    Event::waitForEvents(events);

    for (int i = 0; i < n; i++) {
        if (p[i] != (cl_uint)i)
            throw std::exception("UNEXPECTED ERROR! something unrelated(?) to the bug is very wrong.");
    }

    queue.enqueueUnmapMemObject(buffer, p, nullptr, &e);
    events = { e };
    Event::waitForEvents(events);
}

int main()
{
    vector<Platform> platforms;
    Platform::get(&platforms);
    for (Platform platform : platforms) {
        cout << endl << platform.getInfo<CL_PLATFORM_NAME>() << " "
             << platform.getInfo<CL_PLATFORM_VERSION>() << endl;
        vector<Device> devices;
        platform.getDevices(CL_DEVICE_TYPE_ALL, &devices);
        for (Device device : devices) {
            auto deviceName = device.getInfo<CL_DEVICE_NAME>();
            int initialCount = GetCurrentThreadCount();

            for (int i = 0; i < 10; i++)
                perform_mapping_test(device, 1024, true);
            int threadCountIncreaseForWorkaround = GetCurrentThreadCount() - initialCount;

            for (int i = 0; i < 10; i++)
                perform_mapping_test(device, 1024, false);
            int threadCountIncreaseWhenTryingToTriggerBug =
                GetCurrentThreadCount() - initialCount - threadCountIncreaseForWorkaround;

            if (threadCountIncreaseForWorkaround > 10)
                throw std::exception("UNEXPECTED ERROR! workaround failed.");
            if (threadCountIncreaseWhenTryingToTriggerBug > 10) {
                cout << "BUG TRIGGERED for " << deviceName << endl;
            } else {
                cout << "bug did not trigger for " << deviceName << endl;
            }
        }
    }
}
```
cl::CommandQueue::enqueueMapImage seems to have the same bug.
/Viktor
Hi Viktor,
I looked at your example, and it looks like there are two different issues:
1. If the Event object used for capturing the newly created cl_event in cl::CommandQueue::enqueueMapBuffer() already contains a cl_event, the previous cl_event value will be leaked.
I agree with your finding, and I agree that your proposed change will fix the problem. However, we at Intel cannot fix it, because this file is owned by the Khronos Group (https://www.khronos.org/opencl/). I'll forward your report to the relevant people to verify it and escalate the issue if required.
2. I tried to understand what your test program is doing, but I'm not sure I got it. It looks like you are running an enqueueMapBuffer test and checking the number of threads inside your process. Why do you assume that the process thread count is influenced by OpenCL API calls? It may or may not be; that is entirely dependent on the OpenCL driver implementation.
Hi Dmitry,
The large number of extra threads was simply an easy observation to make.
Presumably, every call to perform_mapping_test on the NVIDIA platform leaks a context, which in turn leaks six (I think) threads.
I was testing lots of different ways to run kernels and copy memory when I got a CL_OUT_OF_HOST_MEMORY error (I think).
Breaking into Visual Studio showed a huge number of threads, so I used the thread count to decide whether the bug was present while I searched for the cause of the error.
Let me know if I should report this somewhere else.
Thanks for your help
Viktor
Hi Viktor,
Extra threads are not an error; they are normal behaviour for most OpenCL implementations. The error is the lack of clReleaseEvent calls in the situation you described.
Hi,
Normally, when I create a context on the NVIDIA platform, six* new threads show up, and they disappear when the context object is destroyed (i.e., when clReleaseContext is called).
When I use a non-empty cl::Event as an argument to cl::CommandQueue::enqueueMapBuffer, these threads do not disappear but instead accumulate.
I believe that an event with a non-zero reference count causes the NVIDIA platform to keep the context alive even after there is no explicit reference to the context.
This means the bug in cl::CommandQueue::enqueueMapBuffer causes a leak of six threads every time I destroy an affected cl::Context instance.
I found the bug in enqueueMapBuffer while debugging this thread leak.
Viktor
* Don't quote me on the exact number. Possibly it is platform or command queue creation that spawns these threads rather than the context. There are also presumably exceptions, such as the first context created in a process, and so on.
