Bug in cl.hpp, enqueueMapBuffer does not properly call clReleaseEvent before overwriting the optional cl::Event argument.

Viktor_L_
Beginner

Hi,
I am not very experienced as a C++ programmer, so forgive me if I have missed something obvious.
Please direct me somewhere else if this is the wrong forum.

I am on Windows 7 with Visual Studio 2013.
I am using C:\Program Files (x86)\Intel\OpenCL SDK\4.4\include\CL\cl.hpp as the C++ interface to OpenCL.

Looking at cl::CommandQueue::enqueueMapBuffer, it seems the optional cl::Event* argument is cast directly to (cl_event*) and passed to the C API, unlike in most other similar functions in cl.hpp. This bypasses the assignment operator of class cl::Event, so clReleaseEvent is not called on the contained cl_event value before it is overwritten.
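For reference, this is roughly how I read the relevant assignment operator in detail::Wrapper<> (which cl::Event derives from) in cl.hpp; simplified from memory, so don't take the exact code as authoritative:

    // cl::Event is a detail::Wrapper<cl_event>. Assigning a raw cl_event to it
    // first releases the handle it already holds (release() maps to clReleaseEvent
    // for cl_event), then takes ownership of the new one.
    Wrapper<cl_type>& operator = (const cl_type &rhs)
    {
        if (object_ != NULL) { release(); }
        object_ = rhs;
        return *this;
    }

Casting the cl::Event* straight to (cl_event*) and handing it to the C API overwrites object_ without that release ever running.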

To me it seems the code should be changed as below (first the current code from cl.hpp, then a fixed version):

 

    void* enqueueMapBuffer(
        const Buffer& buffer,
        cl_bool blocking,
        cl_map_flags flags,
        ::size_t offset,
        ::size_t size,
        const VECTOR_CLASS<Event>* events = NULL,
        Event* event = NULL,
        cl_int* err = NULL) const
    {
        cl_int error;
        void * result = ::clEnqueueMapBuffer(
            object_, buffer(), blocking, flags, offset, size,
            (events != NULL) ? (cl_uint) events->size() : 0,
            (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
            (cl_event*) event,   // any cl_event already held by *event is overwritten here without clReleaseEvent
            &error);

        detail::errHandler(error, __ENQUEUE_MAP_BUFFER_ERR);
        if (err != NULL) {
            *err = error;
        }
        return result;
    }

    void* enqueueMapBufferFix(
        const Buffer& buffer,
        cl_bool blocking,
        cl_map_flags flags,
        ::size_t offset,
        ::size_t size,
        const VECTOR_CLASS<Event>* events = NULL,
        Event* event = NULL,
        cl_int* err = NULL) const
    {
        cl_event tmp;
        cl_int error;
        void * result = ::clEnqueueMapBuffer(
            object_, buffer(), blocking, flags, offset, size,
            (events != NULL) ? (cl_uint)events->size() : 0,
            (events != NULL && events->size() > 0) ? (cl_event*)&events->front() : NULL,
            (event != NULL) ? &tmp : NULL,
            &error);

        detail::errHandler(error, __ENQUEUE_MAP_BUFFER_ERR);
        if (err != NULL) {
            *err = error;
        }

        if (event != NULL && error == CL_SUCCESS)
            *event = tmp;   // assignment releases the previously held cl_event before taking ownership of tmp

        return result;
    }





The problem I see from this bug is a large increase in thread count when using the NVIDIA platform.

Output from my test program:

 

$ ./ReproduceBugInClHppEnqueueMapBuffer.exe 

AMD Accelerated Parallel Processing   OpenCL 1.2 AMD-APP (1445.5)
bug did not trigger for Tahiti
bug did not trigger for        Intel(R) Xeon(R) CPU E5-1620 0 @ 3.60GHz

NVIDIA CUDA   OpenCL 1.1 CUDA 6.5.12
BUG TRIGGERED for GeForce GTX 660

Intel(R) OpenCL   OpenCL 1.2 
bug did not trigger for        Intel(R) Xeon(R) CPU E5-1620 0 @ 3.60GHz

 

The test program:

 

#include <iostream>
#include <CL/cl.hpp>
#include <memory>
#include <stdexcept> // std::runtime_error
#include <tuple>
#include <vector>

using namespace std;
using namespace cl;

#if defined(_WIN32)

#include <windows.h>
#include <tlhelp32.h>

/**
This function is originally copied from http://stackoverflow.com/questions/3749668/how-to-query-the-thread-count-of-a-process-using-the-regular-windows-c-c-apis/3750019#3750019
*/
int GetCurrentThreadCount()
{
    DWORD const  id = GetCurrentProcessId();
    HANDLE const  snapshot = CreateToolhelp32Snapshot(TH32CS_SNAPALL, 0);
    PROCESSENTRY32 entry = { 0 };
    entry.dwSize = sizeof(entry);
    BOOL  ret = true;
    ret = Process32First(snapshot, &entry);
    while (ret && entry.th32ProcessID != id) {
        ret = Process32Next(snapshot, &entry);
    }
    CloseHandle(snapshot);
    if (!ret) throw std::runtime_error("GetCurrentThreadCount() failed");
    return entry.cntThreads;
}
#else
static_assert(1 == 0, "Sorry, I only have a windows implementation of GetCurrentThreadCount()");
#endif // _WIN32

void perform_mapping_test(Device & device, int n, bool use_workaround = false)
{
    Context context(device);
    CommandQueue queue(context, device, 0, nullptr);
    Buffer buffer(context, CL_MEM_ALLOC_HOST_PTR, n * sizeof(cl_uint), nullptr, nullptr);
    Event e;
    std::vector<Event> events;

    // This statement does not trigger the bug, because there is no cl_event assigned to _e_ there is no need to call clReleaseEvent.
    cl_uint * p = (cl_uint *)queue.enqueueMapBuffer(buffer, false, CL_MAP_WRITE, 0, n * sizeof(cl_uint), nullptr, &e);

    events = { e };
    Event::waitForEvents(events);

    for (int i = 0; i < n; i++)
    {
        p[i] = (cl_uint)i;
    }

    queue.enqueueUnmapMemObject(buffer, p, nullptr, &e);
    events = { e };
    Event::waitForEvents(events);

    if (use_workaround) e = nullptr; // This will call clReleaseEvent for the cl_event contained in _e_ as part of the assignment operator implementation.

    // Unless use_workaround was specified, this will trigger the bug: _e_ contains a cl_event that is forgotten/overwritten with no call to clReleaseEvent.
    p = (cl_uint *)queue.enqueueMapBuffer(buffer, false, CL_MAP_READ, 0, n * sizeof(cl_uint), nullptr, &e);

    events = { e };
    Event::waitForEvents(events);

    for (int i = 0; i < n; i++)
    {
        if (p[i] != (cl_uint)i) throw std::runtime_error("UNEXPECTED ERROR! something unrelated(?) to the bug is very wrong.");
    }

    queue.enqueueUnmapMemObject(buffer, p, nullptr, &e);
    events = { e };
    Event::waitForEvents(events);
}

int main()
{
    vector<Platform> platforms;
    Platform::get(&platforms);
    for (Platform platform : platforms)
    {
        cout << endl << platform.getInfo<CL_PLATFORM_NAME>() << "   " << platform.getInfo<CL_PLATFORM_VERSION>() << endl;
        vector<Device> devices;
        platform.getDevices(CL_DEVICE_TYPE_ALL, &devices);
        for (Device device : devices)
        {
            auto deviceName = device.getInfo<CL_DEVICE_NAME>();
            int initialCount = GetCurrentThreadCount();

            for (int i = 0; i < 10; i++) perform_mapping_test(device, 1024, true);
            int threadCountIncreaseForWorkaround = GetCurrentThreadCount() - initialCount;

            for (int i = 0; i < 10; i++) perform_mapping_test(device, 1024, false);
            int threadCountIncreaseWhenTryingToTriggerBug = GetCurrentThreadCount() - initialCount - threadCountIncreaseForWorkaround;

            if (threadCountIncreaseForWorkaround > 10) throw std::runtime_error("UNEXPECTED ERROR! workaround failed.");
            if (threadCountIncreaseWhenTryingToTriggerBug > 10)
            {
                cout << "BUG TRIGGERED for " << deviceName << endl;
            }
            else
            {
                cout << "bug did not trigger for " << deviceName << endl;
            }
        }
    }
}


cl::CommandQueue::enqueueMapImage seems to have the same bug.
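If it helps, here is what I would expect the analogous fix for enqueueMapImage to look like, following the same pattern as enqueueMapBufferFix above. I have not tested this one, and the parameter list is just how I read it in cl.hpp, so please double-check against the header:

    void* enqueueMapImageFix(
        const Image& buffer,
        cl_bool blocking,
        cl_map_flags flags,
        const size_t<3>& origin,
        const size_t<3>& region,
        ::size_t* row_pitch,
        ::size_t* slice_pitch,
        const VECTOR_CLASS<Event>* events = NULL,
        Event* event = NULL,
        cl_int* err = NULL) const
    {
        cl_event tmp;
        cl_int error;
        void * result = ::clEnqueueMapImage(
            object_, buffer(), blocking, flags,
            (const ::size_t*) origin, (const ::size_t*) region,
            row_pitch, slice_pitch,
            (events != NULL) ? (cl_uint) events->size() : 0,
            (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
            (event != NULL) ? &tmp : NULL,   // write into a temporary instead of casting the cl::Event
            &error);

        detail::errHandler(error, __ENQUEUE_MAP_IMAGE_ERR);
        if (err != NULL) {
            *err = error;
        }

        if (event != NULL && error == CL_SUCCESS)
            *event = tmp;                    // assignment releases the old cl_event first

        return result;
    }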

/Viktor

4 Replies
Dmitry_K_Intel
Employee
(Accepted solution)

Hi Viktor,

I looked at your example, and it looks like there are two different issues:

1. If the Event object used for capturing the newly created cl_event in cl::CommandQueue::enqueueMapBuffer() already contains some cl_event, the previous cl_event value will be leaked.

I agree with your finding and that your proposed fix will solve the problem. However, we at Intel cannot fix it, because this file is owned by the Khronos Group (https://www.khronos.org/opencl/). I'll forward your report to the relevant people to verify it and escalate the issue if required.

2. I tried to understand what your test program is doing, but I'm not sure I got it. It looks like you are running an enqueueMapBuffer test and checking the number of threads in your process. Why do you assume that the process thread count is influenced by OpenCL API calls? It may or may not be; that depends entirely on the OpenCL driver implementation.

 

Viktor_L_
Beginner

Hi Dmitry,

The large number of extra threads was simply an easy thing to observe.
Presumably every call to perform_mapping_test on the NVIDIA platform leaks a context, which in turn leaks 6 (I think) threads.
I was testing lots of different ways to run kernels and copy memory when I got a CL_OUT_OF_HOST_MEMORY error (I think).
Breaking into Visual Studio showed a huge number of threads, so while searching for the cause of that error I used the thread count to tell whether the bug had triggered.

Let me know if I should report this somewhere else.

Thanks for your help
Viktor

Dmitry_K_Intel
Employee

Hi Viktor,

Extra threads are not an error - they are normal behaviour for most OpenCL implementations. The error is the lack of clReleaseEvent calls in the case you described.

 

Viktor_L_
Beginner

Hi,

Normally, when I create a context on the NVIDIA platform, 6* new threads show up, and they disappear again when the context object is destroyed (i.e. clReleaseContext is called).
When I use a non-empty cl::Event as the event argument to cl::CommandQueue::enqueueMapBuffer, these threads do not disappear but instead accumulate.
I believe that an event with a non-zero reference count causes the NVIDIA platform to keep the context alive even after there is no explicit reference to the context.
This means the bug in cl::CommandQueue::enqueueMapBuffer causes a leak of 6 threads every time I destroy an affected cl::Context instance.

I found the bug in enqueueMapBuffer while debugging this thread leak.
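
Until cl.hpp itself is fixed, my workaround is simply to clear the cl::Event before reusing it as an output argument, as in the test program above. Roughly like this (a sketch only, using the same queue, buffer, and n as in perform_mapping_test):

    cl::Event e;
    cl_uint* p = (cl_uint*) queue.enqueueMapBuffer(buffer, CL_FALSE, CL_MAP_WRITE, 0, n * sizeof(cl_uint), nullptr, &e);
    e.wait();

    queue.enqueueUnmapMemObject(buffer, p, nullptr, &e);
    e.wait();

    e = nullptr; // the assignment operator releases the cl_event still held from the unmap call
    p = (cl_uint*) queue.enqueueMapBuffer(buffer, CL_FALSE, CL_MAP_READ, 0, n * sizeof(cl_uint), nullptr, &e);
    e.wait();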

Viktor


 * Don't quote me on the exact number. Possibly it is platform or command queue creation that spawns these threads rather than the context. There are presumably also exceptions, e.g. for the first context created in the process.
