First, I want to thank you

Ben_Rush · ‎04-25-2016

In our project we mix Microsoft .NET code with native code, and we're trying to speed up areas using OpenCL. Here is a block of code I'm working on:

array<System::Byte>^ OpenCLBase::DoIt(array<System::UInt16>^ toDo, int maxDiff, int width, int height)
{
    array<System::Byte>^ retManaged = gcnew array<System::Byte> (toDo->Length); 

    pin_ptr<unsigned char> retPinned = &retManaged[0];
    unsigned char* retBuffer = retPinned;

    pin_ptr<unsigned short> managedPin = &toDo[0];
    unsigned short* pinnedData = managedPin;

    cl_int error; 
    int bufferSizeInBytes = 512 * 424 * 2; 

    cl_mem inputBuffer = clCreateBuffer(_context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
        bufferSizeInBytes, pinnedData, &error);

    /* IF THIS IS UNCOMMENTED, THEN 'V' BELOW IS 55. AND THE PERFORMANCE IS HIGH. */
    /* ========================================================================== */
    cl_mem outputBuffer = clCreateBuffer(_context, CL_MEM_WRITE_ONLY,
        retManaged->Length, NULL, &error);
    /* ========================================================================== */

    /* IF THIS IS UNCOMMENTED, THEN 'V' BELOW IS 55. THE PERFORMANCE IS NOT AS GOOD. */
    /* ============================================================================= */
    //unsigned char* mask = (unsigned char*)_aligned_malloc(512 * 424, 4096);
    //cl_mem outputBuffer = clCreateBuffer(_context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
    //    retManaged->Length, mask, &error);
    /* ============================================================================= */

    /* IF THIS IS UNCOMMENTED, THEN 'V' BELOW IS 0. */
    /* ============================================ */
    //cl_mem outputBuffer = clCreateBuffer(_context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
    //    retManaged->Length, retBuffer, &error);
    /* ============================================ */

    if (error == CL_SUCCESS)
    {
        cl_event clearCompleted;

        void* mappedBuffer = ::clEnqueueMapBuffer(
            _queue, // the command queue
            outputBuffer, // the output buffer
            CL_TRUE, // can't be unmapped before read
            CL_MAP_WRITE, // mapped for writing. 
            0, // no offset
            retManaged->Length, // the size
            0, // no events on the waiting list
            NULL, // no event list
            &clearCompleted, // event to wait on
            &error); 

        error = clWaitForEvents(1, &clearCompleted);

        if (error == CL_SUCCESS)
        {
            error = clSetKernelArg(_kernel, 0, sizeof(cl_mem), (void*)&inputBuffer);
            error |= clSetKernelArg(_kernel, 1, sizeof(cl_mem), (void*)&outputBuffer);
            error |= clSetKernelArg(_kernel, 2, sizeof(height), &height);
            error |= clSetKernelArg(_kernel, 3, sizeof(width), &width);
            error |= clSetKernelArg(_kernel, 4, sizeof(maxDiff), &maxDiff);

            if (error == CL_SUCCESS)
            {
                size_t workgroupDims[2];
                workgroupDims[0] = width;
                workgroupDims[1] = height;

                //https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clEnqueueNDRangeKernel.html

                DWORD dwStart = ::GetTickCount(); 
                error = clEnqueueNDRangeKernel(_queue, // the queue
                    _kernel, // the kernel
                    2, // the number of dimensions of this work-group
                    NULL, // always NULL
                    workgroupDims, // the dimensons of the work group
                    NULL, // # of work-items in a work group. NULL = letting OpenCL figure it out. 
                    0, 0, 0); // no events to wait on, nor are we waiting. 
                clFinish(_queue); // blocks until the queue has finished 

                char* test = (char*)mappedBuffer; 
                char v = test[0]; 

                // v is 55. 
            }
        }

        clReleaseMemObject(inputBuffer); 
        clReleaseMemObject(outputBuffer); 
    }

    return retManaged;
}

I also have a toy kernel which just sets the memory to 55 (ignore the extra parameters as I had to shave this kernel function down to illustrate my point):

__kernel void FindEdges(__global ushort* iterateValues, 
    __global char* writeValues,
    int height, int width, int maxDiff)
{
    const int x     = get_global_id(0);
    const int y     = get_global_id(1);
    const int stride = get_global_size(0);

    int i = y*stride+x; 

    ushort val = iterateValues; 
    int minval = val - maxDiff;
    int maxval = val + maxDiff;

    writeValues = 55; 
}

For those who maybe don't know much about Microsoft's managed C++, it allows you to write native code that interacts closely with "managed" .NET code. If you want to access raw memory from managed C++, then you need to "pin" it. This prevents Microsoft's garbage collector from moving the memory around and causing issues with native code that expects memory to stay in one spot. You can see what I'm doing if you look at " pin_ptr<unsigned char>" in the above code (pin_ptr<> is a pinned pointer). To access the raw memory, one then just casts the pinned pointer to a native type. This is a snippet from above that shows this:

    pin_ptr<unsigned char> retPinned = &retManaged[0];
    unsigned char* retBuffer = retPinned;

Problem is, if I want to use clEnqueueMapBuffer and map this pointer, it doesn't appear to work. I use the value of "char v = test[0]" above to note whether it's working. In the code above I have three regions that can be commented/uncommented. If I want everything to run fine, then I uncomment this code:

   cl_mem outputBuffer = clCreateBuffer(_context, CL_MEM_WRITE_ONLY,
        retManaged->Length, NULL, &error);

The "char v" at the bottom of the code snippet above is 55 (which is expected). If I uncomment this,

unsigned char* mask = (unsigned char*)_aligned_malloc(512 * 424, 4096);
cl_mem outputBuffer = clCreateBuffer(_context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
    retManaged->Length, mask, &error);

...then I also get expected behavior. But it's slower (as a side note, why is this? Why is using CL_MEM_USE_HOST_PTR slower than in my first example?).

However, if I uncomment this, then nothing works at all. :

cl_mem outputBuffer = clCreateBuffer(_context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
    retManaged->Length, retBuffer, &error);

Meaning the "char v" at the bottom of the code is uninitialized, almost as if the memory never got set.

I'm really new at this, but I've poured over the code and can't find anything I'm doing wrong. I've tried executing clCreateBuffer against natively allocated (malloc) buffers that aren't page aligned before and it worked, so I'm not thinking it's due to that (maybe it is still)?

The width/height are 512/424, respectively.

Any help would be appreciated.

Robert_I_Intel · ‎04-25-2016

Hi Ben,

First the usual questions:

1. Which processor?

2. Which OS?

3. Which graphics driver version?

4. Are you running on the CPU device or GPU device?

5. Could you possibly provide a small reproducer that I can run?

What is the value of retManaged->Length? Is it a multiple of 64 bytes? That is an additional requirement for high performance in addition to 4096 alignment.

Technically, the code above is incorrect: you need to enqueue kernel first then enqueue Map buffer for reading - your first clEnqueueMapBuffer call does not clear anything: if you want to clear, you will need to do so right after clEnqueueMapBuffer succeeded. You also don't need to call clWaitForEvents, since you have a blocking call (third parameter is CL_TRUE).

But, in case where you use __align_malloc, that's the pointer that you need to clear.

1. clear mask - you already got it - no need to Map/Unmap

2. clEnqueueNDRange

3. mask = clEnqueueMapBuffer for read - a blocking call, so no need to wait for events after that

4. read mask - you should get 55 all over

5. clEnqueueUnmapBuffer - you could make it a blocking call too.

Hope this helps.

Ben_Rush · ‎04-26-2016

First, I want to thank you for (continuing) to provide such excellent support on this group. You're truly the most helpful person I've encountered in this area. Whatever your salary is at Intel, you should tell them to double it ;)

Second, here are answers to the bullet items above:

1. Intel Core i7-6700 @ 4 GHz. Skylake.
2. Windows 10, 640bit.
3. According to Device Manager, the driver date for the Intel Graphics 530 is 20.19.15.4380.
4. GPU
5. I will work on something that you can reproduce.

Okay. So I followed your instructions and got everything working. So perhaps I have a fundamental misunderstanding of when I need to use the map/unmap read/write buffer calls. Let me ask a few questions.

It was my understanding that when using graphics devices which do not have shared memory that I need to call clEnqueueWriteBuffer for every buffer not created automatically on the device itself, before actually enqueuing the kernel, right? However, if I call

cl_mem outputBuffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                bufferSize, NULL, &error);

Then I shouldn't need to clEnqueueWriteBuffer because that buffer is being created on the device itself; but I will need to call clEnqueueReadBuffer to get it. Right? So this makes sense when it comes to devices which need to transfer data over the bus. I guess the paradigm is confusing to me when there's shared memory...

When there's shared memory, why do I even need to call the clEnqueueMapBuffer at all? The pointer returned is the same pointer as that which is passed to clCreateBuffer. So the pointer "retBuffer" below....

    cl_mem outputBuffer = clCreateBuffer(_context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
        retManaged->Length, retBuffer, &error);

Is actually the same pointer as "mappedBuffer" below.....

                void* mappedBuffer = ::clEnqueueMapBuffer(
                    _queue, // the command queue
                    outputBuffer, // the output buffer
                    CL_TRUE, // can't be unmapped before read
                    CL_MAP_READ, // mapped for writing. 
                    0, // no offset
                    retManaged->Length, // the size
                    0, // no events on the waiting list
                    NULL, // no event list
                    &clearCompleted, // event to wait on
                    &error);

Am I making sure internal bookeeping is done or making sure the memory block is ready to be accessed by invoking clEnqueueMapBuffer?

Robert_I_Intel · ‎04-26-2016

Hi Ben,

Thanks for the kind words! I will send your comments to my manager: let's see if I get any kind of a raise :)

In the Shared Physical Memory case, clEnqueueMapBuffer acts as a synchronization point: once it succeeds, you know that the device finished writing to the buffer and it is all ready for consumption on the host. Also note that you don't have to necessarily map the whole buffer - you could map a portion of it so that mappedBuffer is not equal to retBuffer in that case.

So, for Intel(R) Processor Graphics, you should avoid read/write buffer calls and use map/unmap instead. Also, when creating buffers either use CL_MEM_USE_HOST_PTR flag or CL_MEM_ALLOC_HOST_PTR flag. In the latter case, you will need to map the buffer to initialize its content.

Strange behavior when using "managed c++" memory.