Attempting to find the best way to integrate Intel's OpenCL in our product

Ben_Rush · 02-26-2017

I hope this topic is in line with the theme of this forum. I'll try to make it brief.

We have a computer vision product that, to date, hasn't taken advantage of the GPU. We're running on Intel NUCs and have started pushing the envelope in terms of what we can get away with on the CPU alone. So I've been looking into taking advantage of the GPU. However, certain architectural "quirks" in our SDK have made doing so difficult, and I'm wondering if maybe I'm missing or misunderstanding some OpenCL fundamentals or capabilities that would get around them.

Our product is in Microsoft C#, so it's written in .NET. I have been experimenting with an open source product called Cloo which wraps OpenCL 1.2 in C# (though I'm a perfectly capable C/C++ developer and can use something else if needed). Our product works like this:

1. A frame comes off the camera (it's a Microsoft Kinect).
2. We execute "sensors" against the frame data on multiple threads (each sensor computes different things), and sometimes a sensor can take a few frames to complete.
3. We collect the results of the multiple threads (if they're ready; if not, we continue and check on the next frame - this is an oversimplification, but it is good enough) and update some state variables.
4. We continue.

Fundamentally, though, the problem with how we've implemented our product is that we reallocate arrays and buffers all over the place. For better or worse, it's a major theme in our design. Part of this is because our "sensors" may need to hold onto data from a frame that was generated many frames ago (because one sensor may take longer than the others). So wiping arrays of points, etc. can corrupt data. Again, this theme persists throughout our product, so it impacts more than just point arrays.

To use OpenCL smartly on the Intel processor, it's my understanding I need to "pin" the data and then map/unmap it to read/write to it. On several simpler test programs I've been able to get this to work fine. However, because our real product reallocates arrays all the time, I'm forced to incur the overhead of creating the memory buffers, pinning, and then deallocating the memory buffers for each frame, and the overhead of doing so kills any performance gains we get from executing on the GPU.

Here's a code sample:

public static Point3D[] ComputePoints(CameraFrame frame, float[] M, float[] b)

ComputePoints takes a new camera frame (with brand new buffers internally) along with the transformation matrices M and b (which can be new arrays as well). Within ComputePoints I need to allocate new "ComputeBuffers" (as they're called in Cloo) EACH TIME. The array of Point3D objects that is returned is a new array as well (since a sensor may want to hold onto the reference for many frames).

I've been able to put routines on the GPU, but the overhead is too high because I have to do this each time I want to run anything:

        // Input frame data: a new buffer is created and frame.RawData is copied into it.
        ComputeBuffer<ushort> inputBuffer = new ComputeBuffer<ushort>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
            frame.RawData);
        _kernel.SetMemoryArgument(0, inputBuffer);

        // Output points: the freshly allocated array is handed over as the buffer's storage.
        Point3D[] ret = new Point3D[frame.Width * frame.Height];
        ComputeBuffer<Point3D> outputBuffer = new ComputeBuffer<Point3D>(_context,
            ComputeMemoryFlags.WriteOnly | ComputeMemoryFlags.UseHostPointer,
            ret);
        _kernel.SetMemoryArgument(1, outputBuffer);

        // Transformation parameters: copied into new buffers on every call.
        ComputeBuffer<float> mBuffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
            M);
        _kernel.SetMemoryArgument(2, mBuffer);

        ComputeBuffer<float> bBuffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
            b);
        _kernel.SetMemoryArgument(3, bBuffer);

Any advice? At the moment, redoing our design is going to be tough. Are there features in OpenCL 2.0 that would make this easier? Of course, this is all on an Intel NUC, so OpenCL 2.0 features like shared virtual memory may not be supported? This is our model: http://www.intel.com/content/www/us/en/nuc/nuc-kit-d54250wyk-board-d54250wyb.html

Michal_M_Intel · 02-27-2017

clCreateBuffer + CL_MEM_COPY_HOST_PTR does the following things:

- it allocates memory

- it reserves GPU address range

- it copies the data from input host_ptr to the allocated storage

All of those operations are time-consuming.

There is another way to create a buffer from an input host_ptr pointer:

clCreateBuffer + host_ptr + CL_MEM_USE_HOST_PTR

In this mode, if host_ptr meets the requirements (it is aligned to the cache line size or, even better, to the page size) and the size passed is a multiple of the cache line size, then the OpenCL driver will use the incoming host_ptr as the storage of the buffer without any copies. If the requirements are not met, the driver will create an internal copy of the memory.
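As a rough sketch of that check, host code could test a pointer and size before passing them with CL_MEM_USE_HOST_PTR. This assumes 4096-byte pages and 64-byte cache lines (hard-coded here, not queried from the device), and the helper names zc_round_up, zc_is_candidate and zc_alloc are made up for illustration. It does not call OpenCL itself, and whether the driver really skips the copy still depends on the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed values, not queried from a device. */
#define ZC_PAGE_SIZE  4096u  /* assumed page size in bytes */
#define ZC_CACHE_LINE 64u    /* assumed cache line size in bytes */

/* Round n up to the next multiple of m (m must be a power of two). */
size_t zc_round_up(size_t n, size_t m)
{
    return (n + m - 1) & ~(m - 1);
}

/* True if (ptr, size) is page-aligned and a whole number of cache lines,
   i.e. it meets the conditions described above for avoiding a copy. */
bool zc_is_candidate(const void *ptr, size_t size)
{
    return ptr != NULL
        && size != 0
        && ((uintptr_t)ptr % ZC_PAGE_SIZE) == 0
        && (size % ZC_CACHE_LINE) == 0;
}

/* Allocate a host buffer that satisfies zc_is_candidate.
   The padded size is returned through *padded; release with free().
   Uses C11 aligned_alloc, which requires the size to be a multiple
   of the alignment, hence the rounding to a whole page. */
void *zc_alloc(size_t bytes, size_t *padded)
{
    assert(bytes > 0);
    size_t total = zc_round_up(bytes, ZC_PAGE_SIZE);
    *padded = total;
    return aligned_alloc(ZC_PAGE_SIZE, total);
}
```

A buffer obtained from zc_alloc would then be passed as host_ptr, with the padded size, to clCreateBuffer. With Microsoft's C runtime, aligned_alloc is not available; _aligned_malloc and _aligned_free would be the likely substitutes.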

More information is available here:

https://software.intel.com/en-us/articles/getting-the-most-from-opencl-12-how-to-increase-performance-by-minimizing-buffer-copies-on-intel-processor-graphics