This article gives recommendations on how to use OpenCL properly to achieve zero-copy behavior on Intel HD Graphics. In particular, it recommends using CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR in the following cases:
- You want the OpenCL runtime to handle the size and alignment requirements.
- In cases when you may be reading or writing data from a file or another I/O stream and aren't allowed to write to the buffer you are given.
- Buffer is not already in a properly aligned and sized allocation and you want it to be.
- You are okay with the performance cost of the copy relative to the length of time your application executes, for example at initialization.
- Porting existing application code where you don't know if it has been aligned and sized properly.
- The buffer used to create the OpenCL buffer needs the data to be unmodified and you want to write to the buffer.
But what is the point of using CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR in the cases above? Why can't we just use CL_MEM_COPY_HOST_PTR? Intel HD Graphics doesn't have its own memory, so a new buffer will definitely be allocated in RAM. And it seems that CL_MEM_COPY_HOST_PTR already takes care of alignment and size (which seems reasonable).
The only argument I could think of is that some Intel HD Graphics parts do have a relatively small memory of their own, and that CL_MEM_ALLOC_HOST_PTR guarantees the allocation is made in RAM. That doesn't seem very convincing, so maybe I am missing something about CL_MEM_ALLOC_HOST_PTR's behavior.
You're right that the physical memory used by buffers on the CPU and GPU side shares the same hardware. However, you don't get a common address space without OpenCL 2.0 SVM. In OpenCL 1.x the host and device addresses can be different, but using the same physical hardware efficiently is possible with the models described below plus map/unmap.
CL_MEM_USE_HOST_PTR:
Use this when an aligned buffer already exists on the host side. It must be aligned to a 4096-byte boundary and its size must be a multiple of 64 bytes, or you don't actually get zero copy. Below, clCreateBuffer takes a host address and returns a cl_mem handle for the corresponding device-side buffer.
ocl.srcA = clCreateBuffer(ocl.context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, sizeof(cl_uint) * arrayWidth * arrayHeight, inputA, &err);
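To make the two CL_MEM_USE_HOST_PTR rules concrete, here is a minimal sketch in plain C (no OpenCL calls) that checks a host pointer against the 4096-byte alignment and 64-byte size multiple quoted above, and allocates a buffer that meets both. The names round_up, is_zero_copy_candidate and alloc_zero_copy_candidate are hypothetical helpers made up for illustration, and C11 aligned_alloc is assumed to be available on your platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ZC_ALIGNMENT     4096u  /* base-address alignment quoted in the text */
#define ZC_SIZE_MULTIPLE 64u    /* size granularity quoted in the text */

/* Round n up to the next multiple of m (m must be a power of two). */
static size_t round_up(size_t n, size_t m)
{
    return (n + m - 1) & ~(m - 1);
}

/* Returns 1 if (ptr, size) meets both rules from the text, 0 otherwise. */
static int is_zero_copy_candidate(const void *ptr, size_t size)
{
    return ptr != NULL
        && size != 0
        && ((uintptr_t)ptr % ZC_ALIGNMENT) == 0
        && (size % ZC_SIZE_MULTIPLE) == 0;
}

/* Allocates a host buffer that satisfies both rules.
 * *padded_size receives the size to pass to clCreateBuffer
 * (requested size rounded up to a multiple of 64 bytes).
 * The caller releases the buffer with free(). */
static void *alloc_zero_copy_candidate(size_t requested, size_t *padded_size)
{
    assert(requested > 0);
    *padded_size = round_up(requested, ZC_SIZE_MULTIPLE);
    /* C11 aligned_alloc expects the size to be a multiple of the alignment,
     * so the allocation itself is padded to a whole number of 4096-byte pages. */
    return aligned_alloc(ZC_ALIGNMENT, round_up(requested, ZC_ALIGNMENT));
}
```

A buffer obtained this way, together with the padded size, is the kind of input that the CL_MEM_USE_HOST_PTR call above expects. The check only covers the two rules mentioned in this thread, so it is a necessary condition rather than a guarantee of zero copy.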
CL_MEM_ALLOC_HOST_PTR:
This mode creates a buffer automatically *in device address space*. When calling clCreateBuffer in this mode, the host_ptr argument must be NULL.
ocl.srcA = clCreateBuffer(ocl.context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, sizeof(cl_uint) * arrayWidth * arrayHeight, NULL, &err);
If you want to use data from the host you will also need to map/unmap as with CL_MEM_USE_HOST_PTR above.
CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR:
As the article indicates, this is for cases when you don't control host buffer alignment (for example, using someone else's library). It can be more efficient than CL_MEM_USE_HOST_PTR for unaligned data. In this case, the host address is required to generate the GPU/device side address, but underneath a copy is made so it isn't as efficient as the CL_MEM_USE_HOST_PTR scenario where the application provides optimal host addresses.
ocl.srcA = clCreateBuffer(ocl.context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR, sizeof(cl_uint) * arrayWidth * arrayHeight, inputA, &err);
But what if I create a buffer with CL_MEM_COPY_HOST_PTR only (without CL_MEM_ALLOC_HOST_PTR) and then, after kernel execution, call clEnqueueMapBuffer to get a pointer p_data in the host address space? Will clEnqueueMapBuffer make one more copy of the buffer in that case? In other words, will accessing the data through p_data cause additional copying?
The 3 options listed in the article are the recommended approaches:
- If you already have aligned host buffers, use CL_MEM_USE_HOST_PTR (no copies/additional buffers here if all the rules are met)
- If it makes sense for the device (GPU) to do the allocation, use CL_MEM_ALLOC_HOST_PTR
- If you need the GPU to work on unaligned data, use CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR
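The three options above can be sketched as a small decision helper. This is plain C with no OpenCL calls. The enum and the function choose_buffer_strategy are hypothetical names made up for illustration, and the 4096/64 thresholds are the ones quoted earlier in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the three recommended options.
 * These are not OpenCL identifiers; the matching flags are in the comments. */
typedef enum {
    BUF_USE_HOST_PTR,    /* CL_MEM_USE_HOST_PTR */
    BUF_ALLOC_HOST_PTR,  /* CL_MEM_ALLOC_HOST_PTR with host_ptr == NULL */
    BUF_ALLOC_AND_COPY   /* CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR */
} buffer_strategy;

/* Picks one of the three options for a host buffer of `size` bytes.
 * Pass host_ptr == NULL when there is no existing host data. */
static buffer_strategy choose_buffer_strategy(const void *host_ptr, size_t size)
{
    assert(size > 0);  /* zero-sized buffers are not valid */

    if (host_ptr == NULL)
        return BUF_ALLOC_HOST_PTR;  /* let the runtime do the allocation */

    /* Existing data that already meets the alignment and size rules
     * can be wrapped in place. */
    if ((uintptr_t)host_ptr % 4096 == 0 && size % 64 == 0)
        return BUF_USE_HOST_PTR;

    /* Existing data that breaks either rule: accept one copy into
     * a runtime-allocated buffer. */
    return BUF_ALLOC_AND_COPY;
}
```

The result would then pick which of the three clCreateBuffer calls shown earlier to make. Whether wrapping in place is actually the best choice also depends on whether the application is allowed to keep the host buffer alive and unmodified, which this sketch does not model.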
While CL_MEM_COPY_HOST_PTR is allowed, it may not work as well as the options above. The reason for these abstractions in OpenCL 1.2 is that the address spaces aren't guaranteed to be equivalent, even when implemented on the same physical hardware. (You can try printing the host and device pointer addresses if you like, to show that they are not interchangeable.) CL_MEM_USE_HOST_PTR provides the right translation between aligned CPU buffer addresses and device address space so that map/unmap will work; the buffer alignment and size limitations on the CPU side are part of making sure that the addresses will translate in both directions. (CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR provides optimal ways to map for a broader range of CPU buffer scenarios and doesn't copy if it doesn't need to.) CL_MEM_COPY_HOST_PTR by itself doesn't guarantee that the translation between the address spaces will be optimal, so even in the best case there will be extra copies.
Is there a scenario not covered by the 3 options above which your application needs? If you want host and device side to work on the same data and take advantage of sharing the same hardware, is CL_MEM_USE_HOST_PTR or CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR sufficient given the additional details above?
The recommended options are fine and cover all my needs. The initial question was just out of curiosity: I didn't see the difference between CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR and CL_MEM_COPY_HOST_PTR alone, so I wanted to learn more about the details (and I would still like to find an example in which CL_MEM_COPY_HOST_PTR by itself doesn't guarantee optimal translation between address spaces). In any case, thank you for your detailed explanations!