OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU

OpenCL mechanism of migration

Ruzhanskaia__Anastas

Hello everyone!

I would like to ask where I can read about the mechanisms underlying the migration of memory objects to and from a device. I would like a detailed explanation, for any OpenCL implementation (Intel is fine), of what happens during buffer migration (the clEnqueueMigrateMemObjects function), as well as during buffer reads (clEnqueueReadBuffer) and writes (clEnqueueWriteBuffer): whether any caching is involved, whether additional intermediate buffers or DMA are used, which calls transfer the data over the PCIe lanes, and all details of that kind.

Sergey_I_Intel1
Employee

Hello Anastasiia,

 

As OpenCL is an open standard maintained by the Khronos Group, you can find a lot of information in the OpenCL Specification.

 

The Intel implementations of OpenCL are collected in runtime repositories, open source on GitHub:

Intel(R) Graphics Compute Runtime for OpenCL(TM)

Intel Project for LLVM* technology

 

Best regards,

Sergey

 

Ben_A_Intel
Employee

Hi Anastasiia,

I'm not an Intel FPGA expert so I can't speak to the FPGA OpenCL implementation, but our current CPU and GPU OpenCL devices use the same memory as the host processor, hence no "migration" is necessary.  This will change for our discrete GPU devices, but we aren't ready to talk about that just yet.

The links Sergey sent previously may be helpful to see how this is currently implemented.  Here are a few other links that I have found useful:

Thanks!

Ruzhanskaia__Anastas

Hi Ben,

thank you for the answer. Regarding the OpenCL source code provided in the first answer: at first glance it is really hard to understand what the internal implementation looks like.

I looked through your tutorial, but I still have some high-level questions that are not clear to me. Maybe you can clarify them in this thread (answers specific to the Intel implementation are also fine):

This is my current view of the three ways of moving data:

1) enqueueReadBuffer/enqueueWriteBuffer - these two functions always copy the contents of a buffer created on the host to the device, or from the device back to the host. No pinned memory and no DMA mechanism are used here.

2) enqueueMigrateMemObjects - this is sometimes described as an alternative to enqueueRead/WriteBuffer, but in this case the memory is copied exactly at the time of this function call. No pinned memory and no DMA mechanism are used here.

3) enqueueMapBuffer/enqueueUnmapBuffer - here pinned memory and a DMA mechanism are always used.

These functions work with two kinds of buffers: those created with the CL_MEM_USE_HOST_PTR flag and those created with the CL_MEM_ALLOC_HOST_PTR flag. With the first, we map an array created on the host to an array created on the device. With the second, the array is allocated on the device and mapped to a newly created array on the host.

This is what I can state according to the documentation.

But for each of these paragraphs I have the following questions:

1) If these functions only copy, then why do people at https://software.intel.com/en-us/forums/opencl/topic/509406 talk about pinning/unpinning memory during reading/writing? Under which conditions do these functions use pinned memory? Or is this just a feature of the Intel implementation, where ALL memory-transfer functions use pinned memory and DMA?

Also, does it follow that if I use pinned memory, the DMA mechanism will be used? And vice versa: if I want DMA to work, do I need pinned memory?

2) Is this migration function exactly what happens inside the enqueueRead/WriteBuffer functions, minus the additional overhead those functions introduce? Does it always use DMA, or may it also do a plain copy?

For some reason, some sources use the words "copy", "memory", and "migration" when talking about DMA transfers between two buffers (one on the host and one on the device). However, in that case there cannot be any copy; we just write directly to the buffer without any copy at all. How does this write happen during DMA? When using the Map/Unmap functions, do we really call them just to perform a DMA transfer of the data?

 

What will happen if I use enqueueMigrateMemObjects with buffers created with the CL_MEM_USE_HOST_PTR flag?

 

3) With these two functions there is total confusion. How will the mapping and reading/writing happen if I use a) an existing host pointer or b) a newly allocated host pointer?

Also, I do not properly understand how the DMA works here. If I have mapped my buffer on the host side to the buffer on the device side, which functions transfer the memory between them in OpenCL? And should I always unmap my buffer afterwards?

Ben Ashbaugh (Intel) wrote:

Hi Anastasiia,

I'm not an Intel FPGA expert so I can't speak to the FPGA OpenCL implementation, but our current CPU and GPU OpenCL devices use the same memory as the host processor, hence no "migration" is necessary.  This will change for our discrete GPU devices, but we aren't ready to talk about that just yet.

The links Sergey sent previously may be helpful to see how this is currently implemented.  Here are a few other links that I have found useful:

Thanks!

Ben_A_Intel
Employee

Hi Anastasiia,

It's hard to say anything too definitive for this topic because things like pinning and DMA are implementation details, and different OpenCL implementations will behave differently.  In other words, the OpenCL specification is written to allow features like pinning and DMA on devices that support and/or require them, but it doesn't mandate their use.

Again, I can speak most confidently about the Intel CPU and integrated GPU implementations.  On these implementations, the host and OpenCL device share physical memory, which means that many implicit copy operations may be omitted.

(1) The ReadBuffer and WriteBuffer APIs are explicit copies between an OpenCL buffer and an application pointer.  Even on our integrated GPUs, this requires pinning (and unpinning) the application host pointer, plus a copy, which may be performed with a DMA operation or otherwise on the device.

(2) The Migration APIs are NOPs since there is nowhere to migrate the OpenCL buffer.  If you profile these APIs you should see that they are effectively free.

(3) The Map and Unmap APIs are usually NOPs as well.  The few exceptions are when the OpenCL device cannot (or chooses not to) directly use or provide access to the OpenCL buffer.  This is rare, but it can happen if the passed-in pointer for a USE_HOST_PTR buffer is poorly aligned, for example.  In these cases, Map and Unmap will generate a copy, which may be performed with a DMA operation or otherwise on the device.

OpenCL devices with dedicated device memory will usually generate additional implicit copies to move data between a (likely pinned) host-accessible representation of an OpenCL buffer and the representation in device memory.  I think the thread linked above does a good job explaining how this works in the Xeon Phi OpenCL implementation.  I suspect that other OpenCL implementations with dedicated device memory likely implement similar policies, but details may differ.

Hope this helps, happy to (try to) answer any follow-on questions.  Thanks!
