Solved: ClCreateBuffer(| CL_MEM_USE_HOST_PTR): When does OpenCL framework transfer data to device via PCI?

Dave_O_ · ‎04-16-2014

Hi

Intel Xeon Phi OpenCL optimization guide suggests using Mapped buffers for data transfer between host and device memory. OpenCL spec also states that the technique is faster than having to write data explicitly to device memory. I am trying to measure the data transfer time from host-device, and from device-host.

My understanding is that OpenCL framework supports two ways of transferring data.

Here is my summarized scenario:

a. Explicit Method:

- Writing: ClWriteBuffer(...)

{ - Invoke execution on device: ClEnqueueNDRangeKernel(kernel) }

- Reading: ClReadBuffer(...)

Pretty simple.

b. Implicit Method:

- Writing: ClCreateBuffer(hostPtr, flag, ...) //Use flag CL_MEM_USE_USE_PTR. make sure to create aligned host buffer to map to.

{ - Invoke execution on device: ClEnqueueNDRangeKernel(kernel) }

- Reading: ClEnqueueMapBuffer(hostPtr, ...) //device relinquishes access to mapped memory back to host for reading processed data

Not very straight-forward.

I am using the second method. At what point does data transfer begin for both writing and reading? I need to insert timing code in the right place of my code in order to see how long it takes. So far, I have it inserted before ClEnqueueNDRangeKernel(kernel) for writing; and before ClEnqueueMapBuffer(hostPtr, ...) for reading. The numbers for my time are very small and I doubt that those are the points where data transmission from host to device memory (for this implicit method) actually begin.

Any clarifications on this towards profiling the data transfer involving the use of these three API commands will be greatly appreciated.

Thanks,
Dave

Dmitry_K_Intel · ‎04-24-2014

Dave O. wrote:

**Regarding implicit data-out transfer (clEnqueueMapBuffer/clEnqueueUnMapBuffer)**

1, 3, 4, 5: cleared, thank you.

2. Data is transferred from host to device using DMA before the first actual device usage. This means that the first kernel that has this buffer as a parameter will be paused, data transfer launched and kernel resumed after the data transfer finished.

- Okay. At which API function call does this happen: ClCreateKernelArguments or ClEnqueueNDRangeKernel ?

It will not happen at user API. It will happen internally immediately after NDRangeKernel command becomes ready for execution. In the case of single in-order queue and no dependencies NDRangeKernel command becomes READY after the previous command in the same queue becomes COMPLETED.

Dave O. wrote:

***Regarding clEnqueueWriteBuffer/clEnqueueReadBuffer vs clEnqueueMapBuffer/clEnqueueUnMapBuffer***

1. Note that clEnqueWriteBuffer/clEnqueReadBuffer operations must pin each time different host memory areas as part of their operations and immediately unpin after.

- so clEnqueWriteBuffer/clEnqueReadBuffer use pinned memory on demand; with that I assume they use DMA as well just like clEnqueueMapBuffer/clEnqueueUnMapBuffer ? (Xeon Phi opencl optimization guide suggests using the later because it is faster. )

clEnqueWriteBuffer/clEnqueReadBuffer MUST pin/unpin during each execution because host memory that is used as a data source/target may be not pinned. You can use workaround from Xeon Phi opencl optimization guide to avoid extra pinning. clEnqueueMapBuffer/clEnqueueUnMapBuffer ALWAYS pin only once - either during clBufferCreate if CL_MEM_USE_HOST_PTR or CL_MEM_ALLOC_HOST_PTR were used or during the first invocation Actually if you use workaround from Xeon Phi opencl optimization guide Write/Read are quite similar to Map/Unmap

Dave O. wrote:

2. Note that clEnqueueMapBuffer transfers data ownership to the host and device cannot access it until clEnqueueUnMapBuffer. Use clEnqueueReadBuffer to break this lock.

- Okay. Even if clEnqueueReadBuffer is used to read data from device to memory, If the blocking is not used, the device might write to the same buffer before host has finished reading from it. That is, assuming a pipelining processing where the kernel is continuously processing input buffer and writing its result to the output buffer. With that, data might get overwritten by the device before host finishes reading. Thus, it seems that clEnqueueMapBuffer/clEnqueueUnMapBuffer are good synch mechanisms (as previously explained in one of your posts) in the absence of blocking read for clEnqueueReadBuffer; the downside of clEnqueueUnMapBuffer of course being that it would have to transfer entire data back to device.

clEnqueueMapBuffer and clEnqueueUnmapBuffer behavior depending on clEnqueueMapBuffer flags: CL_MAP_READ - DMA device-to-host during Map, no-op during Unmap CL_MAP_WRITE - DMA device-to-host during Map, DMA host-to-device during Unmap CL_MAP_WRITE_INVALIDATE_REGION - no-op during Map, DMA host-to-device during Unmap

Dave O. wrote:

**Regarding profiling**

thanks. Clear. I use the two methods described. Events profiling on device side, and host-base timing. However, in your earlier post, you mentioned that event-profiling is slow due to creation of internal sync points on device, and that host-based profiling might be slightly inaccurate because it forces all opencl optimzation to be disabled (and by that you meant due use of clFinish?)

I mean use of clFinish or any other blocking operation. OpenCL Xeon Phi device tries to shorten device idle time by starting the next command as fast as possible after prevoious. If user inserts clFinish or any other blocking operation between commands such optimizations become impossible. I propose to use clFinish or any other blocking operation only at the very end of the algorithm if possible.

View solution in original post

Dave_O_ · ‎04-16-2014

quick type correction:
*ClWriteBuffer -> ClEnqueueWriteBuffer
*ClReadBuffer -> ClEnqueueReadBuffer

Dmitry_K_Intel · ‎04-16-2014

Hi Dave,

If you are using regular in order queue, each next command in the queue starts execution imemdiately after previous one finished. So in the following sequence:

clEnqueueWriteBuffer

clEnqueueNDRange

clEnqueueReadBuffer

there is no place for you to put measurements as you cannot discover when each previous command finishes. Even more - you cannot discover when last Read finishes unless you use a blocking Read. The same is right for the Map/Unmap sequence:

clEnqueueNDRange

clEnqueueMapBuffer

According to OpenCL spec you can do the measurements in 2 ways:

1. Use OpenCL events profiling. Unfortunatly using profiling slows down execution as it enforces OpenCL implementation to create internal synchronization points. Also one Intel OpenCL Xeon Phi implementation issue: NDRange profiling does not include data transfer.

2. Use manual synchronization points. Drawback - all internal OpenCL implementation optimizations will be disabled:

// ensure queue is empty

clFinish()

read-time-counter

clEnqueueNDRange()

clFinish()

read-time-counter

clEnqueueMapBuffer( blocking )

read-time-counter

Just some info about Intel OpenCL for Xeon Phi implementation - data is transferred to the device either by explicit request by user or implicitly, before it is really required on device. In the case with MapBuffer use clEnqueueMigrateMemObjects to force initial data transfer to device. clEnqueueUnmapBuffer is also considered as explicit data transfer request from host to the device where if was mapped from.

David_O_ · ‎04-23-2014

Dmitry,

Thank you.

Do you mean that (for implicit method), ClCreateBuffer(hostPtr, CL_MEM_USE_USE_PTR, ...):

- does not cause any data transfer from host memory to device DDR?

- does this mean that the kernel running on the device would then use host memory directly (instead)?

- if so, via DMA? Because if not, performance should be poor due to the long memory access latency involving in reaching the host (pinned memory?)

I have some results from this scenario. The performance seems to be actually good. I am talking about raw kernel execution time. I still need to ascertain data transfer doubts listed above.

For a second scenario, you suggested using clEnqueueMigrateMemObjects to transfer data explicitly from host memory to mapped device memory.

- Is it any different from clEnqueuWriteBuffer for non-mapped device memory objects? Which is better in terms of performance for Intel architectures (or Phi)?

- will I have to call clEnqueueMigrateMemObjects anytime the host buffer changes? (Much like calling clEnqueuWriteBuffer anytime I need to write new data to the device.)

Please kindly clarify.

Thanks

Dave

Dmitry_K_Intel · ‎04-24-2014

Hi Dave, My answer is relevant for OpenCL for Intel CPU and Intel Xeon Phi implementations only. Note that CPU device and host share the same physical and virtual memory so data transfer between them is a no-op. 1. clCreateBuffer(hostPtr, CL_MEM_USE_USE_PTR, ...) does not perform any data transfer at buffer creation time. 2. Data is transferred from host to device using DMA before the first actual device usage. This means that the first kernel that has this buffer as a parameter will be paused, data transfer launched and kernel resumed after the data transfer finished. 3. After the first usage data will remain on device until one of 2 events happens: - user requests the data back to host using Map operation - device is running out of memory 4. In the case of Map operation data is transferred to the host and remains there until UnMap. UnMap operation triggers DMA back to device if required (ex. Map was requested for data modification) 5. In the case data was swapped out to host because of memory pressure on device it will be DMA'ed back only on the first real usage 6. You can use clEnqueueMigrateMemObjects at any time to trigger DMA to the required device ahead of a time, except when buffer is currently owned by host by mean of a map operation. Now regarding clEnqueueWriteBuffer/clEnqueueReadBuffer vs clEnqueueMapBuffer/clEnqueueUnMapBuffer 1. DMA requires host pages pinning. For buffers created with CL_MEM_USE_HOST_PTR/CL_MEM_ALLOC_HOST_PTR this pinning is performed at clBufferCreate time frame. For other buffers it is performed on demand only. Note that clEnqueWriteBuffer/clEnqueReadBuffer operations must pin each time different host memory areas as part of their operations and immediately unpin after. For clEnqueueMapBuffer Intel OpenCL for Xeon Phi implementation does the host pinning only once but for the whole buffer area. 2. If you need to scatter your data on the host or copy it each time to different host locations I propose you to use Read/Write. Note that clEnqueueMapBuffer transfers data ownership to the host and device cannot access it until clEnqueueUnMapBuffer. Use clEnqueueReadBuffer to break this lock. BTW, I propose you to view clEnqueueMapBuffer/clEnqueueUnMapBuffer method as ownership transfer between host and device. This may greatly simplify the understanding. Profiling: I think the only profiling method that may give you sustainable results for small numbers is to profile the whole algorithm and not each operation. This means you do buffers create, start profiling, do everything, clFinish, stop profiling. If you want to profile specific data transfer command use either OpenCL events profiling method or wrap it with clFinish, start profiling, enqueue command, clFinish, stop profiling Please note that profiling numbers will be highly volatile for small buffers.

Dmitry_K_Intel · ‎04-24-2014

Important difference between OpenCL events profiling method and host-based time difference method: OpenCL events profiling method provides distinct measurements for each stage of the OpenCL pipeline: CL_PROFILING_COMMAND_QUEUED CL_PROFILING_COMMAND_SUBMIT CL_PROFILING_COMMAND_START CL_PROFILING_COMMAND_END Host timer difference method provides only one global measurement that includes all stages.

Dave_O_ · ‎04-24-2014

**Regarding implicit data-out transfer (clEnqueueMapBuffer/clEnqueueUnMapBuffer)**

1, 3, 4, 5: cleared, thank you.

2. Data is transferred from host to device using DMA before the first actual device usage. This means that the first kernel that has this buffer as a parameter will be paused, data transfer launched and kernel resumed after the data transfer finished.

- Okay. At which API function call does this happen: ClCreateKernelArguments or ClEnqueueNDRangeKernel ?

***Regarding clEnqueueWriteBuffer/clEnqueueReadBuffer vs clEnqueueMapBuffer/clEnqueueUnMapBuffer***

1. Note that clEnqueWriteBuffer/clEnqueReadBuffer operations must pin each time different host memory areas as part of their operations and immediately unpin after.

- so clEnqueWriteBuffer/clEnqueReadBuffer use pinned memory on demand; with that I assume they use DMA as well just like clEnqueueMapBuffer/clEnqueueUnMapBuffer ? (Xeon Phi opencl optimization guide suggests using the later because it is faster. )

2. Note that clEnqueueMapBuffer transfers data ownership to the host and device cannot access it until clEnqueueUnMapBuffer. Use clEnqueueReadBuffer to break this lock.

- Okay. Even if clEnqueueReadBuffer is used to read data from device to memory, If the blocking is not used, the device might write to the same buffer before host has finished reading from it. That is, assuming a pipelining processing where the kernel is continuously processing input buffer and writing its result to the output buffer. With that, data might get overwritten by the device before host finishes reading. Thus, it seems that clEnqueueMapBuffer/clEnqueueUnMapBuffer are good synch mechanisms (as previously explained in one of your posts) in the absence of blocking read for clEnqueueReadBuffer; the downside of clEnqueueUnMapBuffer of course being that it would have to transfer entire data back to device.

**Regarding profiling**
thanks. Clear. I use the two methods described. Events profiling on device side, and host-base timing. However, in your earlier post, you mentioned that event-profiling is slow due to creation of internal sync points on device, and that host-based profiling might be slightly inaccurate because it forces all opencl optimzation to be disabled (and by that you meant due use of clFinish?)

Dmitry_K_Intel · ‎04-24-2014

Dave O. wrote:

**Regarding implicit data-out transfer (clEnqueueMapBuffer/clEnqueueUnMapBuffer)**

1, 3, 4, 5: cleared, thank you.

2. Data is transferred from host to device using DMA before the first actual device usage. This means that the first kernel that has this buffer as a parameter will be paused, data transfer launched and kernel resumed after the data transfer finished.

- Okay. At which API function call does this happen: ClCreateKernelArguments or ClEnqueueNDRangeKernel ?

It will not happen at user API. It will happen internally immediately after NDRangeKernel command becomes ready for execution. In the case of single in-order queue and no dependencies NDRangeKernel command becomes READY after the previous command in the same queue becomes COMPLETED.

Dave O. wrote:

***Regarding clEnqueueWriteBuffer/clEnqueueReadBuffer vs clEnqueueMapBuffer/clEnqueueUnMapBuffer***

1. Note that clEnqueWriteBuffer/clEnqueReadBuffer operations must pin each time different host memory areas as part of their operations and immediately unpin after.

- so clEnqueWriteBuffer/clEnqueReadBuffer use pinned memory on demand; with that I assume they use DMA as well just like clEnqueueMapBuffer/clEnqueueUnMapBuffer ? (Xeon Phi opencl optimization guide suggests using the later because it is faster. )

clEnqueWriteBuffer/clEnqueReadBuffer MUST pin/unpin during each execution because host memory that is used as a data source/target may be not pinned. You can use workaround from Xeon Phi opencl optimization guide to avoid extra pinning. clEnqueueMapBuffer/clEnqueueUnMapBuffer ALWAYS pin only once - either during clBufferCreate if CL_MEM_USE_HOST_PTR or CL_MEM_ALLOC_HOST_PTR were used or during the first invocation Actually if you use workaround from Xeon Phi opencl optimization guide Write/Read are quite similar to Map/Unmap

Dave O. wrote:

2. Note that clEnqueueMapBuffer transfers data ownership to the host and device cannot access it until clEnqueueUnMapBuffer. Use clEnqueueReadBuffer to break this lock.

- Okay. Even if clEnqueueReadBuffer is used to read data from device to memory, If the blocking is not used, the device might write to the same buffer before host has finished reading from it. That is, assuming a pipelining processing where the kernel is continuously processing input buffer and writing its result to the output buffer. With that, data might get overwritten by the device before host finishes reading. Thus, it seems that clEnqueueMapBuffer/clEnqueueUnMapBuffer are good synch mechanisms (as previously explained in one of your posts) in the absence of blocking read for clEnqueueReadBuffer; the downside of clEnqueueUnMapBuffer of course being that it would have to transfer entire data back to device.

clEnqueueMapBuffer and clEnqueueUnmapBuffer behavior depending on clEnqueueMapBuffer flags: CL_MAP_READ - DMA device-to-host during Map, no-op during Unmap CL_MAP_WRITE - DMA device-to-host during Map, DMA host-to-device during Unmap CL_MAP_WRITE_INVALIDATE_REGION - no-op during Map, DMA host-to-device during Unmap

Dave O. wrote:

**Regarding profiling**

thanks. Clear. I use the two methods described. Events profiling on device side, and host-base timing. However, in your earlier post, you mentioned that event-profiling is slow due to creation of internal sync points on device, and that host-based profiling might be slightly inaccurate because it forces all opencl optimzation to be disabled (and by that you meant due use of clFinish?)

I mean use of clFinish or any other blocking operation. OpenCL Xeon Phi device tries to shorten device idle time by starting the next command as fast as possible after prevoious. If user inserts clFinish or any other blocking operation between commands such optimizations become impossible. I propose to use clFinish or any other blocking operation only at the very end of the algorithm if possible.

Dave_O_ · ‎04-24-2014

Brilliant.

Dave_O_ · ‎04-24-2014

Dmitry,
I have a related question here in a separate thread: https://software.intel.com/en-us/forums/topic/509816. Any input?