Hi
The Intel Xeon Phi OpenCL optimization guide suggests using mapped buffers for data transfer between host and device memory. The OpenCL spec also states that this technique is faster than writing data explicitly to device memory. I am trying to measure the data transfer time from host to device and from device to host.

My understanding is that the OpenCL framework supports two ways of transferring data.
Here is my summarized scenario:
a. Explicit method:
- Writing: ClWriteBuffer(...)
- Invoke execution on device: clEnqueueNDRangeKernel(kernel)
- Reading: ClReadBuffer(...)
Pretty simple.
b. Implicit method:
- Writing: clCreateBuffer(hostPtr, flag, ...) // Use the flag CL_MEM_USE_HOST_PTR, and make sure the host buffer to map to is aligned.
- Invoke execution on device: clEnqueueNDRangeKernel(kernel)
- Reading: clEnqueueMapBuffer(hostPtr, ...) // The device relinquishes access to the mapped memory so that the host can read the processed data.
Not very straightforward.
I am using the second method. At what point does the data transfer begin, for both writing and reading? I need to insert timing code at the right places to see how long each transfer takes. So far, I have placed it before clEnqueueNDRangeKernel(kernel) for writing and before clEnqueueMapBuffer(hostPtr, ...) for reading. The times I measure are very small, and I doubt that these are the points where the data transfer between host and device memory actually begins for the implicit method.
	
Any clarification on profiling the data transfer involving these three API commands would be greatly appreciated.
	
	Thanks,
	Dave 
Quick typo correction:
*ClWriteBuffer -> clEnqueueWriteBuffer
*ClReadBuffer -> clEnqueueReadBuffer
	 
Hi Dave,
If you are using a regular in-order queue, each command in the queue starts executing immediately after the previous one finishes. So in the following sequence:
clEnqueueWriteBuffer
clEnqueueNDRangeKernel
clEnqueueReadBuffer
there is no place to put measurements, because you cannot tell when each previous command finishes. Moreover, you cannot tell when the last Read finishes unless you use a blocking Read. The same is true for the Map/Unmap sequence:
clEnqueueNDRangeKernel
clEnqueueMapBuffer
According to the OpenCL spec, you can take the measurements in two ways:
1. Use OpenCL event profiling. Unfortunately, profiling slows down execution because it forces the OpenCL implementation to create internal synchronization points. There is also one issue specific to the Intel OpenCL implementation for Xeon Phi: NDRange profiling does not include data transfer.
2. Use manual synchronization points. Drawback: all internal OpenCL implementation optimizations will be disabled:
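// ensure the queue is empty
clFinish()
read-time-counter                // t0
clEnqueueNDRangeKernel()
clFinish()
read-time-counter                // t1: t1 - t0 covers the kernel plus any implicit host-to-device transfer
clEnqueueMapBuffer( blocking )
read-time-counter                // t2: t2 - t1 covers the blocking map (device-to-host transfer)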
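The "read-time-counter" steps above can be sketched in plain C. This is only a minimal sketch under some assumptions: it uses the POSIX monotonic clock, and the names `blocking_step_fn`, `now_ms`, `time_blocking_step_ms` and `dummy_step` are hypothetical helpers, not part of OpenCL. In real code the callback would enqueue the command and then block (for example clEnqueueNDRangeKernel followed by clFinish, or a blocking clEnqueueMapBuffer).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* needed for clock_gettime on strict compilers */
#endif
#include <assert.h>
#include <time.h>

/* Hypothetical callback type: one step that enqueues work AND blocks until
 * that work has completed. */
typedef void (*blocking_step_fn)(void *ctx);

/* Read the monotonic wall clock in milliseconds. Wall-clock time is used
 * because the host mostly waits while the device works, so CPU time
 * (clock()) would under-report the interval. */
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e3 + (double)ts.tv_nsec / 1e6;
}

/* Time one blocking step: read counter, run the step, read counter again. */
static double time_blocking_step_ms(blocking_step_fn step, void *ctx)
{
    assert(step != NULL);
    double t0 = now_ms();
    step(ctx);
    double t1 = now_ms();
    return t1 - t0;
}

/* Stand-in step for illustration only: it just counts how often it ran. */
static void dummy_step(void *ctx)
{
    ++*(int *)ctx;
}
```

Calling `time_blocking_step_ms(dummy_step, &counter)` returns the elapsed milliseconds for that step. The queue should be drained (clFinish) before the first measurement, as in the pseudocode, so that earlier commands do not leak into the interval.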
Some information about the Intel OpenCL implementation for Xeon Phi: data is transferred to the device either on an explicit request by the user or implicitly, before it is actually required on the device. When using clEnqueueMapBuffer, call clEnqueueMigrateMemObjects to force the initial data transfer to the device. clEnqueueUnmapMemObject is also treated as an explicit request to transfer data from the host to the device it was mapped from.
Dmitry,
Thank you.
Do you mean that, for the implicit method, clCreateBuffer(hostPtr, CL_MEM_USE_HOST_PTR, ...):
- does not cause any data transfer from host memory to device DDR?
- means that the kernel running on the device then uses host memory directly instead?
- if so, does it do this via DMA? If not, performance should be poor because of the long memory access latency involved in reaching the host (pinned memory?).
I have some results from this scenario, and the performance actually seems good. I am talking about raw kernel execution time. I still need to resolve the data transfer doubts listed above.
For a second scenario, you suggested using clEnqueueMigrateMemObjects to transfer data explicitly from host memory to mapped device memory.
- Is it any different from clEnqueueWriteBuffer for non-mapped device memory objects? Which is better in terms of performance on Intel architectures (or Phi)?
- Will I have to call clEnqueueMigrateMemObjects every time the host buffer changes (much like calling clEnqueueWriteBuffer every time I need to write new data to the device)?
Please kindly clarify.
Thanks
Dave
	 
**Regarding implicit data-out transfer (clEnqueueMapBuffer/clEnqueueUnmapMemObject)**
1, 3, 4, 5: cleared, thank you.
2. Data is transferred from host to device using DMA before the first actual device usage. This means that the first kernel that has this buffer as a parameter will be paused, the data transfer launched, and the kernel resumed after the data transfer has finished.
- Okay. At which API function call does this happen: clSetKernelArg or clEnqueueNDRangeKernel?
***Regarding clEnqueueWriteBuffer/clEnqueueReadBuffer vs clEnqueueMapBuffer/clEnqueueUnmapMemObject***
1. Note that clEnqueueWriteBuffer/clEnqueueReadBuffer operations must pin different host memory areas each time as part of their operation and unpin them immediately afterwards.
- So clEnqueueWriteBuffer/clEnqueueReadBuffer use pinned memory on demand; I assume they then use DMA as well, just like clEnqueueMapBuffer/clEnqueueUnmapMemObject? (The Xeon Phi OpenCL optimization guide suggests using the latter because it is faster.)
2. Note that clEnqueueMapBuffer transfers data ownership to the host, and the device cannot access the data until clEnqueueUnmapMemObject. Use clEnqueueReadBuffer to break this lock.
- Okay. Even if clEnqueueReadBuffer is used to read data from the device to host memory, if blocking is not used, the device might write to the same buffer before the host has finished reading from it. That is, assuming pipelined processing where the kernel continuously processes the input buffer and writes its result to the output buffer, data might get overwritten by the device before the host finishes reading. Thus, it seems that clEnqueueMapBuffer/clEnqueueUnmapMemObject are good synchronization mechanisms (as explained in one of your earlier posts) in the absence of a blocking read for clEnqueueReadBuffer; the downside of clEnqueueUnmapMemObject, of course, is that it has to transfer the entire data back to the device.

**Regarding profiling**
Thanks. Clear. I use the two methods described: event profiling on the device side, and host-based timing. However, in your earlier post you mentioned that event profiling is slow because it creates internal synchronization points on the device, and that host-based profiling might be slightly inaccurate because it forces all OpenCL optimizations to be disabled (by that, did you mean because of the use of clFinish?).
**Regarding implicit data-out transfer (clEnqueueMapBuffer/clEnqueueUnmapMemObject)**
Dave O. wrote: "1, 3, 4, 5: cleared, thank you. 2. Data is transferred from host to device using DMA before the first actual device usage. This means that the first kernel that has this buffer as a parameter will be paused, the data transfer launched, and the kernel resumed after the data transfer has finished. - Okay. At which API function call does this happen: clSetKernelArg or clEnqueueNDRangeKernel?"
It does not happen at a user API call. It happens internally, immediately after the NDRangeKernel command becomes ready for execution. With a single in-order queue and no dependencies, the NDRangeKernel command becomes READY after the previous command in the same queue becomes COMPLETED.
***Regarding clEnqueueWriteBuffer/clEnqueueReadBuffer vs clEnqueueMapBuffer/clEnqueueUnmapMemObject***
Dave O. wrote: "1. Note that clEnqueueWriteBuffer/clEnqueueReadBuffer operations must pin different host memory areas each time as part of their operation and unpin them immediately afterwards. - So clEnqueueWriteBuffer/clEnqueueReadBuffer use pinned memory on demand; I assume they then use DMA as well, just like clEnqueueMapBuffer/clEnqueueUnmapMemObject? (The Xeon Phi OpenCL optimization guide suggests using the latter because it is faster.)"
clEnqueueWriteBuffer/clEnqueueReadBuffer MUST pin and unpin during each execution, because the host memory used as the data source or target may not be pinned. You can use the workaround from the Xeon Phi OpenCL optimization guide to avoid the extra pinning. clEnqueueMapBuffer/clEnqueueUnmapMemObject ALWAYS pin only once: either during clCreateBuffer, if CL_MEM_USE_HOST_PTR or CL_MEM_ALLOC_HOST_PTR was used, or during the first invocation. In fact, if you use the workaround from the Xeon Phi OpenCL optimization guide, Write/Read are quite similar to Map/Unmap.
Dave O. wrote: "2. Note that clEnqueueMapBuffer transfers data ownership to the host, and the device cannot access the data until clEnqueueUnmapMemObject. Use clEnqueueReadBuffer to break this lock. - Okay. Even if clEnqueueReadBuffer is used to read data from the device to host memory, if blocking is not used, the device might write to the same buffer before the host has finished reading from it. That is, assuming pipelined processing where the kernel continuously processes the input buffer and writes its result to the output buffer, data might get overwritten by the device before the host finishes reading. Thus, it seems that clEnqueueMapBuffer/clEnqueueUnmapMemObject are good synchronization mechanisms (as explained in one of your earlier posts) in the absence of a blocking read for clEnqueueReadBuffer; the downside of clEnqueueUnmapMemObject, of course, is that it has to transfer the entire data back to the device."
The behavior of clEnqueueMapBuffer and clEnqueueUnmapMemObject depends on the clEnqueueMapBuffer flags:
- CL_MAP_READ: DMA device-to-host during Map, no-op during Unmap
- CL_MAP_WRITE: DMA device-to-host during Map, DMA host-to-device during Unmap
- CL_MAP_WRITE_INVALIDATE_REGION: no-op during Map, DMA host-to-device during Unmap
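The map-flag behavior described in this reply can be written down as a small lookup, which may help when deciding which call to time for which direction. This is only a sketch of what the reply states for the Xeon Phi implementation, not a statement about OpenCL in general. The enum and function names are hypothetical local stand-ins so that the snippet compiles without the OpenCL headers; real code would use the CL_MAP_* constants directly.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the OpenCL map flags (assumption: used here only so
 * the sketch builds without CL/cl.h). */
enum map_mode {
    MAP_MODE_READ,                    /* CL_MAP_READ */
    MAP_MODE_WRITE,                   /* CL_MAP_WRITE */
    MAP_MODE_WRITE_INVALIDATE_REGION  /* CL_MAP_WRITE_INVALIDATE_REGION */
};

/* Which DMA transfers a Map/Unmap pair triggers, per the reply above. */
struct map_transfers {
    bool device_to_host_on_map;
    bool host_to_device_on_unmap;
};

static struct map_transfers transfers_for(enum map_mode mode)
{
    struct map_transfers t = { false, false };
    switch (mode) {
    case MAP_MODE_READ:
        t.device_to_host_on_map = true;
        break;
    case MAP_MODE_WRITE:
        t.device_to_host_on_map = true;
        t.host_to_device_on_unmap = true;
        break;
    case MAP_MODE_WRITE_INVALIDATE_REGION:
        t.host_to_device_on_unmap = true;
        break;
    default:
        assert(!"unknown map mode");
    }
    return t;
}

/* Number of DMA transfers to account for when timing one Map/Unmap pair. */
static int dma_count(enum map_mode mode)
{
    struct map_transfers t = transfers_for(mode);
    return (int)t.device_to_host_on_map + (int)t.host_to_device_on_unmap;
}
```

Read this way, a buffer that the host only fills should be mapped with the invalidate flag, since a plain write mapping also pays for a device-to-host copy during Map.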
**Regarding profiling**
Dave O. wrote: "Thanks. Clear. I use the two methods described: event profiling on the device side, and host-based timing. However, in your earlier post you mentioned that event profiling is slow because it creates internal synchronization points on the device, and that host-based profiling might be slightly inaccurate because it forces all OpenCL optimizations to be disabled (by that, did you mean because of the use of clFinish?)."
I mean the use of clFinish or any other blocking operation. The OpenCL Xeon Phi device tries to shorten device idle time by starting the next command as soon as possible after the previous one. If the user inserts clFinish or any other blocking operation between commands, such optimizations become impossible. I suggest using clFinish or any other blocking operation only at the very end of the algorithm, if possible.
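For the event-profiling route discussed in this thread, the arithmetic on the reported timestamps might look like the sketch below. It assumes the four values were already read with clGetEventProfilingInfo (CL_PROFILING_COMMAND_QUEUED, _SUBMIT, _START and _END, all in nanoseconds) from an event on a queue created with CL_QUEUE_PROFILING_ENABLE. The struct and function names are hypothetical, and plain `unsigned long long` stands in for `cl_ulong` so the snippet builds without the OpenCL headers.

```c
#include <assert.h>

/* The four timestamps OpenCL event profiling reports for one command,
 * in nanoseconds (stand-in for four cl_ulong values). */
struct event_times_ns {
    unsigned long long queued;
    unsigned long long submit;
    unsigned long long start;
    unsigned long long end;
};

/* Derived intervals in milliseconds. */
struct event_breakdown_ms {
    double wait_before_start;  /* QUEUED -> START */
    double execution;          /* START  -> END   */
};

static struct event_breakdown_ms breakdown(struct event_times_ns t)
{
    /* Timestamps of a completed command are expected to be ordered. */
    assert(t.queued <= t.submit && t.submit <= t.start && t.start <= t.end);

    struct event_breakdown_ms b;
    b.wait_before_start = (double)(t.start - t.queued) / 1e6;
    b.execution         = (double)(t.end - t.start) / 1e6;
    return b;
}
```

Keep in mind the caveat from earlier in the thread: on the Xeon Phi implementation the NDRange profiling interval does not include the data transfer, so the execution figure for a kernel is not the whole cost of the implicit method.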
Brilliant.
Dmitry,
	I have a related question here in a separate thread: https://software.intel.com/en-us/forums/topic/509816. Any input? 
 
					
				
				
			
		