Hi!
AMD announced support for pinning buffers in place ("pin-in-place") for OpenCL. I would assume this allows clEnqueueReadBuffer and clEnqueueWriteBuffer to return immediately (currently they block for about 1 ms), because only the pointer would need to be copied. Is there such support in Intel OpenCL (or is it planned)?
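For reference, this is the kind of sequence I mean (just a minimal sketch, assuming an existing context, queue, and kernel; names and sizes are placeholders):
cl_int err;
float host_in[1024], host_out[1024];
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(host_in), NULL, &err);
/* Blocking write: returns only after the data has been copied into the buffer. */
err = clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, sizeof(host_in), host_in, 0, NULL, NULL);
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
size_t gws = 1024;
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
/* Blocking read: returns only after the data has been copied back to the host. */
err = clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof(host_out), host_out, 0, NULL, NULL);
clReleaseMemObject(buf);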
Thanks!
Atmapuri
We are not using this forum to comment on third-party implementations or to disclose our future plans. Sorry for that.
We continue to support developers through this forum, and the engineering team will continue to provide you with details on our current implementation.
Regards,
- Arnon
Hi Atmapuri,
Can you elaborate more on the pin-in-place feature? Is it public?
I want to better understand how it can eliminate the data copy during clEnqueueRead/Write operations.
Thanks,
Evgeny
Hi!
Ok, I guess I can assume that this feature is not yet present or planned. I only read the announcement from AMD. Clearly, if the OpenCL code and the host application share the same memory, there is no need to copy buffers; copying is only required when the devices have physically separate memories. The feature also cannot be implemented within the OpenCL standard. It would have to be a vendor-specific extension, possibly breaking other OpenCL features such as a context shared between the CPU and a GPU device with physically separate memory.
Nevertheless, if the buffer is to be used only by the OpenCL driver running on the CPU, the buffer copy can be reduced to a simple pointer copy. That completely eliminates the need for a thread lock (Sleep(1) takes a minimum of 1 ms regardless of the size of the copy) and reduces the cost of clEnqueueRead/Write by roughly 1000x, into the microsecond range, to the same level as clSetKernelArg.
This is most useful when the GPU and CPU share the same memory (as in the AMD Fusion design), where the overhead of copying buffers between GPU and CPU could be reduced to nearly nothing for OpenCL-based applications. It is not much less useful for pure CPU (no GPU) applications, like those also targeted by the Intel OpenCL driver.
I see two possible approaches (a purely hypothetical sketch of both follows below):
1.) Allow clEnqueueRead/Write to accept the pointer without a copy.
2.) Avoid clEnqueueRead/Write entirely and allow passing a pointer to the array directly to clSetKernelArg, preserving all the behaviour of clEnqueueRead/Write. Instead of a cl_mem buffer object, it could also be a simple memory pointer.
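Purely to make the two options concrete, here is a hypothetical sketch of what the caller's side might look like. Neither form is valid OpenCL today; the zero-copy behaviour in (1) and the raw-pointer argument in (2) are invented for illustration only, and queue, buf, kernel, host_ptr and nbytes are placeholders:
/* HYPOTHETICAL ONLY -- not valid OpenCL as currently specified. */
/* Approach 1: clEnqueueWriteBuffer would merely record (pin) the aligned host
   pointer instead of copying the data, so the call returns in microseconds. */
err = clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, nbytes, host_ptr, 0, NULL, NULL);  /* driver would pin host_ptr in place */
/* Approach 2: skip the cl_mem object entirely and pass the host pointer
   straight to the kernel argument (no such overload exists in the standard). */
err = clSetKernelArg(kernel, 0, sizeof(void *), &host_ptr);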
Thanks!
Atmapuri
A good way to utilize a device that shares memory with the host is to allocate memory on the host with the proper (CL_DEVICE_MEM_BASE_ADDR_ALIGN) alignment - for the Intel SDK that is 128-byte alignment. Then create a memory object using the CL_MEM_USE_HOST_PTR flag and that pointer, and from then on perform reads and writes using the clEnqueueMapXXX and clEnqueueUnmapMemObject APIs. For the Intel SDK, improved performance is expected compared to clEnqueueRead/WriteBuffer calls.
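A minimal sketch of that flow might look like the following (it assumes an existing context, queue and a kernel taking a single __global float* argument; requires <stdlib.h> and <CL/cl.h>; error checks omitted):
cl_int err;
size_t nbytes = 1024 * 1024 * sizeof(float);
/* Host allocation aligned to CL_DEVICE_MEM_BASE_ADDR_ALIGN / 8
   (128 bytes for the Intel SDK, as noted above). */
float *host = NULL;
posix_memalign((void **)&host, 128, nbytes);
/* CL_MEM_USE_HOST_PTR wraps the host allocation -- no copy is made here. */
cl_mem buf = clCreateBuffer(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_WRITE, nbytes, host, &err);
/* Fill the buffer through a mapped pointer instead of clEnqueueWriteBuffer. */
float *p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE, 0, nbytes, 0, NULL, NULL, &err);
for (size_t i = 0; i < nbytes / sizeof(float); ++i)
    p[i] = (float)i;
err = clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
size_t gws = nbytes / sizeof(float);
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
/* Read the results back through a map as well. */
p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ, 0, nbytes, 0, NULL, NULL, &err);
/* ... use p ... */
err = clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
clReleaseMemObject(buf);
free(host);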
For more information about this, please check the optimization guide.
Doron
Thanks! I will try that. I would still prefer clSetKernelArg being able to take a (properly aligned) HOST_PTR without further ado.
It turns out that the required alignment is 1024 bytes and not 128. Another thing is that this code:
cl_mem buffer = clCreateBuffer(context, CL_MEM_USE_HOST_PTR, 1024*1024*2, array, &error);  /* wrap existing host array */
void *mapaddr = clEnqueueMapBuffer(cpucommandqueue, buffer, CL_TRUE, 0, 0, 1024*1024, 0, NULL, NULL, &error);  /* blocking map */
error = clEnqueueUnmapMemObject(cpucommandqueue, buffer, mapaddr, 0, NULL, NULL);
error = clReleaseMemObject(buffer);
runs 2x slower than on AMD and still takes 70 µs on average. That is a far cry from a pointer dereference, and enough time to compute eight 1024-point FFTs. The clCreateBuffer/clReleaseMemObject pair alone takes 40 µs.
Thanks!
Atmapuri
Hi Atmapuri,
Please read the spec again - the value is given in bits, not bytes, so 1024 bits is 128 bytes.
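If it helps, the alignment can also be queried at run time and converted to bytes rather than hard-coded (a small sketch; device is assumed to be the CPU device handle already in use):
cl_uint align_bits = 0;
clGetDeviceInfo(device, CL_DEVICE_MEM_BASE_ADDR_ALIGN, sizeof(align_bits), &align_bits, NULL);
size_t align_bytes = align_bits / 8;   /* e.g. 1024 bits -> 128 bytes */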
As for the measurements, am I correct in reading that a map/unmap of a buffer with the proper alignment takes ~30 µs to transfer data to/from the device? Is that close enough to "zero copy" performance?
Thanks,
Doron Singer
Hi,
In my view, anything above 1 µs is too much for a CPU device. Why should there be a delay for no obvious benefit? What are the possible side effects if the HOST_PTR pointer is reused without the map function, simply ensuring that the command queue has been finished with clFinish?
Thanks!
Atmapuri
Hi Atmapuri,
Using the map functions is required by the OpenCL spec, especially for discrete devices, even when the USE_HOST_PTR flag is used.
Of course, when the device uses the same physical memory there is room for optimization.
We will evaluate the optimization opportunities and integrate them into our implementation if possible.
Thanks,
Evgeny
