Hi!
AMD announced support for pinning buffers in place ("pin-in-place") for OpenCL. I would assume this allows clEnqueueReadBuffer and clEnqueueWriteBuffer to return immediately (currently they block for about 1 ms), because only the pointer would need to be copied. Is there such support in Intel OpenCL (or is it planned)?
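For reference, this is the kind of sequence I mean (just a minimal sketch, assuming an existing context, queue, and kernel; names and sizes are placeholders):
cl_int err;
float host_in[1024], host_out[1024];
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(host_in), NULL, &err);
/* Blocking write: returns only after the data has been copied into the buffer. */
err = clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, sizeof(host_in), host_in, 0, NULL, NULL);
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
size_t gws = 1024;
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
/* Blocking read: returns only after the data has been copied back to the host. */
err = clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof(host_out), host_out, 0, NULL, NULL);
clReleaseMemObject(buf);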
Thanks!
Atmapuri
We are not using this forum to comment on third-party implementations or to disclose our future plans. Sorry for that.
We continue to support developers through this forum, and the engineering team will continue to provide you with details on our current implementation.
Regards,
- Arnon
Hi Atmapuri,
Can you elaborate more on the pin-in-place feature? Is it public?
I want to better understand how it can eliminate the data copy during clEnqueueRead/Write operations.
Thanks,
Evgeny
Hi!
Ok, I guess I can assume that this feature is not yet present or planned. I only read the announcement from AMD. Clearly, if the OpenCL code and the host application share the same memory, there is no need to copy buffers; copying is only required when the devices have physically separate memories. The feature also cannot be implemented within the OpenCL standard. It would have to be a vendor-specific extension, possibly breaking other OpenCL features such as a context shared between the CPU and a GPU device with physically separate memory.
Nevertheless, if the buffer is to be used only by the OpenCL driver running on the CPU, the buffer copy can be reduced to a simple pointer copy. That completely eliminates the need for a thread lock (Sleep(1) takes a minimum of 1 ms regardless of the size of the copy) and reduces the cost of clEnqueueRead/Write by roughly 1000x, into the microsecond range, to the same level as clSetKernelArg.
This is most useful when the GPU and CPU share the same memory (as in the AMD Fusion design), where the overhead of copying buffers between GPU and CPU could be reduced to nearly nothing for OpenCL-based applications. It is not much less useful for pure CPU (no GPU) applications, like those also targeted by the Intel OpenCL driver.
I see two possible approaches (a purely hypothetical sketch of both follows below):
1.) Allow clEnqueueRead/Write to accept the pointer without a copy.
2.) Avoid clEnqueueRead/Write entirely and allow passing a pointer to the array directly to clSetKernelArg, preserving all the behaviour of clEnqueueRead/Write. Instead of a cl_mem buffer object, it could also be a simple memory pointer.
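Purely to make the two options concrete, here is a hypothetical sketch of what the caller's side might look like. Neither form is valid OpenCL today; the zero-copy behaviour in (1) and the raw-pointer argument in (2) are invented for illustration only, and queue, buf, kernel, host_ptr and nbytes are placeholders:
/* HYPOTHETICAL ONLY -- not valid OpenCL as currently specified. */
/* Approach 1: clEnqueueWriteBuffer would merely record (pin) the aligned host
   pointer instead of copying the data, so the call returns in microseconds. */
err = clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, nbytes, host_ptr, 0, NULL, NULL);  /* driver would pin host_ptr in place */
/* Approach 2: skip the cl_mem object entirely and pass the host pointer
   straight to the kernel argument (no such overload exists in the standard). */
err = clSetKernelArg(kernel, 0, sizeof(void *), &host_ptr);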
Thanks!
Atmapuri
A good way to utilize a device that shares memory with the host is to allocate memory on the host with the proper (CL_DEVICE_MEM_BASE_ADDR_ALIGN) alignment - for the Intel SDK that is 128-byte alignment. Then create a memory object using the CL_MEM_USE_HOST_PTR flag and that pointer, and from then on perform reads and writes using the clEnqueueMapXXX and clEnqueueUnmapMemObject APIs. For the Intel SDK, improved performance is expected compared to clEnqueueRead/WriteBuffer calls.
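A minimal sketch of that flow might look like the following (it assumes an existing context, queue and a kernel taking a single __global float* argument; requires <stdlib.h> and <CL/cl.h>; error checks omitted):
cl_int err;
size_t nbytes = 1024 * 1024 * sizeof(float);
/* Host allocation aligned to CL_DEVICE_MEM_BASE_ADDR_ALIGN / 8
   (128 bytes for the Intel SDK, as noted above). */
float *host = NULL;
posix_memalign((void **)&host, 128, nbytes);
/* CL_MEM_USE_HOST_PTR wraps the host allocation -- no copy is made here. */
cl_mem buf = clCreateBuffer(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_WRITE, nbytes, host, &err);
/* Fill the buffer through a mapped pointer instead of clEnqueueWriteBuffer. */
float *p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE, 0, nbytes, 0, NULL, NULL, &err);
for (size_t i = 0; i < nbytes / sizeof(float); ++i)
    p[i] = (float)i;
err = clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
size_t gws = nbytes / sizeof(float);
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
/* Read the results back through a map as well. */
p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ, 0, nbytes, 0, NULL, NULL, &err);
/* ... use p ... */
err = clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
clReleaseMemObject(buf);
free(host);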
For more information about this, please check the optimization guide.
Doron
Thanks! I will try that. I would still prefer clSetKernelArg being able to take a (properly aligned) HOST_PTR without further ado.
It turns out that the required alignment is 1024 bytes and not 128. Another thing is that this code:
cl_mem buffer = clCreateBuffer(context, CL_MEM_USE_HOST_PTR, 1024*1024*2, array, &error);  /* wrap existing host array */
void *mapaddr = clEnqueueMapBuffer(cpucommandqueue, buffer, CL_TRUE, 0, 0, 1024*1024, 0, NULL, NULL, &error);  /* blocking map */
error = clEnqueueUnmapMemObject(cpucommandqueue, buffer, mapaddr, 0, NULL, NULL);
error = clReleaseMemObject(buffer);
runs 2x slower than on AMD and still takes 70 µs on average. That is a far cry from a pointer dereference, and enough time to compute eight 1024-point FFTs. The clCreateBuffer/clReleaseMemObject pair alone takes 40 µs.
Thanks!
Atmapuri
Hi Atmapuri,
Please read the spec again - the value is given in bits, not bytes, so 1024 bits is 128 bytes.
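If it helps, the alignment can also be queried at run time and converted to bytes rather than hard-coded (a small sketch; device is assumed to be the CPU device handle already in use):
cl_uint align_bits = 0;
clGetDeviceInfo(device, CL_DEVICE_MEM_BASE_ADDR_ALIGN, sizeof(align_bits), &align_bits, NULL);
size_t align_bytes = align_bits / 8;   /* e.g. 1024 bits -> 128 bytes */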
As for the measurements, am I correct in reading that a map/unmap of a buffer with the proper alignment takes ~30 µs to transfer data to/from the device? Is that close enough to "zero copy" performance?
Thanks,
Doron Singer
Hi,
In my view, anything above 1 µs is too much for a CPU device. Why should there be a delay for no obvious benefit? What are the possible side effects if the HOST_PTR pointer is reused without the map function, simply ensuring that the command queue has been finished with clFinish?
Thanks!
Atmapuri
Hi Atmapuri,
Using the map functions is required by the OpenCL spec, especially for discrete devices, even when the USE_HOST_PTR flag is used.
Of course, when the device uses the same physical memory there is room for optimization.
We will evaluate the optimization opportunities and integrate them into our implementation if possible.
Thanks,
Evgeny
