First of all, of course you can share the same buffer between CPU and GPU devices; OpenCL allows it. Simply create an OpenCL context associated with both the CPU and the GPU, and create the buffer in that context. The buffer can then be used by every device associated with the context.
I have a further question about memory sharing: Is it okay to run a kernel on each device - CPU & GPU - that write to the same buffer but different addresses? What if the CPU and GPU write to two (different) addresses that are in the same cache line? Is this handled correctly?
I haven't tried it, but I'll try to answer and share my thoughts.
First of all, it would be bad programming style. Secondly, and most importantly, the behaviour is undefined according to the OpenCL specification. OpenCL uses a relaxed-consistency memory model, and memory consistency for memory objects shared between enqueued commands is only guaranteed at synchronization points (clFinish, clWaitForEvents, event dependencies, etc.). So if kernels on two devices write to the same buffer concurrently, even at different offsets, the behaviour is undefined. You can find more information in the "Memory Model. Memory Consistency" chapter of the OpenCL specification.
There is a special mechanism for this purpose: sub-buffer objects. See the "Buffer Objects. Creating Buffer Objects" chapter of the specification for more details. There is a very important note at the end of the clCreateSubBuffer description. It says:
Concurrent reading from, writing to and copying between both a buffer object and its sub-buffer object(s) is undefined. Concurrent reading from, writing to and copying between overlapping sub-buffer objects created with the same buffer object is undefined. Only reading from both a buffer object and its sub-buffer objects or reading from multiple overlapping sub-buffer objects is defined.
So, if your sub-buffer objects do not overlap, everything should be fine.
Moreover, there is a restriction on the sub-buffer offset: it must be a multiple of the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN value (note that OpenCL reports this value in bits). If you try to execute a kernel that uses a sub-buffer on a device for which the sub-buffer's offset is misaligned, you will get a CL_MISALIGNED_SUB_BUFFER_OFFSET error. I think this is done specifically to account for the different cache alignments and other memory access mechanisms of devices. I suppose that for Intel devices CL_DEVICE_MEM_BASE_ADDR_ALIGN is exactly equal to the cache line size.
I think you can divide your memory object properly, taking all of these requirements into account, and then use the resulting sub-buffers to process data on different devices. Everything should be fine.
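To make the division a bit more concrete, here is a minimal sketch in plain C of how the two regions could be computed. It does not call OpenCL at all. region_t is a hypothetical stand-in for cl_buffer_region (origin and size in bytes), and split_buffer is a made-up helper name. It assumes you have already queried CL_DEVICE_MEM_BASE_ADDR_ALIGN for both devices and pass in a value that satisfies both (for power-of-two values, the larger one), in bits, as OpenCL reports it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for cl_buffer_region: origin and size in bytes. */
typedef struct {
    size_t origin;
    size_t size;
} region_t;

/*
 * Split a buffer of total_bytes into two non-overlapping regions.
 * wanted_split is the byte offset where the second region should start;
 * it is rounded DOWN to the nearest offset that is a multiple of the
 * alignment. align_bits is the alignment in bits (assumed to come from
 * CL_DEVICE_MEM_BASE_ADDR_ALIGN and to be a multiple of 8).
 * Returns 0 on success, -1 if one of the regions would be empty.
 */
int split_buffer(size_t total_bytes, size_t wanted_split,
                 unsigned align_bits, region_t *first, region_t *second)
{
    assert(align_bits >= 8 && align_bits % 8 == 0);
    assert(first != NULL && second != NULL);

    size_t align_bytes = align_bits / 8;
    size_t split = (wanted_split / align_bytes) * align_bytes;

    if (split == 0 || split >= total_bytes)
        return -1;

    first->origin = 0;          /* offset 0 is always aligned */
    first->size = split;
    second->origin = split;     /* aligned by construction */
    second->size = total_bytes - split;
    return 0;
}
```

Each region would then be passed to clCreateSubBuffer with CL_BUFFER_CREATE_TYPE_REGION, one sub-buffer for the kernel on the CPU and one for the kernel on the GPU. Because the second region starts exactly where the first ends, the two sub-buffers do not overlap.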
Do you know if it makes a difference in performance if one uses, e.g., USE_HOST_PTR vs creating a separate memory buffer and copying data explicitly? On AMD's Llano and Trinity systems it seems to matter how memory is allocated (see discussion here: http://devgurus.amd.com/message/1282235). Is this the same on Ivy Bridge?
I'm more familiar with AMD technologies ))) But I'll try to answer again (=
It depends on your task and on how you use global memory. According to the Intel Optimization Guide, because Intel uses true memory sharing between the host and all devices, you can let the OpenCL runtime allocate the buffer and then map/unmap it without an extra copy (this is the simplest way). If you need specific memory management in your program, you can allocate the memory yourself and pass it when creating the buffer with the CL_MEM_USE_HOST_PTR flag, but the memory must be properly aligned, to 4096 bytes if I'm right. Only in that case will you definitely avoid extra memory copies.
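For the "allocate it yourself" route, a minimal sketch of the host-side allocation in plain C11 could look like the following. The 4096-byte figure is the one mentioned above and should be checked against the Intel guide for your driver. alloc_host_buffer and HOST_PTR_ALIGN are hypothetical names, and the size is rounded up to a multiple of the alignment mainly because C11 aligned_alloc requires it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed alignment for CL_MEM_USE_HOST_PTR memory (one 4096-byte page). */
#define HOST_PTR_ALIGN ((size_t)4096)

/*
 * Allocate at least wanted_bytes of host memory aligned to HOST_PTR_ALIGN.
 * The actual allocation size (rounded up to a multiple of the alignment)
 * is stored in *padded_bytes. Returns NULL if the allocation fails.
 * The caller releases the memory with free().
 */
void *alloc_host_buffer(size_t wanted_bytes, size_t *padded_bytes)
{
    assert(wanted_bytes > 0 && padded_bytes != NULL);

    /* Round up to the next multiple of the alignment. */
    size_t padded = (wanted_bytes + HOST_PTR_ALIGN - 1)
                    / HOST_PTR_ALIGN * HOST_PTR_ALIGN;

    void *p = aligned_alloc(HOST_PTR_ALIGN, padded);
    if (p != NULL)
        *padded_bytes = padded;
    return p;
}
```

The returned pointer and the padded size are what you would hand to clCreateBuffer together with CL_MEM_USE_HOST_PTR. The memory has to stay alive until the buffer object has been released, and only then should it be freed.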