unified address space?

dominik_grewe · ‎06-19-2012

I'm trying to understand the memory system in Ivy Bridge with regard to OpenCL execution on CPU and GPU. In particular, I'm interested in the following questions:

Is the address space for the CPU and GPU on Ivy Bridge unified?
Can the CPU and GPU share the same buffers in OpenCL?

Thanks
Dominik

EvgeniyPeshkov · ‎06-21-2012

Hi Dominik!

First of all, of course you can share the same buffer between CPU and GPU devices; OpenCL allows it. Simply create an OpenCL context associated with both the CPU and the GPU, and create the buffer in that context. The buffer can then be used by all devices associated with the context.

Intel CPU and GPU use the same physical memory, so memory objects are effectively shared by default and no implicit memory transfer happens between devices or on map/unmap calls. You can read about this in the OpenCL* Optimization Guide, in the Sharing Resources Efficiently topic.

dominik_grewe · ‎06-22-2012

Great, thanks for your reply!

I have a further question about memory sharing: Is it okay to run a kernel on each device - CPU & GPU - that write to the same buffer but different addresses? What if the CPU and GPU write to two (different) addresses that are in the same cache line? Is this handled correctly?

EvgeniyPeshkov · ‎06-22-2012

I haven't tried this, but I'll try to answer and share my thoughts.

First of all, it would be bad programming style. Second, and most importantly, the behaviour would be undefined according to the OpenCL specification. OpenCL uses a relaxed-consistency memory model, and memory consistency for memory objects shared between queued commands is only guaranteed at synchronization points (clFinish, clWaitForEvents, event dependencies, etc.). So with one buffer accessed at different offsets on each device, the behaviour is undefined. You can find more information in the "Memory Model. Memory Consistency" chapter of the OpenCL specification.

There is a special mechanism for this purpose: sub-buffer objects. See the "Buffer Objects. Creating Buffer Objects" chapter of the specification for more details. There is a very important note at the end of the clCreateSubBuffer description. It says:

Concurrent reading from, writing to and copying between both a buffer object and its sub-buffer object(s) is undefined. Concurrent reading from, writing to and copying between overlapping sub-buffer objects created with the same buffer object is undefined. Only reading from both a buffer object and its sub-buffer objects or reading from multiple overlapping sub-buffer objects is defined.

So, if your sub-buffer objects do not overlap, everything should be fine.

Moreover, there is a limitation on the offset of a sub-buffer: it must be a multiple of the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN value. If you try to execute a kernel that uses a sub-buffer on a device for which the sub-buffer's offset is misaligned, you will get a CL_MISALIGNED_SUB_BUFFER_OFFSET error. I think this is done specifically to take into account the different cache alignments and other memory access mechanisms of devices. I suppose that for Intel devices CL_DEVICE_MEM_BASE_ADDR_ALIGN is exactly equal to the cache line size.

I think you can divide your memory object properly, taking all these requirements into account, and then use the sub-buffers to process data on different devices. Everything should be fine.
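To make the splitting step concrete, here is a minimal sketch in plain C of how one might pick an aligned split point for two non-overlapping regions. It deliberately avoids the OpenCL headers: `region_t` is a hypothetical stand-in for `cl_buffer_region`, and `split_buffer` and `round_down` are illustrative names, not part of any API. It assumes the alignment comes from `CL_DEVICE_MEM_BASE_ADDR_ALIGN`, which the specification reports in bits, hence the conversion to bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for OpenCL's cl_buffer_region (origin and size in
   bytes), so that this sketch compiles without the OpenCL headers. */
typedef struct {
    size_t origin;
    size_t size;
} region_t;

/* Round x down to a multiple of align_bytes (align_bytes must be > 0). */
static size_t round_down(size_t x, size_t align_bytes)
{
    return x - (x % align_bytes);
}

/* Split [0, total_bytes) into two non-overlapping regions. The split point
   is moved down to a multiple of the device alignment, which is passed in
   bits, the unit used by CL_DEVICE_MEM_BASE_ADDR_ALIGN. */
static void split_buffer(size_t total_bytes, size_t wanted_first_bytes,
                         unsigned align_bits,
                         region_t *first, region_t *second)
{
    assert(align_bits >= 8 && align_bits % 8 == 0);
    assert(wanted_first_bytes <= total_bytes);

    size_t align_bytes = align_bits / 8;
    size_t split = round_down(wanted_first_bytes, align_bytes);

    first->origin = 0;
    first->size = split;
    second->origin = split;
    second->size = total_bytes - split;
}
```

Each resulting region would then be passed to clCreateSubBuffer with CL_BUFFER_CREATE_TYPE_REGION. With two devices, the stricter (larger) of the two alignment values should presumably be used, and a region whose size comes out as 0 cannot be turned into a sub-buffer.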

dominik_grewe · ‎06-24-2012

You're right. Using sub-buffers is probably the cleaner way of doing this. And the alignment restriction makes sure that sub-buffers don't overlap in terms of cache lines.

Thanks for your reply!

dominik_grewe · ‎07-01-2012

Do you know if it makes a difference in performance if one uses e.g. CL_MEM_USE_HOST_PTR vs. creating a separate memory buffer and copying data explicitly?
On AMD's Llano and Trinity systems it seems to matter how memory is allocated (see discussion here: http://devgurus.amd.com/message/1282235). Is this the same on Ivy Bridge?

EvgeniyPeshkov · ‎07-02-2012

I'm more familiar with AMD technologies ))) But I'll try to answer again (=

It depends on your task and on how you use global memory. According to the Intel Optimization Guide, because Intel uses true memory sharing across the host and all devices, you can let the OpenCL runtime allocate the buffer and then map/unmap it without an extra copy (this is the simplest way). If you need specific memory management in your program, you can allocate the memory yourself and pass it when creating the buffer with the CL_MEM_USE_HOST_PTR flag, but the memory must be properly aligned, to 4096 bytes if I'm right. Only in this case do you definitely avoid extra memory copies.

For more information, see the Mapping Memory Objects chapter of the Intel Optimization Guide.
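As a rough illustration of the "allocate it yourself" path, the sketch below allocates host memory aligned to 4096 bytes using C11 aligned_alloc. The 4096-byte figure is the one recalled above and should be treated as an assumption to check against the guide. `HOST_PTR_ALIGN`, `alloc_host_buffer` and `is_host_ptr_aligned` are hypothetical names used only for this example. The size is rounded up to a multiple of the alignment because C11 aligned_alloc requires that.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Alignment mentioned in the thread for CL_MEM_USE_HOST_PTR buffers.
   This is an assumption; verify it against the Intel guide. */
#define HOST_PTR_ALIGN 4096u

/* Round n up to the next multiple of align (align must be > 0). */
static size_t round_up(size_t n, size_t align)
{
    return (n + align - 1) / align * align;
}

/* Allocate at least `bytes` of host memory aligned to HOST_PTR_ALIGN.
   The actual allocation size is stored in *padded_bytes if it is non-NULL.
   Returns NULL on failure; release the memory with free(). */
static void *alloc_host_buffer(size_t bytes, size_t *padded_bytes)
{
    assert(bytes > 0);

    size_t padded = round_up(bytes, HOST_PTR_ALIGN);
    if (padded_bytes != NULL) {
        *padded_bytes = padded;
    }
    return aligned_alloc(HOST_PTR_ALIGN, padded);
}

/* Return 1 if p is aligned to HOST_PTR_ALIGN, 0 otherwise. */
static int is_host_ptr_aligned(const void *p)
{
    return ((uintptr_t)p % HOST_PTR_ALIGN) == 0;
}
```

The returned pointer and the padded size would then be handed to clCreateBuffer together with CL_MEM_USE_HOST_PTR. The memory has to stay allocated for as long as the buffer object exists.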

dominik_grewe · ‎07-02-2012

Thanks for the reply! So it doesn't matter for performance how memory is allocated, unlike on AMD's platform.