When using OpenCL’s fine-grained SVM on Intel platforms, you are enabling hardware coherency between the CPU and GPU caches. Depending on an application’s access patterns, enabling coherency can carry some bandwidth overhead. This is most acute when data is actually shared and access is contended.
In your description, you indicate that you are not actually sharing data values. As you describe, your compute threads are strictly segregated in their uses of the shared SVM buffer, each accessing distinct regions. But “false sharing” of a cacheline between CPU and GPU can still happen, just as it happens between two CPU threads that share memory. So even though your application does not directly access the same actual data values from different compute threads, it is possible that the cachelines in which the data values live are still being shared across distinct regions. Thus you might inadvertently have some contended access to shared cachelines. Cachelines are 64 bytes on both Intel CPUs and Intel GPUs.
Try implementing with OpenCL’s coarse-grained SVM. If I understand your usage pattern correctly, this may be a better fit. Coarse-grained SVM does not enable hardware coherency; instead, updates to the SVM buffer are enforced at API call boundaries (map/unmap).
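Not from the thread itself, but a minimal host-side sketch of the coarse-grained pattern may help. It assumes a `cl_context`, `cl_command_queue`, and `cl_kernel` already exist (the names `ctx`, `queue`, `kernel`, and `coarse_grain_example` are placeholders), and omits error checking:

```c
/* Sketch only: requires an OpenCL 2.0 platform and device to actually run. */
#include <CL/cl.h>

void coarse_grain_example(cl_context ctx, cl_command_queue queue,
                          cl_kernel kernel, size_t nbytes) {
    /* Coarse-grained SVM: note the absence of CL_MEM_SVM_FINE_GRAIN_BUFFER,
     * so no hardware coherency is requested; memory consistency is enforced
     * only at map/unmap and kernel enqueue boundaries. */
    void *svm = clSVMAlloc(ctx, CL_MEM_READ_WRITE, nbytes, 0);

    /* Host access to a coarse-grained buffer must be bracketed by map/unmap. */
    clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, svm, nbytes, 0, NULL, NULL);
    /* ... CPU fills or updates the buffer here ... */
    clEnqueueSVMUnmap(queue, svm, 0, NULL, NULL);

    /* Hand the SVM pointer to the kernel and let the GPU work on it. */
    clSetKernelArgSVMPointer(kernel, 0, svm);
    size_t gws = nbytes;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clFinish(queue);

    clSVMFree(ctx, svm);
}
```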
If you are using fine-grained SVM in such a segregated manner, try aligning your distinct regions on 64-byte boundaries. Said differently, lay out your shared data structure so that CPU threads and GPU threads never concurrently access, and thereby share, the same 64-byte cacheline.
**** EDIT **** it turns out my observations below are not totally accurate. I will update this thread with more accurate results after I gain more understanding of the behavior.
Thanks a lot for your reply. Using coarse-grained SVM actually removed the overhead entirely! So I guess cache coherency is not enabled in the case of coarse-grained SVM.
Also I have just 2 notes for people who might be interested in the issue:
- When I was working with fine grain SVM, I had my regions aligned on 64B boundary like you suggested. So the overhead I reported above happens even when the alignment is taken care of.
- Currently, using coarse-grained SVM, I don't even have to map/unmap the regions the CPU is working on! Despite that, everything works perfectly fine. According to the spec, the behavior of using a region on the host without mapping it (and, similarly, using a region on the device without unmapping it if it was mapped before) is undefined. I understood this from the last paragraph of section 5.6.1 of the spec.
So my question (conclusion) here is: is it safe to use coarse-grained SVM buffer regions on the host without mapping them first, if you can guarantee that the device won't write to those regions? Please correct me if I am wrong about this.
Using coarse-grained SVM changed the order in which the CPU and GPU work on the shared memory object, giving exactly the same total time (processing + overhead) as fine-grained SVM! With fine-grained SVM, the CPU and GPU worked on the memory object at the same time, resulting in the coherency overhead Stephen noted above. With coarse-grained SVM, it was almost as if the CPU had to finish its work completely before the GPU could start.
We ended up using sub-buffers. I created sub-buffers for the regions the CPU should work on and then mapped these sub-buffers to the host. This way more than 50% of the overhead is gone and CPU and GPU work in parallel. The 2nd paragraph of section 5.6.2 of the OpenCL 2.0 spec talks about this strategy.
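A host-side sketch of that sub-buffer strategy, not taken from the thread, might look like the following. It assumes an existing context, queue, and parent `cl_mem` buffer (the function name `map_cpu_region` is made up), and omits error checking; note that a sub-buffer's origin must be aligned to the device's `CL_DEVICE_MEM_BASE_ADDR_ALIGN`:

```c
/* Sketch only: requires an OpenCL platform and device to actually run. */
#include <CL/cl.h>

void map_cpu_region(cl_context ctx, cl_command_queue queue, cl_mem parent,
                    size_t origin, size_t size) {
    cl_int err;
    cl_buffer_region region = { origin, size };

    /* Carve out just the CPU-owned region as a sub-buffer... */
    cl_mem sub = clCreateSubBuffer(parent, CL_MEM_READ_WRITE,
                                   CL_BUFFER_CREATE_TYPE_REGION, &region, &err);

    /* ...and map only that sub-buffer for host access, so the GPU can keep
     * working on the rest of the parent buffer concurrently. */
    void *host_ptr = clEnqueueMapBuffer(queue, sub, CL_TRUE, CL_MAP_WRITE,
                                        0, size, 0, NULL, NULL, &err);
    /* ... CPU works on host_ptr here ... */
    clEnqueueUnmapMemObject(queue, sub, host_ptr, 0, NULL, NULL);
    clReleaseMemObject(sub);
}
```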