OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.

Relaxing SVM memory consistency in OpenCL 2.0

Kareem_E_
Beginner

I am trying to find a way to relax the memory consistency imposed by the OpenCL 2.0 runtime. To clarify my goal, suppose you have the following scenario:

  • You have a fine-grained SVM memory object that is to be written by the CPU and the GPU at the same time.
  • You have some method that will launch 1 or more kernels on the GPU. Let's call this method launch_kernels. All kernels launched by launch_kernels will manipulate the SVM object.
  • You have another CPU method that will also do some processing on the data of the SVM object. Let's call it cpu_process.
  • All GPU kernels AND the cpu_process method each work on a different region of the SVM object.

So you can imagine a code scenario like this:

void* svm_obj;
allocate_svm_object(&svm_obj); // allocate the fine-grained SVM buffer
launch_kernels(svm_obj);       // will return immediately without waiting for kernels to finish
cpu_process(svm_obj);          // CPU works on its regions while the GPU kernels run
sync_gpu();                    // wait for prev launched kernels to finish
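For illustration only (not from the original post), here is a minimal sketch of what allocate_svm_object, launch_kernels and sync_gpu might look like with the standard OpenCL 2.0 host API; the context, queue, kernel and buf_size/global_size names are placeholders assumed to exist already:

#include <CL/cl.h>

/* Sketch only: allocate a fine-grained SVM buffer of buf_size bytes. */
void* allocate_fine_grained_svm(cl_context ctx, size_t buf_size)
{
    return clSVMAlloc(ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                      buf_size, 0 /* default alignment */);
}

/* Sketch only: bind the SVM pointer to the kernel and enqueue it without
 * blocking, mirroring the "returns immediately" behaviour above. */
void launch_kernels(cl_command_queue queue, cl_kernel kernel,
                    void* svm_obj, size_t global_size)
{
    clSetKernelArgSVMPointer(kernel, 0, svm_obj);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           0, NULL, NULL);
    clFlush(queue);   /* submit, but do not wait */
}

void sync_gpu(cl_command_queue queue)
{
    clFinish(queue);  /* wait for previously launched kernels to finish */
}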

Here is my situation:

  • When launch_kernels(svm_obj) is called on its own (i.e. removing the cpu_process(svm_obj); line above), it takes about 5 ms.
  • When cpu_process(svm_obj); is called on its own (i.e. removing the launch_kernels(svm_obj); and sync_gpu(); lines above), it also takes about 5 ms.
  • When they are called together in parallel (i.e. the exact scenario above), each one takes about 3 ms of additional time, for a total of about 8 ms each.

I suppose this additional overhead is added by the OpenCL runtime to guarantee consistency of the SVM memory object. However, in my case, I can guarantee consistency without the runtime's help because no memory location is written to by more than one execution unit.

My question is: is there a way to relax the memory consistency of OpenCL 2.0 so that I can remove this additional overhead?

Stephen_J_Intel
Employee

When using OpenCL’s fine-grained memory on Intel platforms, you are enabling hardware coherency between CPU and GPU caches. Depending on an application's access patterns, there can be some bandwidth overhead to enabling coherency. This is most acute when data is actually shared and the access is contended.

In your description, you indicate that you are not actually sharing data values. As you describe, your compute threads are strictly segregated in their use of the shared SVM buffer, each accessing distinct regions. But “false sharing” of a cacheline between the CPU and GPU can still happen, just as it happens between two CPU threads that share memory. So even though your application never directly accesses the same data values from different compute threads, the cachelines in which those values live may still straddle your distinct regions, and you might inadvertently have some contended access to shared cachelines. Cachelines are 64 bytes on both Intel CPUs and Intel GPUs.
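As a rough sketch of what that means in practice (the helper name is illustrative, not from the thread; the 64-byte figure is the one quoted above), one way to keep regions from sharing a cacheline is to round every region size up to a whole number of cachelines:

#include <stddef.h>

#define CACHELINE_BYTES 64  /* cacheline size on Intel CPUs and GPUs */

/* Round a region size up to a whole number of cachelines, so regions handed
 * to different compute threads never overlap inside a single cacheline. */
static size_t round_up_to_cacheline(size_t bytes)
{
    return (bytes + CACHELINE_BYTES - 1) & ~(size_t)(CACHELINE_BYTES - 1);
}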

Some suggestions:

  1. Try implementing this using OpenCL’s Coarse Grain SVM (see the sketch after these suggestions). This may be a better fit, if I understand your usage pattern correctly. Coarse Grain does not enable hardware coherency, and updates to the SVM buffer are enforced at API call boundaries.

  2. If using Fine Grain SVM in such a segregated manner, try aligning your distinct regions along 64B boundaries. Said differently, you want to lay out your shared data structure so that CPU threads and GPU threads do not concurrently access and share the same 64B cacheline.
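Here is a minimal sketch of suggestion 1, assuming ctx, queue and cpu_process already exist and using placeholder names (buf_size, cpu_region_offset, cpu_region_size) that are not from the original post:

/* Coarse-grained SVM: no CL_MEM_SVM_FINE_GRAIN_BUFFER flag, so there is no
 * hardware coherency; updates become visible at map/unmap boundaries. */
void* svm_obj = clSVMAlloc(ctx, CL_MEM_READ_WRITE, buf_size,
                           64 /* align regions to cachelines, per suggestion 2 */);

/* Map the region the CPU will touch (blocking map for simplicity). */
clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE,
                (char*)svm_obj + cpu_region_offset, cpu_region_size,
                0, NULL, NULL);

cpu_process(svm_obj);   /* CPU stays inside its mapped region */

/* Unmap before the GPU (or the next API call) uses that region again. */
clEnqueueSVMUnmap(queue, (char*)svm_obj + cpu_region_offset, 0, NULL, NULL);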

regards, -Stephen

Kareem_E_
Beginner

**** EDIT **** It turns out my observations below are not totally accurate. I will update this thread with more accurate results after I gain a better understanding of the behavior.

 

Hi Stephen,

Thanks a lot for your reply. Using coarse-grained SVM actually removed the overhead completely! So I guess cache coherency is not enabled in the case of coarse-grained SVM.

Also, I have two notes for people who might be interested in the issue:

  1. When I was working with fine-grained SVM, I already had my regions aligned on 64B boundaries as you suggested. So the overhead I reported above happens even when the alignment is taken care of.
  2. Currently, using coarse-grained SVM, I don't even have to map/unmap the regions the CPU is working on! Despite that, everything works perfectly fine. According to the spec, the behavior of using a region on the host without mapping it (and similarly, using a region on the device without unmapping it if it was mapped before) is undefined. I understood this from the last paragraph of section 5.6.1 of the spec.

So my question (conclusion) here is: is it safe to use coarse-grained SVM buffer regions without mapping them to the host first, if you can guarantee that the device won't write to those regions? Please correct me if I am wrong about this.

Regards,

Kareem

Kareem_E_
Beginner

Hello again,

Using coarse-grained SVM changed the order in which the CPU and GPU work on the shared memory object, giving the exact same total time (processing + overhead) as fine-grained SVM! In the fine-grained case, the CPU and GPU worked on the memory object at the same time, resulting in the coherency overhead Stephen noted above. In the coarse-grained case, it was almost as if the CPU had to finish its work completely before the GPU could start working.

We ended up using sub-buffers. I created sub-buffers for the regions the CPU should work on and then mapped these sub-buffers to the host. This way, more than 50% of the overhead is gone and the CPU and GPU work in parallel. The 2nd paragraph of section 5.6.2 of the OpenCL 2.0 spec talks about this strategy.
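For readers who want to try the same thing, here is a rough sketch of one way to read that sub-buffer strategy (my interpretation, not the exact code; ctx, queue, svm_obj and buf_size are assumed from earlier, and cpu_region_offset/cpu_region_size are placeholders):

/* Wrap the coarse-grained SVM allocation in a cl_mem so sub-buffers can be
 * created from it (an SVM pointer may be passed as host_ptr). */
cl_mem parent = clCreateBuffer(ctx, CL_MEM_USE_HOST_PTR | CL_MEM_READ_WRITE,
                               buf_size, svm_obj, NULL);

/* One sub-buffer per region the CPU will work on; the origin should respect
 * CL_DEVICE_MEM_BASE_ADDR_ALIGN. */
cl_buffer_region region = { cpu_region_offset, cpu_region_size };
cl_mem cpu_sub = clCreateSubBuffer(parent, CL_MEM_READ_WRITE,
                                   CL_BUFFER_CREATE_TYPE_REGION, &region, NULL);

/* Map only the CPU's sub-buffer; the GPU kernels keep working on the rest. */
void* cpu_ptr = clEnqueueMapBuffer(queue, cpu_sub, CL_TRUE,
                                   CL_MAP_READ | CL_MAP_WRITE,
                                   0, cpu_region_size, 0, NULL, NULL, NULL);

/* ... cpu_process works through cpu_ptr while the GPU works elsewhere ... */

clEnqueueUnmapMemObject(queue, cpu_sub, cpu_ptr, 0, NULL, NULL);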
