Hi,
I wonder how the GPU cache works. Specifically, I have two questions:
- Does the GPU hardware perform prefetching?
- Of the two kernels below, which one is more cache efficient (or is there not much difference)? By cache efficient, I mean a higher transfer rate from cache to the EUs.
Kernel 1:
__kernel void foo1(__global uchar* src, __global uchar* dst)
{
    int i = get_global_id(0);
    dst[i] = src[i] >> 1;
}
Kernel 2:
__kernel void foo2(__global uchar16* src, __global uchar16* dst)
{
    int i = get_global_id(0);
    dst[i] = src[i] >> (uchar)1;
}
Jeffrey,
The second version is much more efficient at bringing data in and writing it out.
The first version brings in only half a cache line per SIMD32 load (under SIMD32 compilation: 32 bytes out of 64 bytes).
The second version brings in 8 full cache lines per load, which is much better (actually, the best you can do on our hardware).
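To make the arithmetic above concrete, here is a small sketch of the bytes one SIMD32 load instruction requests for each kernel, assuming the 64-byte cache lines and SIMD32 compilation mentioned above:

```python
CACHELINE = 64   # bytes per cache line (typical for Intel GPUs, per the post above)
SIMD = 32        # SIMD32 compilation: 32 work-items execute per hardware thread

# Bytes requested by one SIMD32 load for each kernel's element type
for name, elem_bytes in [("foo1 (uchar)", 1), ("foo2 (uchar16)", 16)]:
    total = SIMD * elem_bytes
    print(f"{name}: {total} bytes = {total / CACHELINE} cache line(s) per load")
# foo1: 32 bytes = 0.5 cache line(s); foo2: 512 bytes = 8.0 cache line(s)
```

So the uchar kernel touches only half of each 64-byte line it pulls in, while the uchar16 kernel consumes 8 full lines per load.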
Thanks for the helpful info, Robert.
More questions:
- Does the GPU do prefetching?
- In VTune, the "L3 shader bandwidth" of kernel 1 is 28 GB/s, while for kernel 2 it is 10 GB/s. This looks like the opposite of my understanding.
Jeffrey,
The only explanation I have for this: kernel 1 brings more data in but utilizes only half of it, so its raw L3 bandwidth looks higher. The important metric here is the LLC$-to-L3$ bandwidth, which is better for the second kernel.
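One way to read the VTune numbers from this thread under that explanation: the raw L3 figure overstates kernel 1's useful throughput, because only half of each fetched cache line is consumed. A hedged sketch (the 0.5 utilization figure is the half-cache-line assumption from earlier in the thread, not a measurement):

```python
# Measured "L3 shader bandwidth" from VTune, as reported in this thread (GB/s)
raw_bw = {"foo1": 28.0, "foo2": 10.0}

# Assumed fraction of each fetched cache line actually consumed:
# foo1 uses 32 of 64 bytes (0.5); foo2 uses all 64 bytes (1.0)
utilization = {"foo1": 0.5, "foo2": 1.0}

for k in raw_bw:
    effective = raw_bw[k] * utilization[k]
    print(f"{k}: raw {raw_bw[k]} GB/s -> ~{effective} GB/s of useful data")
```

Even with that correction, effective throughput alone does not explain the whole gap; the LLC-to-L3 traffic mentioned above is the better point of comparison.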