topic Jeffrey, in OpenCL* for CPU

gpu cache friendly

Fu_J_Intel — Tue, 07 Jun 2016 16:00:23 GMT

Hi,

I'd like to understand how the GPU cache works. Specifically, I have two questions:

- Does the GPU hardware perform prefetching?

- Of the two kernels below, which one is more cache efficient (or is there not much difference)? By cache efficient, I mean a higher transfer rate from the cache to the EUs.

Kernel 1:

__kernel void foo1(__global uchar* src, __global uchar* dst)
{
    int i = get_global_id(0);
    dst[i] = src[i] >> 1;  // each work-item handles one byte
}

Kernel 2:

__kernel void foo2(__global uchar16* src, __global uchar16* dst)
{
    int i = get_global_id(0);
    dst[i] = src[i] >> (uchar)1;  // each work-item handles 16 bytes
}

Robert_I_Intel — Tue, 07 Jun 2016 18:50:31 GMT

Jeffrey,

The second version is much more efficient at bringing data in and writing it out.

The first version brings in only half a cache line per read: with a SIMD32 compilation, 32 work-items each read 1 byte, so only 32 bytes out of a 64-byte line are requested.

The second version brings in 8 full cache lines (32 work-items x 16 bytes = 512 bytes), which is much better (actually, the best you can do on our hardware).

Fu_J_Intel — Tue, 07 Jun 2016 20:46:59 GMT

Thanks for the nice info, Robert.

Robert_I_Intel — Tue, 07 Jun 2016 22:32:37 GMT

Jeffrey,

The only explanation that I have for this: you are bringing more data in but utilizing only half of it. What matters here is the LLC-to-L3 bandwidth, which is better for the second kernel.