Employee
10 Views

gpu cache friendly

Hi,

I'd like to understand how the GPU cache works. Specifically, I have two questions:

- Does the GPU hardware perform prefetching?

- Of the two kernels below, which one is more cache efficient (or is there not much difference)? By cache efficient, I mean a higher transfer rate from the cache to the EUs.

kernel 1,

__kernel void foo1(__global uchar* src, __global uchar* dst)
{
    int i = get_global_id(0);
    dst[i] = src[i] >> 1;   /* each work-item handles one byte */
}

 

kernel 2,

__kernel void foo2(__global uchar16* src, __global uchar16* dst)
{
    int i = get_global_id(0);
    dst[i] = src[i] >> (uchar)1;   /* each work-item handles 16 bytes */
}

 

3 Replies
Employee
10 Views

Jeffrey,

The second version is much more efficient at bringing data in and writing it out.

The first version brings in only half a cache line per access: with SIMD32 compilation, 32 work-items each read one byte, so only 32 bytes out of the 64-byte cache line are used.

The second version brings in 8 full cache lines per access (32 work-items x 16 bytes = 512 bytes), which is much better (actually, the best you can do on our hardware).
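To make the arithmetic above concrete, here is a minimal sketch in plain C (host-side, not kernel code). It assumes the 64-byte cache line and SIMD32 compilation mentioned above, and that each access starts on a cache-line boundary; the function names are made up for illustration and are not part of any Intel or OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines, as stated in the reply above. */
enum { CACHE_LINE_BYTES = 64 };

/* Bytes requested by one SIMD-wide access when every work-item
   reads one contiguous element of elem_bytes bytes. */
static size_t bytes_per_simd_access(size_t simd_width, size_t elem_bytes)
{
    assert(simd_width > 0 && elem_bytes > 0);
    return simd_width * elem_bytes;
}

/* Cache lines touched by that access, rounding up and assuming
   the access starts on a cache-line boundary. */
static size_t cache_lines_touched(size_t bytes)
{
    return (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}

/* Percentage of the fetched cache-line bytes that are actually used. */
static size_t percent_of_lines_used(size_t bytes)
{
    return 100 * bytes / (cache_lines_touched(bytes) * CACHE_LINE_BYTES);
}
```

With a SIMD width of 32, foo1 (1-byte uchar elements) requests 32 bytes, touches one cache line, and uses 50% of it; foo2 (16-byte uchar16 elements) requests 512 bytes, touches 8 cache lines, and uses 100% of them.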

Employee
10 Views

Thanks for the helpful info, Robert.

More questions:

- Does the GPU do prefetching?

- In VTune, the "L3 shader bandwidth" of kernel 1 is 28 GB/s, while for kernel 2 it is 10 GB/s. This looks like the opposite of what I expected.

Employee
10 Views

Jeffrey,

The only explanation I have for this is that you are bringing more data in but using only half of it. The important metric here is the LLC-to-L3 cache bandwidth, which is better for the second kernel.
