OpenCL* for CPU

gpu cache friendly

Fu_J_Intel
Employee

Hi,

I'd like to understand how the GPU cache works. Specifically, I have two questions:

- Does the GPU hardware perform prefetching?

- Of the two kernels below, which one is more cache efficient, or is there not much difference? By cache efficient, I mean a higher transfer rate from the cache to the EUs.

Kernel 1:

__kernel void foo1(__global uchar* src, __global uchar* dst)
{
        int i = get_global_id(0);

        /* each work-item shifts one byte */
        dst[i] = src[i] >> 1;
}

 

Kernel 2:

__kernel void foo2(__global uchar16* src, __global uchar16* dst)
{
        int i = get_global_id(0);

        /* each work-item shifts 16 bytes, component-wise */
        dst[i] = src[i] >> (uchar) 1;
}
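For reference, here is a minimal host-side sketch in plain C of what each work-item in the two kernels computes, assuming the intended statement is `dst[i] = src[i] >> 1`. The names `emulate_foo1` and `emulate_foo2` are hypothetical and only for illustration; the loop over `i` stands in for the NDRange.

```c
#include <stddef.h>
#include <stdint.h>

/* Emulates foo1: one work-item per byte. */
void emulate_foo1(const uint8_t *src, uint8_t *dst, size_t global_size)
{
    for (size_t i = 0; i < global_size; ++i)
        dst[i] = (uint8_t)(src[i] >> 1);
}

/* Emulates foo2: one work-item per 16-byte vector,
   with the shift applied to each component. */
void emulate_foo2(const uint8_t *src, uint8_t *dst, size_t global_size)
{
    for (size_t i = 0; i < global_size; ++i)
        for (size_t lane = 0; lane < 16; ++lane)
            dst[16 * i + lane] = (uint8_t)(src[16 * i + lane] >> 1);
}
```

Both produce the same output bytes, but the second needs only 1/16 as many work-items for the same buffer, so its global size has to be the byte count divided by 16.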

 

Robert_I_Intel
Employee

Jeffrey,

The second version is much more efficient at bringing data in and writing it out.

With SIMD32 compilation, each hardware thread in the first version reads 32 work-items x 1 byte = 32 bytes, which is only half of a 64-byte cache line.

In the second version, each thread reads 32 work-items x 16 bytes = 512 bytes, which is 8 full cache lines. That is much better (actually, the best you can do on our hardware).
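The arithmetic behind those figures can be sketched in a few lines of C. This is only an illustration, assuming a 64-byte cache line, SIMD32 compilation, and consecutive, line-aligned addresses across the work-items of one hardware thread; the function names are hypothetical.

```c
#include <stddef.h>

/* Assumed cache-line size in bytes (the 64-byte figure quoted above). */
#define CACHE_LINE_BYTES 64

/* Bytes touched by one hardware thread when each of its simd_width
   work-items reads one element of elem_bytes at consecutive addresses. */
size_t bytes_per_thread_access(size_t simd_width, size_t elem_bytes)
{
    return simd_width * elem_bytes;
}

/* Number of cache lines completely covered by that access
   (0 if the access is smaller than one line). */
size_t full_cache_lines(size_t simd_width, size_t elem_bytes)
{
    return bytes_per_thread_access(simd_width, elem_bytes) / CACHE_LINE_BYTES;
}

/* Percentage of the fetched cache-line bytes that the thread actually uses,
   assuming the access starts on a cache-line boundary. */
unsigned line_utilization_percent(size_t simd_width, size_t elem_bytes)
{
    size_t bytes = bytes_per_thread_access(simd_width, elem_bytes);
    size_t lines_touched = (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
    return (unsigned)(100 * bytes / (lines_touched * CACHE_LINE_BYTES));
}
```

With these assumptions, `line_utilization_percent(32, 1)` gives 50 for the `uchar` kernel, while `full_cache_lines(32, 16)` gives 8 and `line_utilization_percent(32, 16)` gives 100 for the `uchar16` kernel.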

Fu_J_Intel
Employee

Thanks for the helpful info, Robert.

More questions:

- Does the GPU do prefetching?

- In VTune, the "L3 shader bandwidth" of kernel 1 is 28 GB/s, while for kernel 2 it is 10 GB/s. This looks like the opposite of what I expected.

Robert_I_Intel
Employee

Jeffrey,

The only explanation I have for this: with the first kernel you are bringing more data in but utilizing only half of it. The important metric here is the LLC-to-L3 bandwidth, which is better for the second kernel.
