Hi,
I wonder how the GPU cache works. Specifically, I have two questions:
- Does the GPU hardware perform prefetching?
- Of the two kernels below, which one is more cache efficient (or is there not much difference)? By cache efficient, I mean a higher transfer rate from cache to the EUs.
Kernel 1:
__kernel void foo1(__global uchar* src, __global uchar* dst)
{
    int i = get_global_id(0);
    dst[i] = src[i] >> 1;
}
Kernel 2:
__kernel void foo2(__global uchar16* src, __global uchar16* dst)
{
    int i = get_global_id(0);
    dst[i] = src[i] >> (uchar)1;
}
Jeffrey,
The second version is much more efficient at bringing data in and writing it out.
The first version brings in only half a cache line per SIMD32 load (under SIMD32 compilation: 32 bytes out of 64 bytes).
The second version brings in 8 full cache lines per load, which is much better (actually, the best you can do on our hardware).
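To make the arithmetic above concrete, here is a small sketch of the bytes one SIMD32 load instruction requests for each kernel, assuming the 64-byte cache lines and SIMD32 compilation mentioned above:

```python
CACHELINE = 64   # bytes per cache line (typical for Intel GPUs, per the post above)
SIMD = 32        # SIMD32 compilation: 32 work-items execute per hardware thread

# Bytes requested by one SIMD32 load for each kernel's element type
for name, elem_bytes in [("foo1 (uchar)", 1), ("foo2 (uchar16)", 16)]:
    total = SIMD * elem_bytes
    print(f"{name}: {total} bytes = {total / CACHELINE} cache line(s) per load")
# foo1: 32 bytes = 0.5 cache line(s); foo2: 512 bytes = 8.0 cache line(s)
```

So the uchar kernel touches only half of each 64-byte line it pulls in, while the uchar16 kernel consumes 8 full lines per load.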
Thanks for the helpful info, Robert.
More questions:
- Does the GPU do prefetching?
- In VTune, the "L3 shader bandwidth" of kernel 1 is 28 GB/s, while for kernel 2 it is 10 GB/s. This looks like the opposite of my understanding.
Jeffrey,
The only explanation I have for this: kernel 1 brings more data in but utilizes only half of it, so its raw L3 bandwidth looks higher. The important metric here is the LLC$-to-L3$ bandwidth, which is better for the second kernel.
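One way to read the VTune numbers from this thread under that explanation: the raw L3 figure overstates kernel 1's useful throughput, because only half of each fetched cache line is consumed. A hedged sketch (the 0.5 utilization figure is the half-cache-line assumption from earlier in the thread, not a measurement):

```python
# Measured "L3 shader bandwidth" from VTune, as reported in this thread (GB/s)
raw_bw = {"foo1": 28.0, "foo2": 10.0}

# Assumed fraction of each fetched cache line actually consumed:
# foo1 uses 32 of 64 bytes (0.5); foo2 uses all 64 bytes (1.0)
utilization = {"foo1": 0.5, "foo2": 1.0}

for k in raw_bw:
    effective = raw_bw[k] * utilization[k]
    print(f"{k}: raw {raw_bw[k]} GB/s -> ~{effective} GB/s of useful data")
```

Even with that correction, effective throughput alone does not explain the whole gap; the LLC-to-L3 traffic mentioned above is the better point of comparison.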