Hi,
I am implementing an application using OpenCL targeting Intel Arria 10 GX 1150.
```c
typedef struct {
    char data[8];
} block;

block buf;
while (true) {
    global_ptr = some_complex_address_calculation;
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        buf.data[i] = global_ptr[i];
    }
    // ... use buf ...
}
```
Each iteration performs 8 consecutive reads, each of an 8-byte struct, and the compiler converts them into a single 64-byte DDR read. Since all iterations of the unrolled loop access consecutive addresses, the compiler implements a burst-coalesced LSU. The some_complex_address_calculation is such that the application sees fairly good spatial and temporal locality, yet I find the default cache that comes with the burst-coalesced cached LSU is less efficient than expected, for reasons unknown to me.
I'd appreciate it very much if you could provide more information about the following w.r.t. the burst-coalesced cached LSU:
1) What is the cache line size?
2) What is the eviction policy?
3) Does this cache load block n+1 when a read request for block n is issued? (Note that a block here is 64 bytes, so when iteration n of the while loop reads block n, can the cache pre-request block n+1?)
Thanks in advance
Hi,
Please allow me some time to check on this.
Thanks.
Hi,
The cache size depends on the memory size you would like to access. If you are using the built-in calls for loading from and storing to global memory, you define it using Argument #2 of the Load built-in. May I know which eviction policy you are referring to?
Thanks.
The details of the cache are not documented anywhere; however, in my experience:
1- The cache line size is equal to the width of the coalesced memory port. By default the cache has 512 or 1024 lines (I don't remember exactly, since nowadays I always disable the cache to prevent it from wasting precious Block RAMs).
2- It is probably something extremely simple like FIFO or LIFO; LRU at best.
3- I am pretty sure the cache doesn't pre-load anything.
In reality, exploiting your application's locality manually will always be more effective and efficient than relying on the extremely simple cache the compiler creates.