Eviction policy of burst-coalesced cached non-aligned LSU's cache


I am implementing an application using OpenCL targeting Intel Arria 10 GX 1150.

    typedef struct { char data[8]; } block;
    block buf[8];
    while (true) {
        global_ptr = some_complex_address_calculation;
        #pragma unroll
        for (int i = 0; i < 8; i++) {
            buf[i] = global_ptr[i];  // destination name was lost in the post; buf[i] assumed
        }
    }

I'm performing 8 consecutive reads, each of which is a struct of 8 bytes. The compiler converts this into a single 64-byte DDR read. Since all the iterations of the unrolled loop access consecutive addresses, the compiler implements a burst-coalesced LSU. The some_complex_address_calculation is such that the application sees pretty good data locality and temporal locality. However, I find that the default cache that comes with the burst-coalesced cached LSU isn't as effective as I expected, for reasons I haven't been able to identify.
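To make the access arithmetic above concrete, here is a small host-side C sketch (plain C, not kernel code). It checks that 8 structs of 8 bytes add up to 64 bytes, and counts how many 64-byte memory words one such access touches depending on its start address. The names access_bytes and words_touched and the 64-byte word size are assumptions for illustration only; nothing here describes the LSU's actual internals.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { char data[8]; } block;

/* Assumed for illustration: 8 blocks per unrolled access, 64-byte memory word. */
enum { BLOCKS_PER_ACCESS = 8, WORD_BYTES = 64 };

/* Total bytes requested by one fully unrolled loop body. */
static uint64_t access_bytes(void)
{
    return BLOCKS_PER_ACCESS * sizeof(block);
}

/* Number of WORD_BYTES-sized words covered by [byte_addr, byte_addr + nbytes). */
static uint64_t words_touched(uint64_t byte_addr, uint64_t nbytes)
{
    assert(nbytes > 0);
    uint64_t first = byte_addr / WORD_BYTES;
    uint64_t last  = (byte_addr + nbytes - 1) / WORD_BYTES;
    return last - first + 1;
}
```

For example, words_touched(0, access_bytes()) covers one word, while a start address that is 8-byte aligned but not 64-byte aligned, such as words_touched(8, access_bytes()), straddles two.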

I'll appreciate it very much if you can provide more information about the following w.r.t. burst-coalesced cached LSU:

1) What is the size of cache line?

2) What is the eviction policy?

3) Does this cache load the (n+1)th block when a read request for the nth block is issued? (Please note that a block here is 64 bytes, so when iteration n of the while loop reads the nth data block, can the cache prefetch the (n+1)th block?)


Thanks in advance

Please allow me some time to check on this.





The cache size depends on the amount of memory that you would like to access. If you are using the built-in calls for loading from and storing to global memory, you have to specify the cache size through Argument #2 of the load built-in. May I know which eviction policy you are referring to?




The details of the cache are not documented anywhere; however, in my experience:


1- The cache line size is equal to the width of the coalesced memory port. By default, the cache has 512 or 1024 lines (I don't remember exactly, since nowadays I always disable the cache to prevent it from wasting precious Block RAMs).

2- It is probably something extremely simple like FIFO or LIFO; in the best case, LRU.

3- I am pretty sure the cache doesn't pre-load anything.
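To illustrate why the eviction policy in point 2 matters for an access pattern with temporal locality, here is a toy host-side C model of a tiny fully associative cache that counts hits under FIFO and LRU. Everything in it (the 2-line capacity, the names count_hits, POLICY_FIFO and POLICY_LRU) is hypothetical; it is not a description of what the compiler-generated cache actually does, which is undocumented.

```c
#include <assert.h>
#include <stddef.h>

enum { WAYS = 2 };  /* hypothetical tiny cache: 2 lines, fully associative */
typedef enum { POLICY_FIFO, POLICY_LRU } policy;

/* Count cache hits for a trace of line indices under the given policy. */
static int count_hits(const int *trace, size_t n, policy p)
{
    int tag[WAYS];
    unsigned long stamp[WAYS];
    int valid[WAYS] = {0};
    unsigned long now = 0;
    int hits = 0;

    assert(trace != NULL || n == 0);
    for (size_t t = 0; t < n; t++) {
        int found = -1;
        now++;
        for (int w = 0; w < WAYS; w++)
            if (valid[w] && tag[w] == trace[t])
                found = w;
        if (found >= 0) {
            hits++;
            if (p == POLICY_LRU)
                stamp[found] = now;  /* LRU refreshes the line on every use */
            continue;                /* FIFO keeps the original fill time */
        }
        /* Miss: fill an empty way if one exists, else evict the oldest stamp. */
        int victim = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!valid[w]) { victim = w; break; }
            if (stamp[w] < stamp[victim]) victim = w;
        }
        tag[victim] = trace[t];
        stamp[victim] = now;
        valid[victim] = 1;
    }
    return hits;
}
```

On the trace 0, 1, 0, 2, 0, FIFO evicts line 0 when line 2 arrives even though it was just reused, while LRU evicts line 1 instead, so the last access hits only under LRU.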


In reality, exploiting your application's locality manually will always be more effective and efficient than relying on the extremely simple cache the compiler creates.
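As one possible way to exploit locality manually, here is a minimal sketch of a direct-mapped, software-managed cache of 64-byte lines, written as plain host-side C so it can be run and tested. In a kernel, the arrays would presumably live in local or private memory rather than in a struct like this. The names sw_cache, sw_cache_init and sw_cache_read, and the 16-line capacity, are assumptions for illustration and not part of any SDK.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed for illustration: 64-byte lines, 16 lines, direct-mapped. */
enum { LINE_BYTES = 64, NUM_LINES = 16 };

typedef struct {
    uint8_t  data[NUM_LINES][LINE_BYTES];
    uint32_t tag[NUM_LINES];
    uint8_t  valid[NUM_LINES];
    unsigned misses;   /* number of fetches from backing memory */
} sw_cache;

static void sw_cache_init(sw_cache *c)
{
    memset(c, 0, sizeof *c);
}

/* Return a pointer to the cached copy of line `line_idx` of `mem`,
 * fetching it from backing memory only on a miss. */
static const uint8_t *sw_cache_read(sw_cache *c, const uint8_t *mem,
                                    uint32_t line_idx)
{
    uint32_t slot = line_idx % NUM_LINES;

    assert(c != NULL && mem != NULL);
    if (!c->valid[slot] || c->tag[slot] != line_idx) {
        memcpy(c->data[slot], mem + (size_t)line_idx * LINE_BYTES, LINE_BYTES);
        c->tag[slot] = line_idx;
        c->valid[slot] = 1;
        c->misses++;
    }
    return c->data[slot];
}
```

Because the replacement rule is explicit here, it can be tuned to the known access pattern (for example, a different index function or more lines), which is not possible with the compiler-generated cache.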