Application Acceleration With FPGAs
Programmable Acceleration Cards (PACs), DCP, FPGA AI Suite, Software Stack, and Reference Designs

Eviction policy of burst-coalesced cached non-aligned LSU's cache

PRavi7
Beginner

Hi,

I am implementing an application using OpenCL, targeting an Intel Arria 10 GX 1150.

typedef struct { char data[8]; } block;

block buf;
while (true) {
    global_ptr = some_complex_address_calculation;
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        buf.data[i] = global_ptr[i];
    }
    // ... consume buf, loop body continues ...
}

I'm performing 8 consecutive reads, each of which is a struct of 8 bytes. The compiler converts this into a 64-byte DDR read. Since all iterations of the unrolled loop perform consecutive accesses, the compiler implements a burst-coalesced LSU. The some_complex_address_calculation is such that the application sees good spatial and temporal locality. However, I find that the default cache that comes with the burst-coalesced cached LSU isn't as efficient as I expected, for reasons unknown to me.
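
For reference, here is roughly how that snippet sits inside the kernel (a simplified, self-contained sketch; the kernel name, the pointer argument, and some_complex_address_calculation are placeholders rather than my exact code):

typedef struct { char data[8]; } block;

__kernel void reader(__global const char *restrict src)
{
    block buf;
    while (true) {
        // Placeholder for the real address computation.
        __global const char *global_ptr = src + some_complex_address_calculation;

        #pragma unroll
        for (int i = 0; i < 8; i++) {
            buf.data[i] = global_ptr[i];   // coalesced into one wide DDR read
        }
        // ... consume buf ...
    }
}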

I would appreciate it very much if you could provide more information about the following, with respect to the burst-coalesced cached LSU:

1) What is the cache line size?

2) What is the eviction policy?

3) Does this cache load the (n+1)th block when a read request for the nth block is issued? (Please note that a block here is 64 bytes, so when iteration n of the while loop is reading the nth data block, can the cache pre-request the (n+1)th block?)

 

Thanks in advance

KhaiChein_Y_Intel

Hi,

 

Please allow me some time to check on this.

 

Thanks.

KhaiChein_Y_Intel

Hi,

The cache size depends on the size of the memory that you would like to access. If you are using the built-in calls for loading from and storing to global memory, you specify the cache size using argument #2 of the load built-in. May I know which eviction policy you are referring to?

 

Thanks.

 

HRZ
Valued Contributor III

The details of the cache are not documented anywhere; however, in my experience:

 

1- The cache line size is equal to the width of the coalesced memory port. Moreover, by default the cache has 512 or 1024 lines (I don't remember exactly, since nowadays I always disable the cache to prevent it from wasting precious Block RAMs); see the rough sizing note and the sketch after this list.

2- It is probably something extremely simple such as FIFO or LIFO; LRU in the best case.

3- I am pretty sure the cache doesn't pre-load anything.
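
To put rough numbers on point 1 (an illustrative back-of-the-envelope calculation based on the figures above, not a documented value): with a 64-byte line and 512 to 1024 lines, each cached LSU would keep roughly 64 B x 512 = 32 KiB to 64 B x 1024 = 64 KiB of data on chip, which is why it eats into Block RAM. If you decide you do not want the cache, one way to suppress it (to the best of my knowledge) while keeping the burst-coalesced LSU is to declare the global pointer volatile:

// Sketch only: the kernel name and argument are illustrative. Declaring the
// global pointer volatile asks the compiler not to build the private cache
// for the LSUs that access it; the accesses themselves stay burst-coalesced.
__kernel void reader(__global volatile char *restrict src)
{
    // ... same loop body as in the original snippet ...
}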

 

In reality, exploiting your application's locality manually will always be more effective and efficient than relying on the extremely simple cache the compiler creates.
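
For example, something along these lines (a minimal sketch only; the kernel name, the pointer argument, and some_complex_address_calculation are placeholders, and the right reuse test depends on your actual address pattern): keep the most recently loaded block in registers and only go to DDR when the computed index changes.

typedef struct { char data[8]; } block;

__kernel void reader(__global const block *restrict src)
{
    block buf;
    long cached_idx = -1;   // assumes real block indices are non-negative
    block cached_blk;       // most recently loaded block, held in registers

    while (true) {
        long idx = some_complex_address_calculation;   // placeholder

        if (idx != cached_idx) {
            // Miss: one burst-coalesced read brings in the whole 8-byte block.
            cached_blk = src[idx];
            cached_idx = idx;
        }
        // Hit: reuse the copy already held in registers; no global memory access.
        buf = cached_blk;

        // ... consume buf ...
    }
}

Whether a single-entry buffer like this is enough, or you need a small array of recently used blocks in private or local memory, depends on the reuse distance of your address pattern.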
