Application Acceleration With FPGAs
Programmable Acceleration Cards (PACs), DCP, FPGA AI Suite, Software Stack, and Reference Designs

Eviction policy of burst-coalesced cached non-aligned LSU's cache

PRavi7
Beginner

Hi,

I am implementing an application using OpenCL targeting Intel Arria 10 GX 1150.

typedef struct { char data[8]; } block;

block buf;
while (true) {
    global_ptr = some_complex_address_calculation;
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        buf.data[i] = global_ptr[i];
    }
    // ... consume buf ...
}

I'm performing 8 consecutive reads, each of an 8-byte struct, and the compiler converts this into a 64-byte DDR read. Since all iterations of the unrolled loop access consecutive addresses, the compiler implements a burst-coalesced LSU. The some_complex_address_calculation is such that the application sees pretty good spatial and temporal locality, yet I find that the default cache that comes with the burst-coalesced cached LSU isn't as efficient as I expect, for reasons unknown to me.

I would appreciate it very much if you could provide more information about the following with respect to the burst-coalesced cached LSU:

1) What is the cache line size?

2) What is the eviction policy?

3) Does this cache load the (n+1)th block of data when a read request for the nth block is issued? (Note that a block here is 64 bytes, so when iteration n of the while loop is reading the nth data block, can the cache pre-request the (n+1)th block?)

 

Thanks in advance

KhaiChein_Y_Intel

Hi,

 

Please allow me some time to check on this.

 

Thanks.

KhaiChein_Y_Intel

Hi,

The cache size depends on the size of the memory region you would like to access. If you are using the built-in calls for loading from and storing to global memory, you define the cache size using Argument #2 of the Load built-in. May I know which eviction policy you are referring to?

 

Thanks.

 

HRZ
Valued Contributor III

The details of the cache are not documented anywhere; however, in my experience:

 

1- The cache line size is equal to the size of the coalesced memory port. Moreover, by default the cache has 512 or 1024 lines (I don't remember exactly, since nowadays I always disable the cache to keep it from wasting precious Block RAMs).

2- It is probably something extremely simple like FIFO or LIFO; best-case scenario, LRU.

3- I am pretty sure the cache doesn't pre-load anything.

 

In reality, exploiting your application's locality manually will always be more effective and efficient than relying on the extremely simple cache the compiler creates.
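
For reference, here is a rough, untested sketch of what I mean by exploiting the locality manually with the Intel FPGA SDK for OpenCL. The kernel name, the indices/out arguments, and ON_CHIP_BLOCKS are placeholders I made up for illustration, not your actual kernel; the two ideas it shows are (a) marking the global pointer volatile, which as far as I remember tells the compiler not to generate the private LSU cache at all, and (b) staging a hot region of the data in on-chip local memory (Block RAM) and serving the irregular reads from there:

    #define ON_CHIP_BLOCKS 1024   /* assumed size of the on-chip copy; must fit in Block RAM */

    typedef struct { char data[8]; } block;

    /* 'volatile' on the global pointer stops the offline compiler from building
       its private cache for this LSU, freeing the Block RAM for the manual buffer. */
    __kernel void consumer(volatile __global const block * restrict src,
                           __global const int * restrict indices,  /* stand-in for some_complex_address_calculation */
                           __global char * restrict out,
                           const int n)
    {
        /* Stage the hot region of 'src' once, using sequential burst reads from DDR. */
        __local block on_chip[ON_CHIP_BLOCKS];
        for (int i = 0; i < ON_CHIP_BLOCKS; i++) {
            on_chip[i] = src[i];
        }

        for (int it = 0; it < n; it++) {
            int addr = indices[it];
            block buf;
            if (addr < ON_CHIP_BLOCKS) {
                buf = on_chip[addr];      /* hit: served from Block RAM */
            } else {
                buf = src[addr];          /* miss: falls back to DDR */
            }

            /* Consume the block; a simple checksum here just to keep the sketch complete. */
            char acc = 0;
            #pragma unroll
            for (int i = 0; i < 8; i++) {
                acc += buf.data[i];
            }
            out[it] = acc;
        }
    }

Whether the staging buffer actually pays off depends on how much of your access stream lands inside the staged region, so you would need to size and place it based on what some_complex_address_calculation really touches.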
