I have a local memory declared in my kernel code, which is of type int4 and declared as following
local int4 E
My single work item kernel code accesses this array with 3 reads and 3 writes within a loop with static memory access on the lower dimension. As expected, the compiler banked the memory with 2 banks for parallel access on the lower dimension and double clocked the memory for more read and write ports.
As such the memory is small and as you can see it takes only 512 bytes for implementation. The compiler report in the memory details confirms with a 512 byte implementation of the memory.
However, I'm not able to see how the block RAMs were calculated for the same.
I have attached the memory reports generated by the compiler for this memory, and as you can see the RAMs consumed by this memory is 14 blocks, which is way more than expected.
This was implemented on an arria10 soc development board with M20k RAMs. Accordingly the amount of memory with one M20k RAM comes as 32*512/8 = 2048 bytes. So only one M20k RAM block should have been enough for its implementation
Why does the compiler allot this memory 14 RAM blocks even though it mentions in the report that 512 bytes requested, 512 bytes implemented?
- OpenCL FPGA