local memory bank

Altera_Forum · 05-25-2018

I have read the Best Practices Guide, but I am still confused.

I have optimized the local memory down to 1 read and 1 write.

However, report.html reports that the "w_local" memory uses 64 RAM blocks.

I know the multiply is unrolled 64 times, so I need to fetch 64 values (64 * 16 = 1024 bits) in 1 clock cycle,

but since the local memory is optimized to 1 read and each read fetches 1024 bits, I should be using only 1 RAM block rather than 64, right?


typedef struct{
    short ff;
} filter_trans;
typedef struct{
    filter_trans ww;
} data_trans;
typedef struct{
    filter_trans ww;
} weight_trans;
__kernel(){
    weight_trans w_local;
    data_trans data_in = read_channel_intel(data_ch);
    cont control = read_channel_intel(cont_ch);
    weight_trans get_w = w_local;
    #pragma unroll
    for(int n=0; n<4; n++){
        winograd = 0;
        #pragma unroll
        for(int j=0; j<16; j++){
            winograd += get_w.ww.ff * data_in.ww.ff;
        }
    }
}

"w_local"

Private memory: Optimal

Requested size: 73728 bytes

Implemented size: 131072 bytes

Number of banks: 1

Bank width: 1024 bits

Bank depth: 1024 words

Total replication: 1

Additional information: Requested size 73728 bytes, implemented size 131072 bytes, stall-free, 1 read and 1 write.

- See Best Practices Guide: Local Memory for more information.

Private memory implemented in on-chip block RAM.

Altera_Forum · 05-25-2018

I have also modified the code, but the RAM block usage became even worse: it is now larger than 64.


typedef struct{
    short ff;
} filter_trans;
typedef struct{
    filter_trans ww;
} data_trans;
typedef struct{
    filter_trans ww;
} weight_trans;
__kernel(){
    short __attribute__((numbanks(16),bankwidth(2))) w_local;
    data_trans data_in = read_channel_intel(data_ch);
    cont control = read_channel_intel(cont_ch);
    weight_trans get_w;
    #pragma unroll
    for(int n=0; n<4; n++){
        #pragma unroll
        for(int j=0; j<16; j++){
            get_w.ww.ff = w_local;
        }
    }

    #pragma unroll
    for(int n=0; n<4; n++){
        winograd = 0;
        #pragma unroll
        for(int j=0; j<16; j++){
            winograd += get_w.ww.ff * data_in.ww.ff;
        }
    }
}

Private memory: Optimal

Requested size: 294912 bytes

Implemented size: 524288 bytes

Number of banks: 16 (banked on lowest dimension)

Bank width: 16 bits

Bank depth: 16384 words

Total replication: 1

Additional information: Requested size 294912 bytes, implemented size 524288 bytes, stall-free, 16 reads and 16 writes.

- Banked on lowest dimension into 16 separate banks.

- Reducing accesses to exactly one read and one write for all on-chip memory systems may increase overall system performance.

- See Best Practices Guide: Local Memory for more information.

Private memory implemented in on-chip block RAM.
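
The implemented sizes in both reports can be reproduced from the bank figures. Below is a minimal C sketch, assuming the implemented size is banks × depth × width and that the depth is the requested word count per bank rounded up to a power of two. That rounding rule is inferred from the numbers in these two reports, not taken from the compiler documentation, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next power of two (n must be at least 1). */
size_t round_up_pow2(size_t n) {
    assert(n >= 1);
    size_t p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}

/* Words each bank must hold when the requested bytes are split evenly
 * across the banks. bank_width_bits must be a whole number of bytes. */
size_t words_per_bank(size_t requested_bytes, size_t banks,
                      size_t bank_width_bits) {
    assert(banks >= 1);
    assert(bank_width_bits >= 8 && bank_width_bits % 8 == 0);
    size_t bytes_per_word = bank_width_bits / 8;
    size_t bytes_per_bank = (requested_bytes + banks - 1) / banks;
    return (bytes_per_bank + bytes_per_word - 1) / bytes_per_word;
}

/* ASSUMPTION: depth is padded to a power of two, as the reports suggest. */
size_t implemented_bytes(size_t requested_bytes, size_t banks,
                         size_t bank_width_bits) {
    size_t depth = round_up_pow2(
        words_per_bank(requested_bytes, banks, bank_width_bits));
    return banks * depth * (bank_width_bits / 8);
}
```

With the first report's figures (73728 bytes, 1 bank, 1024-bit width) this gives 576 words padded to 1024, or 131072 bytes. With the second report's figures (294912 bytes, 16 banks, 16-bit width) it gives 9216 words per bank padded to 16384, or 524288 bytes.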

Altera_Forum · 05-25-2018

--- Quote Start ---

I know the multiply is unrolled 64 times, so I need to fetch 64 values (64 * 16 = 1024 bits) in 1 clock cycle,

but since the local memory is optimized to 1 read and each read fetches 1024 bits, I should be using only 1 RAM block rather than 64, right?

--- Quote End ---

No. Even without taking replication into account, your buffer has a size of 576 * 1024 = 589824 bits. With each Block RAM holding 20 Kb, you need roughly 30 blocks just to fit the buffer. Furthermore, each Block RAM has two 32-bit ports, and you cannot read 1024 bits per clock from a 32-bit port. The write port has to be connected to every Block RAM used to implement the buffer, and the 1024-bit read port is split between them, which requires a minimum of 32 Block RAMs to provide enough ports. Adding other overheads (address calculation, routing, etc.), the compiler ends up using 64 Block RAMs. This configuration is optimal and is unlikely to be improvable.
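
The block-count arithmetic in this reply can be checked with a short C sketch. The 32-bit port width comes from the reply, and 20 Kb is taken as 20480 bits, which puts the capacity bound at 29, in line with the figure of 30 quoted above. The 512-word by 32-bit block geometry used with the last function is an assumption about how each Block RAM is configured, not something the report states, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Integer division rounded up. */
size_t ceil_div(size_t a, size_t b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

/* Lower bound from storage alone: blocks needed to hold total_bits. */
size_t blocks_for_capacity(size_t total_bits, size_t block_bits) {
    return ceil_div(total_bits, block_bits);
}

/* Lower bound from port width: blocks needed side by side so that one
 * access of access_bits can be served in a single clock. */
size_t blocks_for_port_width(size_t access_bits, size_t port_bits) {
    return ceil_div(access_bits, port_bits);
}

/* Blocks needed when each block is block_depth words by block_width bits.
 * ASSUMPTION: blocks are tiled in a grid; rows cover the depth and
 * columns cover the width. */
size_t blocks_for_geometry(size_t depth_words, size_t width_bits,
                           size_t block_depth, size_t block_width) {
    return ceil_div(depth_words, block_depth) *
           ceil_div(width_bits, block_width);
}
```

Calling `blocks_for_capacity(576 * 1024, 20480)` gives 29 and `blocks_for_port_width(1024, 32)` gives 32. Under the assumed geometry, `blocks_for_geometry(576, 1024, 512, 32)` gives 2 × 32 = 64, which matches the report. This is only one plausible way to arrive at 64; the reply attributes the gap between 32 and 64 to other overheads.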