
local memory bank

Altera_Forum

I have read the Best Practices Guide, but I am still confused.

 

I have optimized the local memory down to 1 read and 1 write.

However, report.html reports that the "w_local" memory uses 64 RAM blocks.

I know the multiplication is unrolled 64 times, so I need to fetch 64 values (64 * 16 = 1024 bits) in 1 clock cycle. But since the local memory is optimized to 1 read, and each read fetches 1024 bits, I should be using only 1 RAM block rather than 64, right?

 

 

typedef struct{ short ff; } filter_trans;
typedef struct{ filter_trans ww; } data_trans;
typedef struct{ filter_trans ww; } weight_trans;

__kernel(){
    weight_trans w_local;
    data_trans data_in = read_channel_intel(data_ch);
    cont control = read_channel_intel(cont_ch);
    weight_trans get_w = w_local;

    #pragma unroll
    for(int n=0; n<4; n++){
        winograd = 0;
        #pragma unroll
        for(int j=0; j<16; j++){
            winograd += get_w.ww.ff * data_in.ww.ff;
        }
    }
}

 

"w_local" 

Private memory: Optimal 

Requested size: 73728 bytes 

Implemented size: 131072 bytes  

Number of banks: 1 

Bank width: 1024 bits 

Bank depth: 1024 words 

Total replication: 1 

Additional information: Requested size 73728 bytes, implemented size 131072 bytes, stall-free, 1 read and 1 write.  

- See Best Practices Guide: Local Memory for more information. 

Private memory implemented in on-chip block RAM.
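
The sizes in this report can be cross-checked with a few lines of plain C. This is only a sketch. The array dimensions (`ff[16]`, `ww[4]`, 576 entries in `w_local`) are assumptions inferred from the loop bounds and the requested size, since the posted code shows no subscripts. Rounding the depth up to a power of two is also a guess that happens to reproduce the implemented size, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: the subscripts are not shown in the posted code. */
typedef struct { short ff[16]; } filter_trans;        /* 16 x 16-bit values */
typedef struct { filter_trans ww[4]; } weight_trans;  /* 4 x 16 = 64 values */

enum { W_LOCAL_ENTRIES = 576 };  /* assumed: 73728 bytes / 128 bytes per entry */

/* Smallest power of two that is >= n (n must be at least 1). */
static size_t next_pow2(size_t n) {
    assert(n >= 1);
    size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

/* Bytes requested by the source: entries * bytes per entry. */
static size_t requested_bytes(void) {
    return (size_t)W_LOCAL_ENTRIES * sizeof(weight_trans);
}

/* Bytes implemented if the depth is padded to a power of two (assumption). */
static size_t implemented_bytes(void) {
    return next_pow2(W_LOCAL_ENTRIES) * sizeof(weight_trans);
}
```

With 2-byte `short`, one `weight_trans` is 128 bytes (1024 bits), which matches the reported bank width; 576 entries give the requested 73728 bytes, and a padded depth of 1024 words gives the implemented 131072 bytes.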
Altera_Forum

I have also modified the code, but the RAM block usage became even worse: it is now larger than 64.

 

 

typedef struct{ short ff; } filter_trans;
typedef struct{ filter_trans ww; } data_trans;
typedef struct{ filter_trans ww; } weight_trans;

__kernel(){
    short __attribute__((numbanks(16),bankwidth(2))) w_local;
    data_trans data_in = read_channel_intel(data_ch);
    cont control = read_channel_intel(cont_ch);
    weight_trans get_w;

    #pragma unroll
    for(int n=0; n<4; n++){
        #pragma unroll
        for(int j=0; j<16; j++){
            get_w.ww.ff = w_local;
        }
    }

    #pragma unroll
    for(int n=0; n<4; n++){
        winograd = 0;
        #pragma unroll
        for(int j=0; j<16; j++){
            winograd += get_w.ww.ff * data_in.ww.ff;
        }
    }
}

 

Private memory: Optimal

Requested size: 294912 bytes

Implemented size: 524288 bytes

Number of banks: 16 (banked on lowest dimension)

Bank width: 16 bits

Bank depth: 16384 words

Total replication: 1

Additional information: Requested size 294912 bytes, implemented size 524288 bytes, stall-free, 16 reads and 16 writes.

- Banked on lowest dimension into 16 separate banks.

- Reducing accesses to exactly one read and one write for all on-chip memory systems may increase overall system performance.

- See Best Practices Guide: Local Memory for more information.

Private memory implemented in on-chip block RAM.
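
A quick sanity check of the banked report, written as plain C. It is a sketch under assumptions: each block RAM is taken to hold 20,480 bits (20 Kb), and the function names are made up for illustration. It shows that the implemented size is simply banks × depth × width, and that this size alone already calls for far more than 64 blocks.

```c
#include <assert.h>

/* Total implemented bits of a banked memory: banks * depth * width. */
static unsigned long implemented_bits(unsigned long banks,
                                      unsigned long depth_words,
                                      unsigned long width_bits) {
    return banks * depth_words * width_bits;
}

/* Capacity-only lower bound on the number of RAM blocks (ceiling division). */
static unsigned long min_blocks_by_capacity(unsigned long total_bits,
                                            unsigned long bits_per_block) {
    assert(bits_per_block > 0);
    return (total_bits + bits_per_block - 1) / bits_per_block;
}
```

With the reported values (16 banks, 16384 words, 16 bits), `implemented_bits` gives 4,194,304 bits, which is the reported 524288 bytes, and `min_blocks_by_capacity` with 20,480-bit blocks gives at least 205 blocks.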
Altera_Forum

 

--- Quote Start ---  

I know the multiplication is unrolled 64 times, so I need to fetch 64 values (64 * 16 = 1024 bits) in 1 clock cycle. But since the local memory is optimized to 1 read, and each read fetches 1024 bits, I should be using only 1 RAM block rather than 64, right?

--- Quote End ---  

 

 

No. Even without taking replication into account, your buffer is 576 * 1024 = 589824 bits in size; with 20 Kb Block RAMs, you need roughly 30 blocks just to fit the buffer. Furthermore, each Block RAM has two 32-bit ports, and you obviously cannot read 1024 bits per clock from a single 32-bit port. The write port has to be connected to every Block RAM used to implement the buffer, and the 1024-bit read is split across them, which requires a minimum of 1024 / 32 = 32 Block RAMs to provide enough port width. After adding other overheads (address calculation, routing, etc.), the compiler ends up using 64 Block RAMs. This configuration is optimal and is unlikely to be improvable.
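
The bounds in this reply can be worked out in plain C. This is a minimal sketch, assuming 20,480-bit blocks with 32-bit ports. The last function additionally assumes that each block is used as 512 words × 32 bits and that the depth is padded from 576 to the 1024 words shown in the report; that is one plausible way to arrive at 64, not a confirmed description of what the compiler does. All function names are illustrative.

```c
#include <assert.h>

/* Ceiling division for positive integers. */
static unsigned ceil_div(unsigned a, unsigned b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

/* Blocks needed just to hold depth * width_bits bits. */
static unsigned blocks_for_capacity(unsigned depth, unsigned width_bits,
                                    unsigned bits_per_block) {
    return ceil_div(depth * width_bits, bits_per_block);
}

/* Blocks needed side by side to deliver width_bits per clock. */
static unsigned blocks_for_width(unsigned width_bits, unsigned port_bits) {
    return ceil_div(width_bits, port_bits);
}

/* Blocks when each block is block_depth words x port_bits wide (assumed). */
static unsigned blocks_for_geometry(unsigned depth, unsigned width_bits,
                                    unsigned block_depth, unsigned port_bits) {
    return ceil_div(depth, block_depth) * ceil_div(width_bits, port_bits);
}
```

For the buffer discussed here, `blocks_for_capacity(576, 1024, 20480)` gives 29 (30 if 20 Kb is read as 20,000 bits), `blocks_for_width(1024, 32)` gives 32, and `blocks_for_geometry(1024, 1024, 512, 32)` gives 64.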