I have read the Best Practices Guide, but I am still confused.
I have optimized the local memory down to 1 read and 1 write. However, report.html says that the "w_local" memory uses 64 RAM blocks. I know the multiply is unrolled 64 times, so I need to fetch 64 values (64 * 16 = 1024 bits) per clock. But since the local memory is optimized to 1 read, and each read returns 1024 bits, shouldn't it use only 1 RAM block rather than 64?
typedef struct{
    short ff;
} filter_trans;
typedef struct{
    filter_trans ww;
} data_trans;
typedef struct{
    filter_trans ww;
} weight_trans;
__kernel(){
    weight_trans w_local;
    data_trans data_in = read_channel_intel(data_ch);
    cont control = read_channel_intel(cont_ch);
    weight_trans get_w = w_local;
    # pragma unroll
    for(int n=0; n<4; n++){
        winograd = 0;
        # pragma unroll
        for(int j=0; j<16; j++){
            winograd += get_w.ww.ff * data_in.ww.ff;
        }
    }
}
"w_local" Private memory: Optimal
Requested size: 73728 bytes
Implemented size: 131072 bytes
Number of banks: 1
Bank width: 1024 bits
Bank depth: 1024 words
Total replication: 1
Additional information: Requested size 73728 bytes, implemented size 131072 bytes, stall-free, 1 read and 1 write. See Best Practices Guide: Local Memory for more information. Private memory implemented in on-chip block RAM.
2 Replies
I have also modified the code, but the RAM block usage became even worse: it is now larger than 64.
typedef struct{
    short ff;
} filter_trans;
typedef struct{
    filter_trans ww;
} data_trans;
typedef struct{
    filter_trans ww;
} weight_trans;
__kernel(){
    short __attribute__((numbanks(16),bankwidth(2))) w_local;
    data_trans data_in = read_channel_intel(data_ch);
    cont control = read_channel_intel(cont_ch);
    weight_trans get_w;
    # pragma unroll
    for(int n=0; n<4; n++){
        # pragma unroll
        for(int j=0; j<16; j++){
            get_w.ww.ff = w_local;
        }
    }
    # pragma unroll
    for(int n=0; n<4; n++){
        winograd = 0;
        # pragma unroll
        for(int j=0; j<16; j++){
            winograd += get_w.ww.ff * data_in.ww.ff;
        }
    }
}
Private memory: Optimal
Total replication: 1
Number of banks: 16 (banked on lowest dimension)
Bank width: 16 bits
Bank depth: 16384 words
Requested size: 294912 bytes
Implemented size: 524288 bytes
Additional information: Requested size 294912 bytes, implemented size 524288 bytes, stall-free, 16 reads and 16 writes. Banked on lowest dimension into 16 separate banks. Reducing accesses to exactly one read and one write for all on-chip memory systems may increase overall system performance. See Best Practices Guide: Local Memory for more information. Private memory implemented in on-chip block RAM.
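The gap between requested and implemented size in both reports in this thread is consistent with the depth of each bank being rounded up to the next power of two. The C sketch below reproduces the reported numbers under that assumption. The helper names `next_pow2` and `implemented_bytes` are hypothetical and are not part of any Intel tool, and the rounding rule is inferred from the figures, not taken from the compiler documentation.

```c
#include <stdint.h>

/* Smallest power of two that is >= n (n >= 1). */
static uint64_t next_pow2(uint64_t n) {
    uint64_t p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}

/*
 * Implemented size in bytes, ASSUMING each bank's depth (in words) is
 * rounded up to a power of two. This rule is inferred from the report
 * figures quoted in the thread.
 */
static uint64_t implemented_bytes(uint64_t requested_bytes,
                                  uint64_t banks,
                                  uint64_t bank_width_bits) {
    uint64_t bytes_per_word = bank_width_bits / 8;
    uint64_t bytes_per_row = banks * bytes_per_word;
    /* Words each bank must hold, rounded up. */
    uint64_t words_per_bank =
        (requested_bytes + bytes_per_row - 1) / bytes_per_row;
    return banks * next_pow2(words_per_bank) * bytes_per_word;
}
```

With the first report's figures, `implemented_bytes(73728, 1, 1024)` gives 576 words rounded up to 1024, which is 131072 bytes. With the banked version, `implemented_bytes(294912, 16, 16)` gives 9216 words per bank rounded up to 16384, which is 524288 bytes. Both match the reported bank depths and implemented sizes.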
--- Quote Start ---
I know the multiply is unrolled 64 times, so I need to fetch 64 values (64 * 16 = 1024 bits) per clock. But since the local memory is optimized to 1 read, and each read returns 1024 bits, shouldn't it use only 1 RAM block rather than 64?
--- Quote End ---
No. Even without taking replication into account, your buffer is 576 * 1024 = 589824 bits. With each Block RAM holding 20 Kb, you need about 30 blocks just to fit the buffer. Furthermore, each Block RAM has two 32-bit ports, and you cannot read 1024 bits per clock from a 32-bit port. The write port has to be connected to every Block RAM used to implement the buffer, and the 1024-bit read port is split across them, which requires a minimum of 1024 / 32 = 32 Block RAMs to provide enough ports. Adding other overheads (address calculation, routing, etc.), the compiler ends up using 64 Block RAMs. This configuration is optimal and is unlikely to be improvable.
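The two lower bounds in the reply above (capacity and port width) can be checked with a short C sketch. The helper names are hypothetical, and the block parameters are assumptions: a 20 Kb block is taken as 20 * 1024 = 20480 bits, and the 32-bit port width is the figure given in the reply, not a value confirmed for any particular device.

```c
#include <stdint.h>

/* Integer division rounded up. */
static uint64_t ceil_div(uint64_t a, uint64_t b) {
    return (a + b - 1) / b;
}

/* Blocks needed just to store depth_words * width_bits bits of data. */
static uint64_t blocks_for_capacity(uint64_t depth_words,
                                    uint64_t width_bits,
                                    uint64_t block_bits) {
    return ceil_div(depth_words * width_bits, block_bits);
}

/*
 * Blocks needed so that their ports, read in parallel, deliver
 * width_bits in a single clock cycle.
 */
static uint64_t blocks_for_width(uint64_t width_bits, uint64_t port_bits) {
    return ceil_div(width_bits, port_bits);
}
```

Under these assumptions, `blocks_for_capacity(576, 1024, 20480)` gives 29, close to the reply's figure of 30, and `blocks_for_width(1024, 32)` gives 32, so the port-width bound is the tighter one. If each block were used in a 1024-deep by 16-bit configuration (an assumption, since the thread does not name the device), `blocks_for_width(1024, 16)` would give exactly 64, which matches the reported block count.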