Does the number of M20K blocks increase when the compiler automatically banks the memory or increases the bank width?
For example:
int sample[4][32];
int a[4][32];
int b[4][32];

#pragma unroll
for (int j = 0; j < 4; j++) {
    #pragma unroll
    for (int i = 0; i < 32; i++) {
        sample[j][i] = a[j][i] + b[j][i];
    }
}
In one of my programs, which contained a snippet similar to the one above, the M20K usage suddenly increased. The number of RAM blocks allocated is 103, which is quite strange. This happens even for the a and b variables.
The area report gives no indication of why the number of RAM blocks increased:
Requested size: 512 bytes
Implemented size: 512 bytes
Private memory: Optimal
Total replication: 1
Number of banks: 1
Bank width: 4096 bits
Bank depth: 1 word
Additional information: Requested size 512 bytes, implemented size 512 bytes, stall-free, 1 read and 1 write.
Reference: See Best Practices Guide: Local Memory for more information.
Is the increase in RAM blocks due to the type of memory access?
The number of M20K blocks used indeed depends on the number and width of accesses to the local buffer. However, all of this information will be reflected in the report. If the report says the implemented size is 512 bytes but 103 blocks are allocated, these blocks are likely being used in other parts of the circuit, e.g. as buffers between the kernel and the memory interface, as on-chip cache for global memory accesses, as FIFOs in the pipeline, or to implement channels. If you check the "Area analysis by source" part of the area report, you can get a detailed breakdown of where each resource is being used. If you post your full kernel code, I can generate the report and give you more details.
#define DIM_3 2
#define DIM_4 8
#pragma OPENCL EXTENSION cl_intel_channels : enable
typedef char QTYPE;
typedef int HTYPE;
typedef struct {
    QTYPE data[DIM_3];
} group_data;

typedef struct {
    group_data lane[DIM_4];
} group_vec;

typedef struct {
    QTYPE lane[DIM_4];
} group_ch;
channel group_vec data_ch __attribute__((depth(0)));
channel group_vec weight_ch __attribute__((depth(0)));
channel group_ch out_ch __attribute__((depth(0)));
__kernel
__attribute__((task))
__attribute__((max_global_work_dim(0)))
void fetch_data(
    __global group_data *restrict bottom
    )
{
    group_data data_vec;
    group_vec  data_ch_out;

    for(unsigned int win_itm_xyz = 0; win_itm_xyz < 39 * 39 * 4096 / (DIM_3); win_itm_xyz++){
        data_vec = bottom[win_itm_xyz];
        #pragma unroll
        for(unsigned char ll = 0; ll < DIM_4; ll++){
            data_ch_out.lane[ll] = data_vec;
        }
        write_channel_intel(data_ch, data_ch_out);
    }
}
__kernel
__attribute__((task))
__attribute__((max_global_work_dim(0)))
void fetch_weights(
    __global volatile group_vec *restrict weights
    )
{
    group_vec weight_vec;

    for(unsigned int win_itm_xyz = 0; win_itm_xyz < 39 * 39 * 4096 / (DIM_3); win_itm_xyz++){
        weight_vec = weights[win_itm_xyz];
        write_channel_intel(weight_ch, weight_vec);
    }
}
__kernel
__attribute__((task))
__attribute__((max_global_work_dim(0)))
void conv_wino(
    )
{
    group_vec data_vec;
    group_vec weight_vec;
    group_ch  convout;
    HTYPE     conv_out[169][DIM_4];
    group_ch  inv_wino_out[4];
    uint      array_index = 0; // index into conv_out, wraps at 169

    for(uint output = 0; output < 39 * 39; output++) {
        for(unsigned int win_itm_xyz = 0; win_itm_xyz < 4096 / DIM_3; win_itm_xyz++){
            data_vec   = read_channel_intel(data_ch);
            weight_vec = read_channel_intel(weight_ch);
            #pragma unroll
            for(uint i = 0; i < DIM_4; i++) {
                #pragma unroll
                for(uint j = 0; j < DIM_3; j++) {
                    convout.lane[i] += data_vec.lane[i].data[j] * weight_vec.lane[i].data[j];
                }
            }
        }
        #pragma unroll
        for(unsigned char ll_t = 0; ll_t < DIM_4; ll_t++){
            conv_out[array_index][ll_t] = convout.lane[ll_t];
        }
        if (array_index == 169 - 1){
            array_index = 0;
        }
        else {
            array_index++;
        }
    }

    #pragma unroll
    for(unsigned char ll_t = 0; ll_t < DIM_4; ll_t++){
        inv_wino_out[0].lane[ll_t] = conv_out[0][ll_t] + conv_out[1][ll_t] + conv_out[2][ll_t] + conv_out[1][ll_t] + conv_out[5][ll_t] + conv_out[9][ll_t] + conv_out[2][ll_t] + conv_out[6][ll_t] + conv_out[10][ll_t];
        //printf("\n %d ", inv_win_out[0][ll_t]);
        inv_wino_out[1].lane[ll_t] = conv_out[0][ll_t] + conv_out[5][ll_t] + conv_out[9][ll_t] - conv_out[2][ll_t] - conv_out[6][ll_t] - conv_out[10][ll_t] - conv_out[3][ll_t] - conv_out[7][ll_t] - conv_out[11][ll_t];
        //printf("\n %d ", inv_win_out[1][ll_t]);
        inv_wino_out[2].lane[ll_t] = conv_out[4][ll_t] + conv_out[9][ll_t] - conv_out[12][ll_t] + conv_out[5][ll_t] - conv_out[157][ll_t] - conv_out[13][ll_t] + conv_out[6][ll_t] - conv_out[10][ll_t] - conv_out[14][ll_t];
        //printf("\n %d ", inv_win_out[2][ll_t]);
        inv_wino_out[3].lane[ll_t] = conv_out[5][ll_t] - conv_out[16][ll_t] - conv_out[13][ll_t] - conv_out[6][ll_t] + conv_out[10][ll_t] + conv_out[14][ll_t] - conv_out[7][ll_t] + conv_out[11][ll_t] + conv_out[15][ll_t];
        //printf("\n %d ", inv_win_out[3][ll_t]);
    }

    for(unsigned char ll_t = 0; ll_t < 4; ll_t++)
    {
        write_channel_intel(out_ch, inv_wino_out[ll_t]);
    }
}
// Store Data to Global Memory
__kernel
__attribute__((task))
__attribute__((max_global_work_dim(0)))
void WriteBack(
    __global group_ch *restrict top
    )
{
    uint   array_index;
    uchar  index_z_item;  // max value 256
    ushort index_z_group; // max value 4096
    group_ch output;

    for(uint dd = 0; dd < 4; dd++){
        output = read_channel_intel(out_ch);
        top[dd] = output;
        //printf("\n index: %d, Output buffer : %d ", dd, output.lane[ll] );
    }
}
In this code, the local memory allocated to conv_out is 103 RAM blocks. I checked "Area analysis by source", but no line there accounts for the 103 RAM blocks.
I am surprised the report does not correctly reflect the implemented size of the buffer. Anyway, based on the report, the bank width is 2048 bits, while the maximum width of the Block RAM ports is 40 bits. This means a replication factor of at least 52 is required to provide enough ports to implement the buffer. Furthermore, each instance of the buffer is 8192 bytes, which requires 3-4 Block RAMs (depending on the depth) to implement. Since the Block RAMs are double-pumped, the number of required Block RAMs is then halved. I think 103 Block RAMs is a reasonable number in the end.
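To make the arithmetic concrete, here is a rough C sketch that mirrors the estimate above. The constants (a 40-bit maximum port width, 20 Kbit / 2560 bytes per M20K, and a 2x gain from double pumping) are assumptions for illustration only; the compiler's actual mapping may differ.

#include <stdio.h>

/* Back-of-the-envelope M20K estimate mirroring the reasoning above.
   All constants are assumptions, not guaranteed compiler behaviour. */
int main(void)
{
    const unsigned bank_width_bits = 2048; /* from the area report              */
    const unsigned buffer_bytes    = 8192; /* implemented size of one instance  */
    const unsigned port_width_bits = 40;   /* assumed maximum M20K port width   */
    const unsigned m20k_bytes      = 2560; /* assumed 20 Kbit capacity per M20K */

    /* M20Ks stitched side by side to cover the bank width: ceil(2048/40) = 52 */
    unsigned width_factor = (bank_width_bits + port_width_bits - 1) / port_width_bits;

    /* M20Ks needed to hold one instance of the buffer: ceil(8192/2560) = 4 */
    unsigned depth_factor = (buffer_bytes + m20k_bytes - 1) / m20k_bytes;

    /* Double pumping roughly halves the total: 52 * 4 / 2 = 104, close to 103 */
    unsigned estimate = width_factor * depth_factor / 2;

    printf("Estimated M20K blocks: %u\n", estimate);
    return 0;
}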
Can I conclude that every bank gets mapped to one M20K, and that every 40 bits of bank width gets mapped to one M20K?
Actually, I am not sure whether the compiler configures the Block RAMs with a width of 32 bits or 40 bits. Either way, you should also take the size of the buffer into account: the size itself might require more Block RAMs than the minimum number needed to provide the necessary ports for all accesses. Moreover, the type of accesses also matters. Writes need to be connected to all buffer replicas, while reads only need to be connected to one. With double pumping you effectively get 4 ports per Block RAM. For example, for 5 reads and 1 write you need 2 Block RAMs per every 32-bit (or 40-bit) slice of bank width, but for 5 reads and 2 writes you will need 3. To be honest, accurately predicting Block RAM usage is not very straightforward.
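As a small sketch of that rule of thumb (assuming 4 effective ports per double-pumped Block RAM and fewer writes than ports; this only illustrates the estimate, not the compiler's actual algorithm):

#include <stdio.h>

/* Replicas needed per 32-bit (or 40-bit) slice of bank width.
   Every replica must accept every write; the remaining ports serve reads. */
static unsigned replicas_per_slice(unsigned reads, unsigned writes)
{
    const unsigned ports_per_bram = 4;              /* double pumping assumed   */
    unsigned read_ports = ports_per_bram - writes;  /* ports left for reads     */
    return (reads + read_ports - 1) / read_ports;   /* ceil(reads / read_ports) */
}

int main(void)
{
    printf("5 reads, 1 write  -> %u Block RAMs per slice\n", replicas_per_slice(5, 1)); /* 2 */
    printf("5 reads, 2 writes -> %u Block RAMs per slice\n", replicas_per_slice(5, 2)); /* 3 */
    return 0;
}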
