mvemp
Novice
2,240 Views

Does the local memory usage increase with banking and coalescing?

Does the number of M20K blocks increase when the compiler automatically banks the memory or increases the bank width?

 

For example:

int sample[4][32];
int a[4][32];
int b[4][32];

#pragma unroll
for (j = 0; j < 4; j++) {
    #pragma unroll
    for (i = 0; i < 32; i++) {
        sample[j][i] = a[j][i] + b[j][i];
    }
}

 

In one of my programs, which has a snippet similar to the one above, the M20K usage suddenly jumped. 103 RAM blocks are allocated, which is quite strange. The same happens for the a and b variables.

The area report gives no reason for the increase in the number of RAM blocks.

 

Requested size: 512 bytes
Implemented size: 512 bytes
Private memory: Optimal
Total replication: 1
Number of banks: 1
Bank width: 4096 bits
Bank depth: 1 word
Additional information: Requested size 512 bytes, implemented size 512 bytes, stall-free, 1 read and 1 write.
Reference: See Best Practices Guide : Local Memory for more information.

 

 

Is the increase in RAM blocks due to the type of memory access?

5 Replies
HRZ
Valued Contributor II
36 Views

The number of M20K blocks used does depend on the number and width of accesses to the local buffer. However, all of this information is reflected in the report. If the report says the implemented size is 512 bytes but 103 blocks are allocated, these blocks are likely being used in other parts of the circuit, for example as buffers between the kernel and the memory interface, as on-chip cache for global memory accesses, as FIFOs in the pipeline, or to implement channels. The "Area analysis by source" part of the area report gives a detailed breakdown of where each resource is used. If you post your full kernel code, I can generate the report and give you more details.

mvemp
Novice
36 Views

#define DIM_3 2
#define DIM_4 8

#pragma OPENCL EXTENSION cl_intel_channels : enable

typedef char QTYPE;
typedef int HTYPE;

typedef struct {
    QTYPE data[DIM_3];
} group_data;

typedef struct {
    group_data lane[DIM_4];
} group_vec;

typedef struct {
    QTYPE lane[DIM_4];
} group_ch;

channel group_vec data_ch __attribute__((depth(0)));
channel group_vec weight_ch __attribute__((depth(0)));
channel group_ch out_ch __attribute__((depth(0)));

__kernel
__attribute__((task))
__attribute__((max_global_work_dim(0)))
void fetch_data(__global group_data *restrict bottom)
{
    group_data data_vec;
    group_vec data_ch_out;

    for (unsigned int win_itm_xyz = 0; win_itm_xyz < 39 * 39 * 4096 / (DIM_3); win_itm_xyz++) {
        data_vec = bottom[win_itm_xyz];
        #pragma unroll
        for (unsigned char ll = 0; ll < DIM_4; ll++) {
            data_ch_out.lane[ll] = data_vec;
        }
        write_channel_intel(data_ch, data_ch_out);
    }
}

__kernel
__attribute__((task))
__attribute__((max_global_work_dim(0)))
void fetch_weights(__global volatile group_vec *restrict weights)
{
    group_vec weight_vec;

    for (unsigned int win_itm_xyz = 0; win_itm_xyz < 39 * 39 * 4096 / (DIM_3); win_itm_xyz++) {
        weight_vec = weights[win_itm_xyz];
        write_channel_intel(weight_ch, weight_vec);
    }
}

__kernel
__attribute__((task))
__attribute__((max_global_work_dim(0)))
void conv_wino()
{
    group_vec data_vec;
    group_vec weight_vec;
    group_ch convout;
    HTYPE conv_out[169][DIM_4];
    group_ch inv_wino_out[4];
    uint array_index;

    for (uint output = 0; output < 39 * 39; output++) {
        for (unsigned int win_itm_xyz = 0; win_itm_xyz < 4096 / DIM_3; win_itm_xyz++) {
            data_vec = read_channel_intel(data_ch);
            weight_vec = read_channel_intel(weight_ch);
            #pragma unroll
            for (uint i = 0; i < DIM_4; i++) {
                #pragma unroll
                for (uint j = 0; j < DIM_3; j++) {
                    convout.lane[i] += data_vec.lane[i].data[j] * weight_vec.lane[i].data[j];
                }
            }
        }
        #pragma unroll
        for (unsigned char ll_t = 0; ll_t < DIM_4; ll_t++) {
            conv_out[array_index][ll_t] = convout.lane[ll_t];
        }
        if (array_index == 169 - 1) {
            array_index = 0;
        }
        else
            array_index++;
    }

    #pragma unroll
    for (unsigned char ll_t = 0; ll_t < DIM_4; ll_t++) {
        inv_wino_out[0].lane[ll_t] = conv_out[0][ll_t] + conv_out[1][ll_t] + conv_out[2][ll_t]
                                   + conv_out[1][ll_t] + conv_out[5][ll_t] + conv_out[9][ll_t]
                                   + conv_out[2][ll_t] + conv_out[6][ll_t] + conv_out[10][ll_t];
        //printf("\n %d ", inv_win_out[0][ll_t]);
        inv_wino_out[1].lane[ll_t] = conv_out[0][ll_t] + conv_out[5][ll_t] + conv_out[9][ll_t]
                                   - conv_out[2][ll_t] - conv_out[6][ll_t] - conv_out[10][ll_t]
                                   - conv_out[3][ll_t] - conv_out[7][ll_t] - conv_out[11][ll_t];
        //printf("\n %d ", inv_win_out[1][ll_t]);
        inv_wino_out[2].lane[ll_t] = conv_out[4][ll_t] + conv_out[9][ll_t] - conv_out[12][ll_t]
                                   + conv_out[5][ll_t] - conv_out[157][ll_t] - conv_out[13][ll_t]
                                   + conv_out[6][ll_t] - conv_out[10][ll_t] - conv_out[14][ll_t];
        //printf("\n %d ", inv_win_out[2][ll_t]);
        inv_wino_out[3].lane[ll_t] = conv_out[5][ll_t] - conv_out[16][ll_t] - conv_out[13][ll_t]
                                   - conv_out[6][ll_t] + conv_out[10][ll_t] + conv_out[14][ll_t]
                                   - conv_out[7][ll_t] + conv_out[11][ll_t] + conv_out[15][ll_t];
        //printf("\n %d ", inv_win_out[3][ll_t]);
    }

    for (unsigned char ll_t = 0; ll_t < 4; ll_t++) {
        write_channel_intel(out_ch, inv_wino_out[ll_t]);
    }
}

// Store Data to Global Memory
__kernel
__attribute__((task))
__attribute__((max_global_work_dim(0)))
void WriteBack(__global group_ch *restrict top)
{
    uint array_index;
    uchar index_z_item;   // max value 256
    ushort index_z_group; // max value 4096
    group_ch output;

    for (uint dd = 0; dd < 4; dd++) {
        output = read_channel_intel(out_ch);
        top[dd] = output;
        //printf("\n index: %d, Output buffer : %d ", dd, output.lane[ll] );
    }
}

In this code, 103 RAM blocks are allocated to conv_out. I checked "Area analysis by source", but no line there accounts for the 103 RAM blocks.

HRZ
Valued Contributor II
36 Views

I am surprised the report is not correctly reflecting the implemented size of the buffer. Anyway, based on the report, the bank width is 2048 bits, while the maximum width of the Block RAM ports is 40 bits. This means that a replication factor of at least 52 is required to provide enough ports to implement the buffer. Furthermore, each instance of the buffer is 8192 bytes which requires 3-4 Block RAMs (depending on the depth) to implement. Since the Block RAMs are double-pumped, the number of required Block RAMs will then be halved. I think 103 Block RAMs in the end is a reasonable number.
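The estimate above can be written out as a back-of-the-envelope calculation. The sketch below is only one reading of that reasoning, not the compiler's actual allocation algorithm. The function names `ceil_div` and `estimate_m20k` are hypothetical, the 2048-bit bank width, 40-bit port width and 8192-byte buffer size come from the reply, and the 20480-bit capacity is the nominal size of one M20K block.

```c
/* Integer ceiling division for positive operands. */
unsigned ceil_div(unsigned a, unsigned b) {
    return (a + b - 1) / b;
}

/* Rough Block RAM estimate that mirrors the reasoning in the reply.
 * Hypothetical helper: this is NOT how the compiler computes the figure. */
unsigned estimate_m20k(unsigned bank_width_bits,  /* from the report       */
                       unsigned port_width_bits,  /* 40 (or possibly 32)   */
                       unsigned buffer_bytes,     /* size of one instance  */
                       unsigned m20k_bits)        /* 20480 bits per M20K   */
{
    /* Instances needed so that the ports cover the full bank width. */
    unsigned instances = ceil_div(bank_width_bits, port_width_bits);

    /* Blocks needed to hold one instance of the buffer (rounded up). */
    unsigned blocks_per_instance = ceil_div(buffer_bytes * 8, m20k_bits);

    /* Double-pumping halves the number of blocks required. */
    return ceil_div(instances * blocks_per_instance, 2);
}
```

Under these assumptions, `estimate_m20k(2048, 40, 8192, 20480)` evaluates to 52 * 4 / 2 = 104, which is close to the 103 blocks reported. The gap is plausible given that each instance needs "3-4" blocks depending on the depth.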

mvemp
Novice
36 Views

Can I conclude that every bank gets mapped to one M20K, and that every 40 bits of bank width gets mapped to one M20K?

HRZ
Valued Contributor II
36 Views

Actually, I am not sure whether the compiler configures the Block RAMs with a width of 32 bits or 40 bits. Either way, you should also take the size of the buffer into account (the size itself might require more Block RAMs than the minimum number needed to provide enough ports for all accesses). Moreover, the type of access also matters. Writes need to be connected to all buffer replicas, while reads need to be connected to only one. With double-pumping you effectively get 4 ports per Block RAM. For example, for 5 reads and 1 write, you need 2 Block RAMs for every 32 bits (or 40 bits) of bank width, but for 5 reads and 2 writes you will need 3. To be honest, accurate prediction of Block RAM usage is not very straightforward.
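The port-counting rule in this reply can be sketched as follows. This is a simplified model under the assumptions stated there: 4 ports per double-pumped Block RAM, every write occupying a port on every replica, and each read occupying a port on a single replica. The name `replicas_for_ports` is hypothetical, and the real compiler may decide differently.

```c
/* Ports available on one double-pumped Block RAM (assumption from the reply). */
#define PORTS_PER_BLOCK 4

/* Replicas needed per 32-bit (or 40-bit) slice of bank width.
 * Returns 0 when the writes alone use up every port, which this
 * simple model cannot satisfy. */
unsigned replicas_for_ports(unsigned reads, unsigned writes)
{
    if (writes >= PORTS_PER_BLOCK) {
        return 0;
    }
    /* Each replica loses one port per write; the rest can serve reads. */
    unsigned read_ports = PORTS_PER_BLOCK - writes;

    /* Round up so that every read gets its own port on some replica. */
    unsigned replicas = (reads + read_ports - 1) / read_ports;

    /* At least one block is needed even when there are no reads. */
    return replicas > 0 ? replicas : 1;
}
```

With this model, `replicas_for_ports(5, 1)` gives 2 and `replicas_for_ports(5, 2)` gives 3, matching the two cases quoted in the reply. The buffer size can still push the real count higher, as noted above.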
