Honored Contributor I

undesired BRAM replication

I am using a single work-item kernel to do matrix multiplication, and my BRAM usage exploded (estimated at over 100% BRAM usage, versus only 16% for DSPs).



#define MAT_A_ROWS 128
#define MAT_A_COLS 64
#define MAT_B_COLS 128
#define BLOCK_SIZE 16

__kernel void matrix_mult(__global float *restrict matA, __global float *restrict matB, __global float *restrict matC) {

    __local float __attribute__((num_banks(BLOCK_SIZE), bandwidth(1))) A[BLOCK_SIZE][BLOCK_SIZE];
    __local float __attribute__((num_banks(BLOCK_SIZE), bandwidth(BLOCK_SIZE))) B[BLOCK_SIZE][BLOCK_SIZE];
    // C: local accumulator block (declaration omitted here)

    for (int k = 0; k < MAT_A_COLS / BLOCK_SIZE; k++) {

        // load one block of matA into A
        for (int i = 0; i < BLOCK_SIZE; i++) {
            for (int j = 0; j < BLOCK_SIZE; j++) {
                A[i][j] = matA[.....];
            }
        }

        // load one block of matB into B
        for (int i = 0; i < BLOCK_SIZE; i++) {
            for (int j = 0; j < BLOCK_SIZE; j++) {
                B[i][j] = matB[.....];
            }
        }

        // multiply the two blocks and accumulate into C
        for (int i = 0; i < BLOCK_SIZE; i++) {
            for (int j = 0; j < BLOCK_SIZE; j++) {
                float running_sum = 0;
                // note: this inner k shadows the block-loop k above
                for (int k = 0; k < BLOCK_SIZE; k++) {
                    running_sum += A[i][k] * B[k][j];
                }
                C[i][j] += running_sum;
            }
        }
    }

    // write C to matC
}


According to the report, there are 8 threads being pipelined in the loop "for(int k = 0; k < MAT_A_COLS / BLOCK_SIZE; k++)", so my A and B are replicated 7 times. Is it possible to prevent the memory from being replicated?


Any advice would be greatly appreciated! 

Lancer Chiang
5 Replies
Honored Contributor I

Please post your full compilation report or full kernel. Assuming that this is the type of replication the compiler performs to allow a certain number of parallel iterations, you can control this level of parallelism by using the "#pragma max_concurrency" before the loop.
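
To make the placement concrete, here is a minimal sketch of where the pragma goes, assuming the replication comes from overlapping iterations of the block loop. The kernel name block_sum, the buffer buf and the value 1 are hypothetical and chosen only for illustration; the original kernel would carry the pragma on its own outer k loop. The small shim at the top is also an assumption, added so the same source builds with a host C compiler (where the pragma is simply ignored) for a functional check.

```c
/* Host-build shim (assumption, for functional testing only): outside an
   OpenCL compiler the address-space qualifiers are defined away and the
   unknown pragma below is ignored. */
#ifndef __OPENCL_VERSION__
#define __kernel
#define __global
#define __local
#endif

#define BLOCK_SIZE 16
#define NUM_BLOCKS 4 /* MAT_A_COLS / BLOCK_SIZE = 64 / 16 in the original post */

/* Hypothetical kernel: sums NUM_BLOCKS blocks of BLOCK_SIZE floats,
   staging each block in a local buffer first. */
__kernel void block_sum(__global const float *restrict in,
                        __global float *restrict out) {
    __local float buf[BLOCK_SIZE];
    float total = 0.0f;

    /* Allow at most one iteration of the block loop in flight, so the
       compiler should not need private copies of buf for overlapping
       iterations. Larger values trade BRAM for throughput. */
    #pragma max_concurrency 1
    for (int k = 0; k < NUM_BLOCKS; k++) {
        for (int i = 0; i < BLOCK_SIZE; i++) {
            buf[i] = in[k * BLOCK_SIZE + i]; /* load one block */
        }
        for (int i = 0; i < BLOCK_SIZE; i++) {
            total += buf[i];                 /* consume the block */
        }
    }
    out[0] = total;
}
```

Whether the replication factor actually drops has to be confirmed in the compilation report, since that depends on the compiler version and on how the loops are scheduled.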

Honored Contributor I

Hi HRZ, thanks for your reply! That's good advice! (I did not know I could add this, haha.)

Honored Contributor I

Hi HRZ, here is the report for one of my local memories (data):

Local memory: Potentially inefficient configuration. 

Requested size 65536 bytes (rounded up to nearest power of 2), implemented size 458752 bytes, replicated 7 times total, stallable, 64 reads and 1 write. Additional information: 

- Reduce the number of write accesses or fix banking to make this memory system stall-free. Banking may be improved by using compile-time known indexing on lowest array dimension. 

- Replicated 7 times to create private copies for simultaneous execution of 7 threads in the loop containing accesses to the array. 

- Banked on lowest dimension into 64 separate banks (this is a good thing). 


I don't understand what the seven threads are.
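
The sizes in the report are consistent with the replication factor it states: the implemented size is the requested size multiplied by the number of copies. A minimal check of that arithmetic, with a hypothetical helper name:

```c
#include <stddef.h>

/* Hypothetical helper: total on-chip bytes when a buffer of
   requested_bytes is kept in `replicas` private copies. */
size_t implemented_bytes(size_t requested_bytes, size_t replicas) {
    return requested_bytes * replicas;
}
```

With the values from the report, implemented_bytes(65536, 7) gives 458752, which matches the "implemented size" line.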
Honored Contributor I

The latency of accesses to multi-ported on-chip buffers is more than one cycle. The compiler therefore has to replicate any buffer accessed in the loop further, so that the loop iterations in flight in the pipeline can each access a different copy of the buffer in parallel; this is what keeps the loop fully pipelined with an initiation interval of one. The "threads" in the report are these in-flight iterations. If the "#pragma max_concurrency" I mentioned above does not reduce the replication factor (e.g. #pragma max_concurrency 2), then using #pragma ii might (e.g. #pragma ii 3). Note that all of these come at the cost of lower performance, probably MUCH lower performance.
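
A rough way to see why both pragmas can reduce the number of copies is the simplified model below. It assumes that a new iteration starts every ii cycles and that each iteration keeps its copy of the buffer busy for a fixed number of cycles, so about ceil(latency / ii) iterations overlap. The function names and the 7-cycle latency used in the usage note are hypothetical; the real figure is decided by the compiler's scheduler and is not given in this thread.

```c
/* Simplified model (assumption): iterations that overlap in the pipeline
   when one starts every `ii` cycles and each occupies the buffer for
   `latency_cycles` cycles. Each overlapping iteration needs its own copy. */
unsigned copies_needed(unsigned latency_cycles, unsigned ii) {
    return (latency_cycles + ii - 1) / ii; /* integer ceiling */
}

/* Same model with an upper bound on iterations in flight, as imposed by
   #pragma max_concurrency N. */
unsigned copies_with_cap(unsigned latency_cycles, unsigned ii,
                         unsigned max_concurrency) {
    unsigned n = copies_needed(latency_cycles, ii);
    return n < max_concurrency ? n : max_concurrency;
}
```

Under this model, a made-up latency of 7 cycles at an initiation interval of 1 needs 7 copies, raising the interval to 3 brings that down to 3, and capping concurrency at 2 brings it down to 2. The throughput cost is visible in the same numbers: fewer iterations overlap, so the loop takes longer to finish.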

Honored Contributor I

Hi HRZ, I have had some problems in my design that I have not been able to solve for a very long time. Could you please help me a bit?