Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)

undesired BRAM replication

Honored Contributor II

I am using a single work-item kernel to do matrix multiplication, and my BRAM usage exploded (estimated 100+% BRAM usage versus only 16% for DSPs).



#define MAT_A_ROWS 128
#define MAT_A_COLS 64
#define MAT_B_COLS 128
#define BLOCK_SIZE 16

__kernel void matrix_mult(__global float *restrict matA, __global float *restrict matB, __global float *restrict matC) {

    __local float __attribute__((num_banks(BLOCK_SIZE), bandwidth(1))) A[BLOCK_SIZE][BLOCK_SIZE];
    __local float __attribute__((num_banks(BLOCK_SIZE), bandwidth(BLOCK_SIZE))) B[BLOCK_SIZE][BLOCK_SIZE];
    // (declaration of the C accumulator tile omitted)

    for (int k = 0; k < MAT_A_COLS / BLOCK_SIZE; k++) {

        // load one tile of matA into A
        for (int i = 0; i < BLOCK_SIZE; i++) {
            for (int j = 0; j < BLOCK_SIZE; j++) {
                A[i][j] = matA[.....];
            }
        }

        // load one tile of matB into B
        for (int i = 0; i < BLOCK_SIZE; i++) {
            for (int j = 0; j < BLOCK_SIZE; j++) {
                B[i][j] = matB[.....];
            }
        }

        // multiply the two tiles and accumulate into C
        for (int i = 0; i < BLOCK_SIZE; i++) {
            for (int j = 0; j < BLOCK_SIZE; j++) {
                float running_sum = 0;
                for (int k = 0; k < BLOCK_SIZE; k++) {
                    running_sum += A[i][k] * B[k][j];
                }
                C[i][j] += running_sum;
            }
        }
    }

    // write C to matC
}

According to the report, 8 threads are being pipelined from the loop "for (int k = 0; k < MAT_A_COLS / BLOCK_SIZE; k++)", so my A and B are replicated 7 times. Is it possible to prevent the memory from being replicated?


Any advice would be greatly appreciated! 

Lancer Chiang
5 Replies
Honored Contributor II

Please post your full compilation report or full kernel. Assuming that this is the type of replication the compiler performs to allow a certain number of parallel iterations, you can control this level of parallelism by using the "#pragma max_concurrency" before the loop.
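
To illustrate where the pragma goes, here is a minimal sketch, not the original poster's kernel. The kernel name tile_sum, the sizes TILE and NUM_TILES, and the limit of 2 are all hypothetical. The shim at the top is only there so the same text also builds with a plain C compiler, which simply ignores the unknown pragma.

```c
/* Host-compiler shim: an OpenCL compiler defines __OPENCL_VERSION__
   and skips this block; a plain C compiler sees empty qualifiers. */
#ifndef __OPENCL_VERSION__
#define __kernel
#define __global
#define __local
#endif

#define TILE 16      /* hypothetical tile size */
#define NUM_TILES 4  /* hypothetical outer-loop trip count */

__kernel void tile_sum(__global const float *restrict in,
                       __global float *restrict out)
{
    /* Local buffer that is written and read inside the outer loop. */
    __local float buf[TILE];

    /* Ask the compiler to keep at most 2 iterations of this loop in
       flight; the value 2 is an arbitrary example. */
    #pragma max_concurrency 2
    for (int t = 0; t < NUM_TILES; t++) {
        /* Fill the buffer with one tile of the input. */
        for (int i = 0; i < TILE; i++) {
            buf[i] = in[t * TILE + i];
        }

        /* Reduce the tile to a single value. */
        float sum = 0.0f;
        for (int i = 0; i < TILE; i++) {
            sum += buf[i];
        }
        out[t] = sum;
    }
}
```

The pragma sits directly before the loop that contains the accesses to the local buffer, which in the kernel above would be the outer k loop. Whether the replication factor actually drops should be checked in the compilation report.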

Honored Contributor II

Hi HRZ, thanks for your reply! That's good advice! (I did not know I could add this, haha.)

Honored Contributor II

Hi HRZ, here is the report for one of my local memories (data):

Local memory: Potentially inefficient configuration. 

Requested size 65536 bytes (rounded up to nearest power of 2), implemented size 458752 bytes, replicated 7 times total, stallable, 64 reads and 1 write. Additional information: 

- Reduce the number of write accesses or fix banking to make this memory system stall-free. Banking may be improved by using compile-time known indexing on lowest array dimension. 

- Replicated 7 times to create private copies for simultaneous execution of 7 threads in the loop containing accesses to the array. 

- Banked on lowest dimension into 64 separate banks (this is a good thing). 


I don't understand what the seven threads are.
Honored Contributor II

The latency of accesses to multi-ported on-chip buffers is more than one cycle. To keep the loop fully pipelined with an initiation interval of one, the compiler therefore replicates any buffer that is accessed in the loop, so that the loop iterations in flight in the pipeline can each access their own copy of the buffer in parallel. The "threads" in the report are these in-flight iterations. If the "#pragma max_concurrency" I mentioned above does not reduce this replication factor (e.g. "#pragma max_concurrency 2"), then using "#pragma ii" might (e.g. "#pragma ii 3"). Note that all of these come at the cost of lower performance, probably MUCH lower performance.
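
The numbers in the report quoted earlier in the thread fit this explanation: 65536 bytes requested, times 7 copies, gives the 458752 bytes implemented. As a rough sketch, assuming one private copy per in-flight iteration (an illustrative model, not a documented compiler formula; the function name replicated_bytes is hypothetical), the footprint can be estimated like this:

```c
#include <stddef.h>

/* Estimated on-chip footprint of a local buffer when the compiler keeps
   `copies` private copies of it. The one-copy-per-in-flight-iteration
   model is an assumption made for illustration only. */
size_t replicated_bytes(size_t requested_bytes, unsigned copies)
{
    return requested_bytes * copies;
}
```

With the report's figures, replicated_bytes(65536, 7) gives 458752. Under the same assumption, limiting the loop to 2 concurrent iterations would give replicated_bytes(65536, 2) = 131072 bytes, but the real figure depends on what the compiler reports after recompilation.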

Honored Contributor II

Hi HRZ, I have some problems I could not solve in my design for a very long time, could you please help me a bit?