Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)

undesired BRAM replication

Altera_Forum
Honored Contributor II

I am using a single work-item kernel to do matrix multiplication, and my BRAM usage exploded (estimated 100+% BRAM usage, while DSP usage is only 16%). 

 

============================================================================================== 

#define MAT_A_ROWS 128
#define MAT_A_COLS 64
#define MAT_B_ROWS MAT_A_COLS
#define MAT_B_COLS 128

#define BLOCK_SIZE 16

__kernel
__attribute__((task))
void matrix_mult(__global float *restrict matA, __global float *restrict matB, __global float *restrict matC) {

    // on-chip tiles of A, B and the partial result block of C
    __local float __attribute__((numbanks(BLOCK_SIZE), bankwidth(1))) A[BLOCK_SIZE][BLOCK_SIZE];
    __local float __attribute__((numbanks(BLOCK_SIZE), bankwidth(BLOCK_SIZE))) B[BLOCK_SIZE][BLOCK_SIZE];
    __local float C[BLOCK_SIZE][BLOCK_SIZE];

    for (int k = 0; k < MAT_A_COLS / BLOCK_SIZE; k++) {

        // load one BLOCK_SIZE x BLOCK_SIZE tile of A
        for (int i = 0; i < BLOCK_SIZE; i++) {
            for (int j = 0; j < BLOCK_SIZE; j++) {
                A[i][j] = matA[.....];
            }
        }

        // load one BLOCK_SIZE x BLOCK_SIZE tile of B
        for (int i = 0; i < BLOCK_SIZE; i++) {
            for (int j = 0; j < BLOCK_SIZE; j++) {
                B[i][j] = matB[.....];
            }
        }

        // multiply the two tiles and accumulate into the C block
        for (int i = 0; i < BLOCK_SIZE; i++) {
            for (int j = 0; j < BLOCK_SIZE; j++) {
                float running_sum = 0;
                for (int kk = 0; kk < BLOCK_SIZE; kk++) {
                    running_sum += A[i][kk] * B[kk][j];
                }
                C[i][j] += running_sum;
            }
        }
    }

    ......

    // write the C block back to matC
}

====================================================== 

According to the report, there are 8 threads being pipelined from the loop "for (int k = 0; k < MAT_A_COLS / BLOCK_SIZE; k++)", so my A and B are replicated 7 times. Is it possible to prevent the memory from being replicated? 

 

Any advice would be greatly appreciated! 

Lancer Chiang
5 Replies
Altera_Forum
Honored Contributor II

Please post your full compilation report or full kernel. Assuming that this is the type of replication the compiler performs to allow a certain number of parallel iterations, you can control this level of parallelism by using the "#pragma max_concurrency" before the loop.
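For example, something along these lines (the concurrency value here is just an illustration, not a recommendation):

// Hypothetical example: cap how many iterations of the block loop
// can be in flight at once, which also caps how many private copies
// of A and B the compiler creates for them.
#pragma max_concurrency 1
for (int k = 0; k < MAT_A_COLS / BLOCK_SIZE; k++) {
    // ... load the A and B tiles, multiply, accumulate into C ...
}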

Altera_Forum
Honored Contributor II

Hi HRZ, thanks for your reply! That's good advice! (I did not know I could add this, haha)

Altera_Forum
Honored Contributor II

Hi HRZ, here is the report for one of my local memories: 

 

conv.cl:149 (data): 

Local memory: Potentially inefficient configuration. 

Requested size 65536 bytes (rounded up to nearest power of 2), implemented size 458752 bytes, replicated 7 times total, stallable, 64 reads and 1 write. Additional information: 

- Reduce the number of write accesses or fix banking to make this memory system stall-free. Banking may be improved by using compile-time known indexing on lowest array dimension. 

- Replicated 7 times to create private copies for simultaneous execution of 7 threads in the loop containing accesses to the array. 

- Banked on lowest dimension into 64 separate banks (this is a good thing). 

 

I don't understand what the seven threads are.
Altera_Forum
Honored Contributor II

The latency of accesses to multi-ported on-chip buffers is more than one cycle; hence, the compiler has to further replicate a buffer that is accessed in a loop so that the loop iterations in flight in the pipeline can access different copies of the same buffer in parallel, which keeps the loop correctly fully pipelined with an initiation interval of one. The "7 threads" in the report are those in-flight iterations. If the "#pragma max_concurrency" I mentioned above does not reduce this replication factor (e.g. #pragma max_concurrency 2), then using #pragma ii might (e.g. #pragma ii 3). Note that all of these come at the cost of lower performance, probably MUCH lower performance.
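For reference, it is also placed immediately before the loop; the value here is only an example and directly trades throughput for less replication:

// Illustrative only: relax the initiation interval of the block loop.
// A larger II generally means fewer in-flight iterations, hence fewer
// private copies of A and B, but also lower throughput.
#pragma ii 3
for (int k = 0; k < MAT_A_COLS / BLOCK_SIZE; k++) {
    // ... tile loads and block multiply ...
}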

Altera_Forum
Honored Contributor II

Hi HRZ, there are some problems in my design that I have not been able to solve for a very long time. Could you please help me a bit?
