Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)

undesired BRAM replication

Altera_Forum
Honored Contributor II

I am using a single work-item kernel to do matrix multiplication, and my BRAM usage exploded (estimated 100+% BRAM usage, while DSP usage is only 16%). 

 

============================================================================================== 

#define MAT_A_ROWS 128
#define MAT_A_COLS 64
#define MAT_B_ROWS MAT_A_COLS
#define MAT_B_COLS 128

#define BLOCK_SIZE 16

__kernel
__attribute__((task))
void matrix_mult(__global float *restrict matA, __global float *restrict matB, __global float *restrict matC) {

    // on-chip tiles of A, B and the partial result block of C
    __local float __attribute__((numbanks(BLOCK_SIZE), bankwidth(1))) A[BLOCK_SIZE][BLOCK_SIZE];
    __local float __attribute__((numbanks(BLOCK_SIZE), bankwidth(BLOCK_SIZE))) B[BLOCK_SIZE][BLOCK_SIZE];
    __local float C[BLOCK_SIZE][BLOCK_SIZE];

    for (int k = 0; k < MAT_A_COLS / BLOCK_SIZE; k++) {

        // load one BLOCK_SIZE x BLOCK_SIZE tile of A
        for (int i = 0; i < BLOCK_SIZE; i++) {
            for (int j = 0; j < BLOCK_SIZE; j++) {
                A[i][j] = matA[.....];
            }
        }

        // load one BLOCK_SIZE x BLOCK_SIZE tile of B
        for (int i = 0; i < BLOCK_SIZE; i++) {
            for (int j = 0; j < BLOCK_SIZE; j++) {
                B[i][j] = matB[.....];
            }
        }

        // multiply the two tiles and accumulate into the C block
        for (int i = 0; i < BLOCK_SIZE; i++) {
            for (int j = 0; j < BLOCK_SIZE; j++) {
                float running_sum = 0;
                for (int kk = 0; kk < BLOCK_SIZE; kk++) {
                    running_sum += A[i][kk] * B[kk][j];
                }
                C[i][j] += running_sum;
            }
        }
    }

    ......

    // write the C block back to matC
}

====================================================== 

According to the report, there are 8 threads being pipelined from the loop "for (int k = 0; k < MAT_A_COLS / BLOCK_SIZE; k++)", so my A and B are replicated 7 times. Is it possible to prevent the memory from being replicated? 

 

Any advice would be greatly appreciated! 

Lancer Chiang
5 Replies
Altera_Forum
Honored Contributor II

Please post your full compilation report or full kernel. Assuming that this is the type of replication the compiler performs to allow a certain number of parallel iterations, you can control this level of parallelism by using the "#pragma max_concurrency" before the loop.
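For example, something along these lines (the concurrency value here is just an illustration, not a recommendation):

// Hypothetical example: cap how many iterations of the block loop
// can be in flight at once, which also caps how many private copies
// of A and B the compiler creates for them.
#pragma max_concurrency 1
for (int k = 0; k < MAT_A_COLS / BLOCK_SIZE; k++) {
    // ... load the A and B tiles, multiply, accumulate into C ...
}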

Altera_Forum
Honored Contributor II

Hi HRZ, thanks for your reply! That's good advice! (I did not know I could add this, haha)

Altera_Forum
Honored Contributor II

Hi HRZ, here is the report for one of my local memories: 

 

conv.cl:149 (data): 

Local memory: Potentially inefficient configuration. 

Requested size 65536 bytes (rounded up to nearest power of 2), implemented size 458752 bytes, replicated 7 times total, stallable, 64 reads and 1 write. Additional information: 

- Reduce the number of write accesses or fix banking to make this memory system stall-free. Banking may be improved by using compile-time known indexing on lowest array dimension. 

- Replicated 7 times to create private copies for simultaneous execution of 7 threads in the loop containing accesses to the array. 

- Banked on lowest dimension into 64 separate banks (this is a good thing). 

 

I don't understand what the seven threads are.
Altera_Forum
Honored Contributor II

The latency of accesses to multi-ported on-chip buffers is more than one cycle; hence, the compiler has to further replicate a buffer that is accessed in a loop so that the loop iterations in flight in the pipeline can access different copies of the same buffer in parallel, which keeps the loop correctly fully pipelined with an initiation interval of one. The "7 threads" in the report are those in-flight iterations. If the "#pragma max_concurrency" I mentioned above does not reduce this replication factor (e.g. #pragma max_concurrency 2), then using #pragma ii might (e.g. #pragma ii 3). Note that all of these come at the cost of lower performance, probably MUCH lower performance.
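For reference, it is also placed immediately before the loop; the value here is only an example and directly trades throughput for less replication:

// Illustrative only: relax the initiation interval of the block loop.
// A larger II generally means fewer in-flight iterations, hence fewer
// private copies of A and B, but also lower throughput.
#pragma ii 3
for (int k = 0; k < MAT_A_COLS / BLOCK_SIZE; k++) {
    // ... tile loads and block multiply ...
}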

Altera_Forum
Honored Contributor II

Hi HRZ, there are some problems in my design that I have not been able to solve for a very long time. Could you please help me a bit?
