I am using a single work-item kernel to do matrix multiplication, and my BRAM usage has exploded (estimated at over 100% BRAM usage, versus only 16% for DSPs).
==============================================================================================
#define MAT_A_ROWS 128
#define MAT_A_COLS 64
#define MAT_B_ROWS MAT_A_COLS
#define MAT_B_COLS 128
#define BLOCK_SIZE 16

__kernel
__attribute__((task))
void matrix_mult(__global float *restrict matA,
                 __global float *restrict matB,
                 __global float *restrict matC)
{
    __local float __attribute__((num_banks(BLOCK_SIZE), bandwidth(1))) A[BLOCK_SIZE][BLOCK_SIZE];
    __local float __attribute__((num_banks(BLOCK_SIZE), bandwidth(BLOCK_SIZE))) B[BLOCK_SIZE][BLOCK_SIZE];
    __local float C[BLOCK_SIZE][BLOCK_SIZE];

    for (int k = 0; k < MAT_A_COLS / BLOCK_SIZE; k++) {
        // load one block of A into local memory
        for (int i = 0; i < BLOCK_SIZE; i++) {
            for (int j = 0; j < BLOCK_SIZE; j++) {
                A[i][j] = matA[.....];
            }
        }
        // load one block of B into local memory
        for (int i = 0; i < BLOCK_SIZE; i++) {
            for (int j = 0; j < BLOCK_SIZE; j++) {
                B[i][j] = matB[.....];
            }
        }
        // multiply the two blocks and accumulate into C
        for (int i = 0; i < BLOCK_SIZE; i++) {
            for (int j = 0; j < BLOCK_SIZE; j++) {
                float running_sum = 0;
                // note: this k shadows the outer block index k
                for (int k = 0; k < BLOCK_SIZE; k++) {
                    running_sum += A[i][k] * B[k][j];
                }
                C[i][j] += running_sum;
            }
        }
    }
    ...... // write C to matC
}
======================================================
According to the report, there are 8 threads being pipelined from the loop "for (int k = 0; k < MAT_A_COLS / BLOCK_SIZE; k++)", and thus my A and B are replicated 7 times. Is it possible to prevent the memory from being replicated? Any advice would be greatly appreciated!
Lancer Chiang
Please post your full compilation report or full kernel. Assuming that this is the type of replication the compiler performs to allow a certain number of parallel iterations, you can control this level of parallelism by using the "#pragma max_concurrency" before the loop.
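A minimal sketch of where the pragma would go, assuming it is placed on the outer block loop of a kernel shaped like the one in the question. The kernel name `matrix_mult_tile`, the row-major indexing, and the choice to compute only the top-left tile of C are assumptions made for illustration; the original indexing is elided in the post. The small shim at the top is only there so the sketch can also be built as plain C for a functional check.

```c
/* Shim: lets the sketch build as plain C for a quick functional check.
   An OpenCL compiler defines __OPENCL_VERSION__ and skips it. */
#ifndef __OPENCL_VERSION__
#define __kernel
#define __global
#define __local
#endif

#define MAT_A_COLS 64
#define MAT_B_COLS 128
#define BLOCK_SIZE 16

/* Hypothetical kernel: computes only the top-left BLOCK_SIZE x BLOCK_SIZE
   tile of C = A * B, with all matrices stored row-major (an assumption). */
__kernel void matrix_mult_tile(__global const float *restrict matA,
                               __global const float *restrict matB,
                               __global float *restrict matC)
{
    __local float A[BLOCK_SIZE][BLOCK_SIZE];
    __local float B[BLOCK_SIZE][BLOCK_SIZE];
    __local float C[BLOCK_SIZE][BLOCK_SIZE];

    /* Clear the accumulator tile. */
    for (int i = 0; i < BLOCK_SIZE; i++)
        for (int j = 0; j < BLOCK_SIZE; j++)
            C[i][j] = 0.0f;

    /* Request at most one iteration of the block loop in flight, so the
       compiler should not need per-iteration private copies of A and B. */
    #pragma max_concurrency 1
    for (int kb = 0; kb < MAT_A_COLS / BLOCK_SIZE; kb++) {
        /* Load block kb of the first BLOCK_SIZE rows of A. */
        for (int i = 0; i < BLOCK_SIZE; i++)
            for (int j = 0; j < BLOCK_SIZE; j++)
                A[i][j] = matA[i * MAT_A_COLS + kb * BLOCK_SIZE + j];

        /* Load block kb of the first BLOCK_SIZE columns of B. */
        for (int i = 0; i < BLOCK_SIZE; i++)
            for (int j = 0; j < BLOCK_SIZE; j++)
                B[i][j] = matB[(kb * BLOCK_SIZE + i) * MAT_B_COLS + j];

        /* Multiply the two blocks and accumulate into the C tile. */
        for (int i = 0; i < BLOCK_SIZE; i++) {
            for (int j = 0; j < BLOCK_SIZE; j++) {
                float running_sum = 0.0f;
                for (int k = 0; k < BLOCK_SIZE; k++)
                    running_sum += A[i][k] * B[k][j];
                C[i][j] += running_sum;
            }
        }
    }

    /* Write the finished tile back to global memory. */
    for (int i = 0; i < BLOCK_SIZE; i++)
        for (int j = 0; j < BLOCK_SIZE; j++)
            matC[i * MAT_B_COLS + j] = C[i][j];
}
```

Whether a given value actually removes the replication has to be confirmed in the compilation report; a larger value such as 2 trades some memory back for more overlap between block iterations.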
Hi HRZ, thanks for your reply! That's good advice! (I did not know I could add this, haha.)
Hi HRZ, here is the report for one of my local memories:
conv.cl:149 (data): Local memory: Potentially inefficient configuration. Requested size 65536 bytes (rounded up to nearest power of 2), implemented size 458752 bytes, replicated 7 times total, stallable, 64 reads and 1 write. Additional information:
- Reduce the number of write accesses or fix banking to make this memory system stall-free. Banking may be improved by using compile-time known indexing on lowest array dimension.
- Replicated 7 times to create private copies for simultaneous execution of 7 threads in the loop containing accesses to the array.
- Banked on lowest dimension into 64 separate banks (this is a good thing).
I don't understand what the seven threads are.
Latency of accesses to multi-ported on-chip buffers is not one cycle; hence, the compiler has to further replicate the buffer that is accessed in the loop so that loop iterations in flight in the pipeline can access different copies of the same buffer in parallel, resulting in correct full pipelining and an initiation interval of one. If the "#pragma max_concurrency" I mentioned above does not reduce this replication factor (e.g. "#pragma max_concurrency 2"), then using "#pragma ii" might (e.g. "#pragma ii 3"). Note that all of these come at the cost of lower performance, probably MUCH lower performance.
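The sizes in the report quoted earlier are consistent with this explanation: the implemented size is the requested size multiplied by the number of private copies. A small C helper makes the arithmetic explicit. The function name `implemented_bytes` is made up for this illustration, and the model assumes loop concurrency is the only source of replication, which matches this particular report but may not hold in general.

```c
#include <stddef.h>

/* Illustrative model only: bytes of on-chip memory used when a local
   array of requested_bytes is kept as `copies` private replicas. */
size_t implemented_bytes(size_t requested_bytes, size_t copies)
{
    return requested_bytes * copies;
}
```

With the figures from the report, `implemented_bytes(65536, 7)` gives 458752, the reported implemented size. If a concurrency cap of 2 were honored by the compiler, the same array would take `implemented_bytes(65536, 2)`, or 131072 bytes, under this model.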
Hi HRZ, I have had some problems in my design that I could not solve for a very long time. Could you please help me a bit?