Altera_Forum
Honored Contributor I
1,265 Views

How to reduce M20K usage?

Hi, I have a problem when reading some data from global memory.

 

```c
typedef struct { float data; } vector_line;
typedef struct { vector_line lane; } channel_vec;

__kernel void ReadBlock(__global const vector_line *restrict img,
                        __global channel_vec *restrict coef)
{
    vector_line img_vec;
    channel_vec img_ch_vec;
    channel_vec coef_ch_vec;

    img_vec = img[0];      /* index elided in the original post */
    coef_ch_vec = coef[0]; /* index elided in the original post */
}
```

 

This line

coef_ch_vec = coef[some addr]

consumes 600+ M20K blocks in the preliminary report.

After compilation I saw 310 M20K blocks.

That is far too much for a storage array of 16*8*4 = 512 bytes.

How can I reduce the number of M20K blocks?
0 Kudos
8 Replies
Altera_Forum
Honored Contributor I

This code snippet is not enough to explain why. Block RAMs on the FPGA only have two ports. Each buffer is replicated based on the number of accesses to it, further increasing the Block RAM requirement. In NDRange kernels, each buffer is further replicated to allow simultaneous accesses from work groups running in parallel. Previous versions of the report clearly showed the requested buffer size, number of reads and writes, number of replications, and final implemented size. This was not ported to the HTML report available in v16.1. However, the report in v17 has a new tab that shows some info about the local buffers; I have never personally used the latter, though.

Altera_Forum
Honored Contributor I

```c
#pragma OPENCL EXTENSION cl_altera_channels : enable

typedef struct { float data; } vector_line;
typedef struct { vector_line lane; } channel_vec;

channel channel_vec coef_ch __attribute__((depth(0)));

__kernel void ReadBlock(uchar dim1, uchar dim2,
                        // Data
                        __global channel_vec *restrict coef)
{
    int loc_x = get_local_id(0);
    int loc_y = get_local_id(1);
    int loc_z = get_local_id(2);
    int block_x = get_group_id(0);
    int block_y = get_group_id(1);
    int block_z = get_group_id(2);

    channel_vec coef_ch_vec;
    coef_ch_vec = coef[0]; /* index elided in the original post */
    write_channel_altera(coef_ch, coef_ch_vec);
}
```

 

 

This is a more detailed snippet of the code. Is it enough?
Altera_Forum
Honored Contributor I

I see just one read from the "coef" buffer in your code; that alone is certainly not the reason you are getting such high M20K utilization.

 

If you are using Quartus/AOC v16.1.2 or below, it seems the old report can still be obtained by running "aocl analyze-area" on the aoc/aocx file. Can you generate and post that report?
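For example (the file name here is a placeholder; use your own compiled output):

```shell
# Regenerate the old-style area report from an already-compiled kernel
# ("kernel.aocx" is hypothetical; point it at your .aoco/.aocx file).
aocl analyze-area kernel.aocx
```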
Altera_Forum
Honored Contributor I

Here is the area report:

Altera_Forum
Honored Contributor I

Based on your log, 309 RAMs are being used by the BSP, 103 are being used by the channel, and 643 and 83 RAMs for two memory loads. 

 

You cannot change or reduce the amount used by the BSP. 

The channel depth you have requested is zero, but the compiler has decided that a depth of 4096 is better for you, hence the high RAM requirement. Channel depth is one of the things the compiler regularly overestimates, yet there is no way for the user to override it. 

The RAMs used for the external memory loads are mostly used for the private cache. You can reduce this amount by adding the "volatile" tag to your __global "coef" buffer. The cache can help a lot if your code does a lot of repeated accesses, but if it doesn't, the cache is useless and just wastes RAMs. There will still be some RAMs used for the access even with the volatile tag, because the compiler tries to hide the latency of memory accesses by inserting buffers between the kernel and the memory interface.
Altera_Forum
Honored Contributor I

Thank you for this explanation.  

How can I choose parameters to optimize for FPGA area usage rather than performance?

For example, in Quartus I have this option.
Altera_Forum
Honored Contributor I

There is no such option in OpenCL. The OpenCL compiler by default optimizes your code for the highest throughput, and uses as much area as possible to achieve this goal.

Altera_Forum
Honored Contributor I

 

--- Quote Start ---  

Based on your log, 309 RAMs are being used by the BSP, 103 are being used by the channel, and 643 and 83 RAMs for two memory loads. 

 

You cannot change or reduce the amount used by the BSP. 

The channel depth you have requested is zero, but the compiler has decided that a depth of 4096 is better for you, hence the high RAM requirement. Channel depth is one of the things the compiler regularly overestimates, yet there is no way for the user to override it. 

The RAMs used for the external memory loads are mostly used for the private cache. You can reduce this amount by adding the "volatile" tag to your __global "coef" buffer. The cache can help a lot if your code does a lot of repeated accesses, but if it doesn't, the cache is useless and just wastes RAMs. There will still be some RAMs used for the access even with the volatile tag, because the compiler tries to hide the latency of memory accesses by inserting buffers between the kernel and the memory interface.

--- Quote End ---  

 

 

```c
#pragma OPENCL EXTENSION cl_altera_channels : enable

typedef struct { float data; } vector_line;
typedef struct { vector_line lane; } channel_vec;

channel channel_vec coef_ch __attribute__((depth(0)));

__kernel void ReadBlock(uchar dim1, uchar dim2,
                        // Data
                        __global volatile channel_vec *restrict coef)
{
    int loc_x = get_local_id(0);
    int loc_y = get_local_id(1);
    int loc_z = get_local_id(2);
    int block_x = get_group_id(0);
    int block_y = get_group_id(1);
    int block_z = get_group_id(2);

    channel_vec coef_ch_vec;
    coef_ch_vec = coef[0]; /* index elided in the original post */
    write_channel_altera(coef_ch, coef_ch_vec);
}
```

 

I added 'volatile' to this code, but it had no effect on RAM usage. Is that possible?