Altera_Forum
Honored Contributor I
1,265 Views

How to reduce M20K usage?

Hi, I have a problem when reading some data from global memory.

 

```c
typedef struct { float data; } vector_line;
typedef struct { vector_line lane; } channel_vec;

__kernel void ReadBlock(__global const vector_line *restrict img,
                        __global channel_vec *restrict coef)
{
    vector_line img_vec;
    channel_vec img_ch_vec;
    channel_vec coef_ch_vec;

    img_vec = img[0];      /* index elided in the original post */
    coef_ch_vec = coef[0]; /* index elided in the original post */
}
```

 

This line

coef_ch_vec = coef[some addr]

consumes 600+ M20K blocks in the preliminary report.

After compilation I saw 310 M20K blocks.

That is far too much for a storage array of 16*8*4 = 512 bytes.

How can I reduce the number of M20K blocks?
0 Kudos
8 Replies
Altera_Forum
Honored Contributor I

This code snippet is not enough to explain why. Block RAMs on the FPGA only have two ports. Each buffer is replicated based on the number of accesses to it, further increasing the Block RAM requirement. In NDRange kernels, each buffer is further replicated to allow simultaneous accesses from work groups running in parallel. Previous versions of the report clearly showed the requested buffer size, number of reads and writes, number of replications, and final implemented size. This was not ported to the HTML report available in v16.1. However, the report in v17 has a new tab that shows some info about the local buffers; I have never personally used the latter, though.

Altera_Forum
Honored Contributor I

```c
#pragma OPENCL EXTENSION cl_altera_channels : enable

typedef struct { float data; } vector_line;
typedef struct { vector_line lane; } channel_vec;

channel channel_vec coef_ch __attribute__((depth(0)));

__kernel void ReadBlock(uchar dim1, uchar dim2,
                        // Data
                        __global channel_vec *restrict coef)
{
    int loc_x = get_local_id(0);
    int loc_y = get_local_id(1);
    int loc_z = get_local_id(2);
    int block_x = get_group_id(0);
    int block_y = get_group_id(1);
    int block_z = get_group_id(2);

    channel_vec coef_ch_vec;
    coef_ch_vec = coef[0]; /* index elided in the original post */
    write_channel_altera(coef_ch, coef_ch_vec);
}
```

 

 

This is a more detailed snippet of the code. Is it enough?
Altera_Forum
Honored Contributor I

I see just one read from the "coef" buffer in your code; that alone is certainly not the reason you are getting such high M20K utilization.

 

If you are using Quartus/AOC v16.1.2 or below, it seems the old report can still be obtained by running "aocl analyze-area" on the aoc/aocx file. Can you generate and post that report?
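For example (the file name here is a placeholder; use your own compiled output):

```shell
# Regenerate the old-style area report from an already-compiled kernel
# ("kernel.aocx" is hypothetical; point it at your .aoco/.aocx file).
aocl analyze-area kernel.aocx
```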
Altera_Forum
Honored Contributor I

Here is the area report:

Altera_Forum
Honored Contributor I

Based on your log, 309 RAMs are being used by the BSP, 103 are being used by the channel, and 643 and 83 RAMs for two memory loads. 

 

You cannot change or reduce the amount used by the BSP. 

The channel depth you have requested is zero, but the compiler has decided that a depth of 4096 is better for you, hence the high RAM requirement. Channel depth is one of the things the compiler regularly overestimates, yet there is no way for the user to override it. 

The RAMs used for the external memory loads are mostly used for the private cache. You can reduce this amount by adding the "volatile" tag to your __global "coef" buffer. The cache can help a lot if your code does a lot of repeated accesses, but if it doesn't, the cache is useless and just wastes RAMs. There will still be some RAMs used for the access even with the volatile tag, because the compiler tries to hide the latency of memory accesses by inserting buffers between the kernel and the memory interface.
Altera_Forum
Honored Contributor I

Thank you for this explanation.  

How can I choose parameters to optimize for FPGA area usage rather than performance?

For example, in Quartus I have this option.
Altera_Forum
Honored Contributor I

There is no such option in OpenCL. The OpenCL compiler by default optimizes your code for the highest throughput, and uses as much area as possible to achieve this goal.

Altera_Forum
Honored Contributor I

 

--- Quote Start ---  

Based on your log, 309 RAMs are being used by the BSP, 103 are being used by the channel, and 643 and 83 RAMs for two memory loads. 

 

You cannot change or reduce the amount used by the BSP. 

The channel depth you have requested is zero, but the compiler has decided that a depth of 4096 is better for you, hence the high RAM requirement. Channel depth is one of the things the compiler regularly overestimates, yet there is no way for the user to override it. 

The RAMs used for the external memory loads are mostly used for the private cache. You can reduce this amount by adding the "volatile" tag to your __global "coef" buffer. The cache can help a lot if your code does a lot of repeated accesses, but if it doesn't, the cache is useless and just wastes RAMs. There will still be some RAMs used for the access even with the volatile tag, because the compiler tries to hide the latency of memory accesses by inserting buffers between the kernel and the memory interface.

--- Quote End ---  

 

 

```c
#pragma OPENCL EXTENSION cl_altera_channels : enable

typedef struct { float data; } vector_line;
typedef struct { vector_line lane; } channel_vec;

channel channel_vec coef_ch __attribute__((depth(0)));

__kernel void ReadBlock(uchar dim1, uchar dim2,
                        // Data
                        __global volatile channel_vec *restrict coef)
{
    int loc_x = get_local_id(0);
    int loc_y = get_local_id(1);
    int loc_z = get_local_id(2);
    int block_x = get_group_id(0);
    int block_y = get_group_id(1);
    int block_z = get_group_id(2);

    channel_vec coef_ch_vec;
    coef_ch_vec = coef[0]; /* index elided in the original post */
    write_channel_altera(coef_ch, coef_ch_vec);
}
```

 

I added 'volatile' to this code, but it had no effect on RAM usage. Is that possible?