Hi,
I have a local memory in my kernel code, of type int4, declared as follows:
local int4 E[16][2];
My single work-item kernel accesses this array with 3 reads and 3 writes inside a loop, with static addressing on the lower dimension. As expected, the compiler banked the memory into 2 banks for parallel access on the lower dimension and double-pumped (double-clocked) it for additional read and write ports.
The memory is therefore small: 16 * 2 * 16 bytes = 512 bytes, and the memory details in the compiler report confirm a 512-byte implementation.
However, I cannot see how the block RAM count for this memory was arrived at.
I have attached the memory reports generated by the compiler; as you can see, this memory consumes 14 RAM blocks, which is far more than expected.
This was implemented on an Arria 10 SoC development board, whose block RAMs are M20Ks. In the 512-deep x 32-bit-wide configuration, one M20K provides 32 * 512 / 8 = 2048 bytes, so a single M20K block should have been enough for the implementation.
Why does the compiler allocate 14 RAM blocks to this memory even though the report says 512 bytes requested, 512 bytes implemented?
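For reference, a rough sketch of how the configuration described above could be requested explicitly with the memory attributes; the attribute values here are my own assumption of what matches the report, not copied from the actual kernel:
// E holds 16 x 2 int4 elements = 16 * 2 * 16 bytes = 512 bytes in total.
local int4 E[16][2] __attribute__((memory,
                                   numbanks(2),    // bank on the lower dimension
                                   bankwidth(16),  // one int4 (16 bytes) per bank word
                                   doublepump));   // 2x RAM clock for extra ports
// One M20K in its 512-deep x 32-bit-wide mode holds 32 * 512 / 8 = 2048 bytes,
// so capacity alone would easily fit in a single M20K.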
Hi Jefin,
Thanks for contacting community support.
Could you please provide the code snippet of the loop you mentioned, and also let us know whether any pragmas are applied to that loop?
Thanks and Regards
Anil
channel float chanout_float __attribute__((depth(0)));
channel char chanout_binary __attribute__((depth(0)));

__attribute__((max_global_work_dim(0)))
kernel void test_kernel(global int4 *restrict frame_out)
{
    local int4 E[16][2];
    float max_arr[16][2] __attribute__((memory, numbanks(2), bankwidth(4), doublepump, numreadports(2), numwriteports(2)));
    local int4 label_arr[128];
    float float_dat;
    char bin_dat[3];
    int count = 0;
    // Note: MAXVAL, MAXIND, E_count, E_flag_st and pri1 are declared elsewhere and not shown in this snippet.

    for (short j = 0; j < 512; j++)
    {
        #pragma unroll
        for (char k = 0; k < 3; k++)
            bin_dat[k] = 0;

        int E_span = 0;
        MAXVAL = 0;

        for (short i = 0; i < 1024; i++)
        {
            // Shift in the next binary and float samples from the channels.
            #pragma unroll
            for (char k = 0; k < 2; k++)
                bin_dat[k] = bin_dat[k + 1];
            bin_dat[2] = read_channel_intel(chanout_binary);
            float_dat = read_channel_intel(chanout_float);

            bool det = bin_dat[2];
            float E_val = float_dat;
            bool max_cond = (MAXVAL > E_val) || (det == 0);
            E_flag_st = det || pri1[1] || pri1[0];

            if (E_flag_st)
            {
                E_span = E_span + E_flag_st;
                MAXIND = max_cond ? MAXIND : (j * 1024 + i);
            }
            else
            {
                E[E_count][1] = (int4)(i - E_span, E_span, (i == 0) ? count : 32767, MAXIND);
                max_arr[E_count][1] = MAXVAL;
                count = (i == 0) ? (count + 1) : count;
                E_count = E_count + 1;
            }
        }

        #pragma ivdep
        for (char l = 0; l < 16; l++)
        {
            int4 tempE1 = E[l][1];
            float pres_maxval = max_arr[l][1];

            #pragma ivdep
            for (char k = 0; k < 16; k++)
            {
                int4 tempE0 = E[k][0];
                if (<condition>)
                {
                }
            }

            max_arr[l][1] = <updated value>;
            E[l][1].s23 = <updated value>;
        }

        for (char l = 0; l < 16; l++)
        {
            E[l][0] = E[l][1];
            max_arr[l][0] = max_arr[l][1];
            int4 tempE1 = E[l][1];
            short int E1_ind = tempE1.s2;
            if (<condition>)
            {
                label_arr[E1_ind] = (int4)(E1_ind, *(int *)&max_arr[l][1], tempE1.s3, 0);
            }
        }
    }

    int label_count_final = 0;
    for (short i = 0; i < 128; i++)
    {
        int4 temp_labelarr = label_arr[i];
        if (temp_labelarr.s0 != 0)
        {
            frame_out[label_count_final & 127] = (int4)(label_count_final + 1, temp_labelarr.s1, temp_labelarr.s2, 0);
            label_count_final = label_count_final + 1;
        }
        else
            frame_out[label_count_final & 127] = 0;
    }
}
As you can see, I have used #pragma ivdep on two of the loops.
Things have changed quite a bit with respect to M20K replication in the compiler since the last time I used it, but there are a few things to keep in mind in your case:
- Your buffer has a width of 128 bits, while the physical width of an M20K port is 32 bits. As such, a minimum of four M20Ks is required to support a width of 128 bits, even if your buffer is small enough to fit in one M20K.
- Each M20K has only two physical ports, which double-pumping extends to 4 virtual ports. However, your buffer requires 6 simultaneous reads and writes in total, so again it is impossible to implement it using only one M20K because there are not enough ports.
The way I would count the replication factor is that all write ports need to be connected to all M20Ks used for the buffer, while each read port only needs to be connected to one. With double-pumping, 3 out of the 4 virtual ports in each M20K are occupied by write ports, so three M20Ks are needed to support all 6 simultaneous accesses. On top of that, the whole structure needs to be replicated 4 times to cover the 128-bit width, leading to a total of 12 M20Ks to implement your buffer. I am not sure why the compiler is counting 14 here, though. The replication factors are usually mentioned in the "Additional information" part of the report, which seems to have been cut off in your screenshot. Is there anything else mentioned in that part of the report?
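To put the counting in one place, here is the same arithmetic written as comments on the declaration (just a sketch of my reasoning; the 14-versus-12 gap in your report is left open):
// Width:  int4 = 128 bits, an M20K port is 32 bits wide
//         -> 128 / 32 = 4 M20K slices side by side just to cover the width.
// Ports:  each M20K has 2 physical ports; double-pumping gives 4 virtual ports.
//         The 3 writes must reach every replicate, leaving 1 virtual port per
//         replicate for reads -> 3 replicates are needed for 3 simultaneous reads.
// Total:  4 width slices * 3 replicates = 12 M20Ks (the report shows 14).
local int4 E[16][2];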
I did not find any other details panel apart from the ones I already posted. However, this clears up my doubt as to why so many M20Ks are being used. Why the compiler adds 2 extra M20Ks is still an open question; I'll run some more compilations to see when the compiler starts assigning the extra M20Ks.
Thanks
Hi Jefin,
There are some optimizations with which you can fine-tune the memory usage, but it is usually a complexity versus use-case trade-off; a couple of examples are sketched below.
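For instance (a rough sketch only; max_replicates is an attribute from the Intel FPGA SDK for OpenCL that I am assuming applies to this buffer, E2 is just an illustrative name, and capping replication can introduce stalls when reads have to share ports):
// Cap the number of replicates the compiler may create for E:
// fewer M20Ks, but the 3 reads may have to share ports.
local int4 E[16][2] __attribute__((memory, max_replicates(1)));

// Or state the required ports explicitly, as already done for max_arr,
// so the port/replication trade-off is visible in the source:
local int4 E2[16][2] __attribute__((memory, doublepump,
                                    numreadports(3), numwriteports(3)));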
Thanks for having this interesting discussion with the community.
Thanks and Regards
Anil