Application Acceleration With FPGAs
Programmable Acceleration Cards (PACs), DCP, DLA, Software Stack, and Reference Designs

Local memory RAM block consumption in OpenCL report

JJaco16
Novice
1,029 Views

Hi,

I have a local memory buffer declared in my kernel code. It is of type int4 and is declared as follows:

local int4 E[16][2];

My single work-item kernel accesses this array with 3 reads and 3 writes inside a loop, with static (compile-time-constant) indexing on the lower dimension. As expected, the compiler split the memory into 2 banks for parallel access on the lower dimension and double-pumped it for additional read and write ports.

As such, the memory is small: it needs only 512 bytes, and the memory details in the compiler report confirm a 512-byte implementation.

However, I am not able to see how the block-RAM count for this memory was calculated.

I have attached the memory reports generated by the compiler, and as you can see this memory consumes 14 RAM blocks, which is far more than expected.

Attachments: local_mem.png, local_mem view.png, local_mem details.png

This was implemented on an Arria 10 SoC development board, whose block RAMs are M20Ks. In a 512-deep x 32-bit-wide configuration, one M20K provides 32*512/8 = 2048 bytes, so a single M20K block should have been enough for the implementation.

 

Why does the compiler allocate 14 RAM blocks to this memory even though the report says 512 bytes requested, 512 bytes implemented?

 

 

6 Replies
AnilErinch_A_Intel
1,019 Views

Hi Jefin,

Thanks for contacting the community support.

Could you please provide the code snippet of the loop mentioned, and let us know whether any pragmas are applied to the loop?

Thanks and Regards

Anil


JJaco16
Novice
1,003 Views

 

channel float chanout_float __attribute__((depth(0)));
channel char chanout_binary __attribute__((depth(0)));

__attribute__((max_global_work_dim(0)))
 kernel void test_kernel(global int4 *restrict frame_out)
{
    local int4 E[16][2]; 
    float max_arr[16][2] __attribute__((memory,numbanks(2),bankwidth(4),doublepump,numreadports(2),numwriteports(2)));
    local int4 label_arr[128];
    float float_dat;
    char bin_dat[3];
    int count =0 ;
    for(short j = 0; j<512;j++)
	{
        #pragma unroll
		for(char k = 0; k<3;k++)
			bin_dat[k] = 0;

        int E_span = 0;
        MAXVAL = 0;
        for(short  i = 0; i<1024;i++)
        {

            #pragma unroll
			for(char k =0 ; k<2;k++)
				bin_dat[k] = bin_dat[k+1];

            bin_dat[2] = read_channel_intel(chanout_binary); 
			float_dat = read_channel_intel(chanout_float);
            bool det = bin_dat[2];
			float E_val = float_dat;
            bool max_cond = (MAXVAL>E_val)||((det==0));
            E_flag_st = det||pri1[1]||pri1[0];
            if(E_flag_st)
            {
                E_span = E_span + E_flag_st; 
                MAXIND      = max_cond?MAXIND:(j*1024+i);
            }
            else
            {
                E[E_count][1] = (int4)(i-E_span,E_span,(i==0)?count:32767,MAXIND);
                max_arr[E_count][1] = MAXVAL;
                count = (i==0)?(count+1):count;
                E_count = E_count +1;
            }
        }



        #pragma ivdep 
        for(char l = 0 ; l<16; l++)
        {	
            int4 tempE1 = E[l][1];
            float pres_maxval = max_arr[l][1];

            #pragma ivdep 
            for(char k = 0 ; k<16;k++)
            {
                int4 tempE0 = E[k][0];
                
                if(<condition>)
                {

                }
            }
            max_arr[l][1] = <updated value>;
            E[l][1].s23 = <updated value>;
        }

        for(char l = 0;l<16;l++)
		{
            E[l][0] = E[l][1];
            max_arr[l][0] = max_arr[l][1];
            int4 tempE1 = E[l][1];
			short int E1_ind = tempE1.s2;
			if(<condition>)
			{
                label_arr[E1_ind] = (int4)(E1_ind,*(int *)&max_arr[l][1],tempE1.s3,0);
			}
			
		}

    }


    int label_count_final = 0;
    for(short i = 0; i<128;i++)
    {
        int4 temp_labelarr = label_arr[i];

        if(temp_labelarr.s0!=0)
        {
            frame_out[label_count_final & 127] = (int4)(label_count_final+1,temp_labelarr.s1, temp_labelarr.s2, 0);
            label_count_final = label_count_final + 1;
        }
        else
            frame_out[label_count_final & 127] = 0;
    }
}

 

 

JJaco16
Novice
1,002 Views

As you can see, I have used #pragma ivdep on two of the loops.

HRZ
Valued Contributor II
989 Views

Things have changed quite a bit with respect to M20K replication in the compiler since I last used it, but there are a few things to keep in mind in your case:

- Your buffer has a width of 128 bits, while the physical width of an M20K port is 32 bits. As such, a minimum of four M20Ks is required to support a 128-bit width, even if your buffer is small enough to fit in one M20K.

- Each M20K has only two physical ports which, with double-pumping, are extended to 4 virtual ports. However, your buffer requires 6 simultaneous reads and writes in total, so, again, it is impossible to implement it using only one M20K: there are not enough ports.

The way I would count the replication factor: every write port needs to be connected to all M20Ks used for the buffer, while each read port only needs to be connected to one. With double-pumping, 3 of the 4 virtual ports in each M20K are occupied by write ports, so three M20Ks are needed to support all 6 simultaneous accesses. On top of that, the whole structure needs to be replicated four times to support the 128-bit width, leading to a total of 12 M20Ks required to implement your buffer. I am not sure why the compiler is counting 14 here, though. The replication factors are usually mentioned in the "Additional information" part of the report, which seems to have been cut off in your screenshot. Is there anything else mentioned in that part of the report?

JJaco16
Novice
939 Views

I did not find any other separate details panel apart from the ones I already posted. However, this clears up my doubt as to why so many M20Ks were being used. Why the compiler adds 2 extra M20Ks is still an open question; I'll run some more compilations to see when the compiler starts assigning the extra M20Ks.

Thanks

AnilErinch_A_Intel
856 Views

Hi Jefin,

There are some optimizations with which you can fine-tune the memory usage.

But it is usually a complexity-versus-use-case trade-off.
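For example, the same kinds of memory attributes already applied to max_arr in the posted kernel can be applied to E to constrain how the compiler implements it. This is only a sketch; the attribute names match those in the posted code, but the values here are illustrative and would need tuning against the area report:

```c
// Hypothetical sketch: constrain the local memory geometry explicitly
// instead of leaving the banking/replication choice to the compiler.
local int4 E[16][2] __attribute__((memory,
                                   numbanks(2),    // bank on the lower dimension
                                   bankwidth(16),  // one int4 (16 bytes) per bank access
                                   doublepump));   // trade Fmax for extra ports
```

Reducing the number of simultaneous accesses in the loop body (or splitting the buffer) is the other common way to bring the replication factor down.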

Thanks for having this interesting discussion with the community.

Thanks and Regards

Anil


