Intel® Quartus® Prime Software

cache data on chip ram

Altera_Forum
Honored Contributor II

Hi, 

 

Say I have a large dataset that I want to cache in on-chip RAM; the following code is what I wrote.

 

__kernel(global *input, uint g, uint data_length){
    __local memory data_buffer;
    for(a++){ //for every iteration in 'a' loop
        //renew data on chip ram
        for(g++){ //assume this is line 4
            data_buffer = input;
        }
        for(b++){
            for(g++){
                // load data from __local memory
                data = data_buffer
            }
        }
    }

 

The report shows that I have a memory dependency on line 4, causing the whole block to execute serially. Any advice on how I can fix this or modify the code to remove the dependency?

I saw that some people use a specific pragma to counter this issue, e.g. #pragma ivdep, but I have no idea what it does.

g varies for each kernel execution, so applying the unroll pragma doesn't work. Thank you.
Altera_Forum
Honored Contributor II

If you post your actual code, or a code snippet that compiles so that we can see the actual dependency report, it would be a lot easier to judge. However, based on your pseudo-code, I can tell you have a read-after-write dependency on "data_buffer". Access latency to Block RAM-based on-chip buffers is not single-cycle, and hence such dependencies are unavoidable unless you make your on-chip buffer small enough that it can be implemented using registers. Using #pragma ivdep would be incorrect in your case, since the dependency the compiler detects is a real dependency.
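
For reference, #pragma ivdep in the Intel FPGA SDK for OpenCL is placed directly before a loop (optionally naming a specific array) and tells the offline compiler to ignore the loop-carried memory dependencies it would otherwise infer for that loop. The sketch below only illustrates the syntax; the kernel name, buffer size, and the "offset" argument are made-up assumptions, and the pragma is only legal when the ignored dependency is guaranteed never to occur at runtime, which, as explained above, is not the case in your code.

    // Hypothetical example: the host guarantees offset >= n, so the elements
    // written by the marked loop (buf[0 .. n-1]) never overlap the elements it
    // reads (buf[offset .. offset+n-1]). The dependency the compiler infers is
    // therefore a false one, and it is safe to tell it to ignore it.
    __kernel void ivdep_example(__global const float * restrict in,
                                __global float * restrict out,
                                uint n, uint offset)
    {
        __local float buf[2048];               // assumed size; n + offset must fit

        for (uint i = 0; i < n + offset; ++i)  // preload into on-chip memory
            buf[i] = in[i];

        #pragma ivdep array(buf)               // ignore inferred dependencies on buf
        for (uint i = 0; i < n; ++i)
            buf[i] = buf[i + offset] * 2.0f;   // write low range, read high range

        for (uint i = 0; i < n; ++i)           // write results back to global memory
            out[i] = buf[i];
    }

Roughly speaking, without the pragma the compiler serializes the marked loop because it cannot prove that buf[i] and buf[i + offset] never alias; with the pragma the loop can be pipelined, but if the guarantee is ever violated the results are silently wrong.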

Altera_Forum
Honored Contributor II

My apologies, I thought I could simplify my code. Anyway, here's my actual code:

 

#define Max 2046 //maximum data that can be stored in data_buffer

__kernel void kernel1( global *restrict data, uint col, uint size, uint row){
    __local float data_buffer;
    float data;
    for(int i=0;i<size;++i){
        //preload data into data_buffer so it can be reused
        //change to a new dataset every time size++
        for(int g; g< col; ++g){
            data_buffer = data;
        }
        for(int h=0;h<row;++h){
            for(int f=0; f<col; ++f){
                data = data_buffer; //reuse the same data stored in data_buffer
                writeintelchannel(data_ch,data); //send to another kernel to compute
            }
        }
    }

 

Here, I wanted to put the data on chip because of its smaller access latency compared to global memory, since the kernel repeatedly reads the same data. The data preloader should run every time size increments and should have no effect on the row/col iterations.
Altera_Forum
Honored Contributor II

Please post a code snippet that compiles. Your new code snippet does not compile either due to multiple coding errors. 

 

Or, archive and attach the "report" folder so that I can check the dependency report.
Altera_Forum
Honored Contributor II

Hi HRZ, I'm truly sorry, I got confused. Anyway, here's the report.

 

Interestingly though, I get better performance despite the bottleneck.
Altera_Forum
Honored Contributor II

Some of your __local buffers seem unnecessary to me. For the first kernel, I think you can remove the dependency by reordering the i and j loops, converting the weight_ocr buffer to a single scoped variable, and moving the load from external memory between the j and i loops, as follows:

 

for( a = 0 ; a < depth ; ++a){
    for( j = 0 ; j < col ; ++j){
        lane_data weight_ocr = weights;
        for( i = 0 ; i < row ; ++i){
            #pragma unroll
            for( k = 0; k < LANE_NUM ; ++k){
                data_ch_vec.lane = input;
                //lane_num*col can be passed as a kernel argument because they are constant
                //here use 8 dsp
                //lc = lane_num*col
                //printf("Lane:%d %f %f %f \n",k,data_ch_vec.lane.data,data_ch_vec.lane.data,data_ch_vec.lane.data);
            }
            //load weights
            weight_buffer = weight_ocr; //0,1,2,3,4,5,6 repeat until new filter 7,8,9,10,11,12,13
            write_channel_altera(weight_ch,weight_buffer);
            write_channel_altera(data_ch,data_ch_vec);
        }
    }
}

 

This removes the memory dependency; however, it might break your function so make sure that it works correctly before using it. 

 

For the second kernel, a similar thing can be done. The conv_out buffer does not need to be a __local buffer; you can just replace it with a single scoped variable as follows: 

 

#pragma unroll
for(unsigned char ll=0; ll<LANE_NUM; ll++){
    float conv_out = 0;
    #pragma unroll
    for(unsigned i=0; i<PIPE_DEPTH; i++){
        conv_out += accum_piped;
    }
    conv_ch_in.data = conv_out;

 

This removes the dependency in your second kernel.