Intel® Quartus® Prime Software

cache data on chip ram

Altera_Forum
Honored Contributor II

Hi, 

 

Say I have a large dataset that I want to cache in on-chip RAM; the following code is what I wrote.

 

__kernel(global *input, uint g, uint data_length){
    __local memory data_buffer;
    for(a++){ //for every iteration in 'a' loop
        //renew data on chip ram
        for(g++){ //assume this is line 4
            data_buffer = input;
        }
        for(b++){
            for(g++){
                // load data from __local memory
                data = data_buffer
            }
        }
    }

 

The report shows that I have a memory dependency on line 4, causing the whole block to execute serially. Any advice on how I can fix this or modify the code to remove the dependency?

I saw that some people use a specific pragma to counter this issue, e.g. #pragma ivdep, but I have no idea what it does.

g varies for each kernel execution, so applying the unroll pragma doesn't work. Thank you.
Altera_Forum
Honored Contributor II

If you post your actual code, or a code snippet that compiles so that we can see the actual dependency report, it would be a lot easier to judge. However, based on your pseudo-code, I can tell you have a read-after-write dependency on "data_buffer". Access latency to Block RAM-based on-chip buffers is not single-cycle, and hence such dependencies are unavoidable unless you make your on-chip buffer small enough that it can be implemented using registers. Using #pragma ivdep would be incorrect in your case, since the dependency the compiler detects is a real dependency.
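
For reference, #pragma ivdep in the Intel FPGA SDK for OpenCL is placed directly before a loop (optionally naming a specific array) and tells the offline compiler to ignore the loop-carried memory dependencies it would otherwise infer for that loop. The sketch below only illustrates the syntax; the kernel name, buffer size, and the "offset" argument are made-up assumptions, and the pragma is only legal when the ignored dependency is guaranteed never to occur at runtime, which, as explained above, is not the case in your code.

    // Hypothetical example: the host guarantees offset >= n, so the elements
    // written by the marked loop (buf[0 .. n-1]) never overlap the elements it
    // reads (buf[offset .. offset+n-1]). The dependency the compiler infers is
    // therefore a false one, and it is safe to tell it to ignore it.
    __kernel void ivdep_example(__global const float * restrict in,
                                __global float * restrict out,
                                uint n, uint offset)
    {
        __local float buf[2048];               // assumed size; n + offset must fit

        for (uint i = 0; i < n + offset; ++i)  // preload into on-chip memory
            buf[i] = in[i];

        #pragma ivdep array(buf)               // ignore inferred dependencies on buf
        for (uint i = 0; i < n; ++i)
            buf[i] = buf[i + offset] * 2.0f;   // write low range, read high range

        for (uint i = 0; i < n; ++i)           // write results back to global memory
            out[i] = buf[i];
    }

Roughly speaking, without the pragma the compiler serializes the marked loop because it cannot prove that buf[i] and buf[i + offset] never alias; with the pragma the loop can be pipelined, but if the guarantee is ever violated the results are silently wrong.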

Altera_Forum
Honored Contributor II

My apologies, I thought I could simplify my code. Anyway, here's my actual code:

 

#define Max 2046 //maximum data that can be stored in data_buffer

__kernel void kernel1( global *restrict data, uint col, uint size, uint row){
    __local float data_buffer;
    float data;
    for(int i=0;i<size;++i){
        //preload data into data_buffer so it can be reused
        //change to a new dataset every time size++
        for(int g; g< col; ++g){
            data_buffer = data;
        }
        for(int h=0;h<row;++h){
            for(int f=0; f<col; ++f){
                data = data_buffer; //reuse the same data stored in data_buffer
                writeintelchannel(data_ch,data); //send to another kernel to compute
            }
        }
    }

 

Here, I wanted to put the data on chip because of its smaller access latency compared to global memory, since the kernel repeatedly reads the same data. The data preloader should run every time size increments and should have no effect on the row/col iterations.
Altera_Forum
Honored Contributor II

Please post a code snippet that compiles. Your new code snippet does not compile either due to multiple coding errors. 

 

Or, archive and attach the "report" folder so that I can check the dependency report.
Altera_Forum
Honored Contributor II

Hi HRZ, I'm truly sorry, I got confused. Anyway, here's the report.

 

Interestingly though, I get better performance despite the bottleneck.
Altera_Forum
Honored Contributor II

Some of your __local buffers seem unnecessary to me. For the first kernel, I think you can remove the dependency by reordering the i and j loops, converting the weight_ocr buffer to a single scoped variable, and moving the load from external memory between the j and i loops, as follows:

 

for( a = 0 ; a < depth ; ++a){
    for( j = 0 ; j < col ; ++j){
        lane_data weight_ocr = weights;
        for( i = 0 ; i < row ; ++i){
            #pragma unroll
            for( k = 0; k < LANE_NUM ; ++k){
                data_ch_vec.lane = input;
                //lane_num*col can be passed as a kernel argument because they are constant
                //here use 8 dsp
                //lc = lane_num*col
                //printf("Lane:%d %f %f %f \n",k,data_ch_vec.lane.data,data_ch_vec.lane.data,data_ch_vec.lane.data);
            }
            //load weights
            weight_buffer = weight_ocr; //0,1,2,3,4,5,6 repeat until new filter 7,8,9,10,11,12,13
            write_channel_altera(weight_ch,weight_buffer);
            write_channel_altera(data_ch,data_ch_vec);
        }
    }
}

 

This removes the memory dependency; however, it might break your function so make sure that it works correctly before using it. 

 

For the second kernel, a similar thing can be done. The conv_out buffer does not need to be a __local buffer; you can just replace it with a single scoped variable as follows: 

 

#pragma unroll
for(unsigned char ll=0; ll<LANE_NUM; ll++){
    float conv_out = 0;
    #pragma unroll
    for(unsigned i=0; i<PIPE_DEPTH; i++){
        conv_out += accum_piped;
    }
    conv_ch_in.data = conv_out;

 

This removes the dependency in your second kernel.