Hi,
Say I have a large dataset that I want to cache in on-chip RAM; the following code is what I wrote.
__kernel void kernel1(__global float *input, uint g, uint data_length) {
    __local float data_buffer;   // on-chip memory
    for (a++) {
        // for every iteration of the 'a' loop,
        // refresh the data in on-chip RAM
        for (g++) {              // assume this is line 4
            data_buffer = input;
        }
        for (b++) {
            for (g++) {
                // load data from __local memory
                data = data_buffer;
            }
        }
    }
}
The report shows that I have a memory dependency on line 4, causing the whole block to execute serially. Any advice on how I can fix this or modify it to remove the dependency? I saw some people use a specific pragma to counter this issue, e.g. #pragma ivdep, but I have no idea what it does. g varies for each kernel execution, so adding an unroll pragma doesn't work. Thank you.
If you post your actual code, or a code snippet that compiles so that we can see the actual dependency report, it would be a lot easier to judge. However, based on your pseudo-code, I can tell that you have a read-after-write dependency on "data_buffer". Access latency to Block RAM-based on-chip buffers is not single-cycle, and hence such dependencies are unavoidable unless you make your on-chip buffer small enough that it can be implemented using registers instead. Using #pragma ivdep would be incorrect in your case, since the dependency the compiler detects is a real dependency.
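For illustration, a register-sized buffer might look like the sketch below. Everything here is hypothetical (the kernel name, the BUF_SIZE of 8, the doubling operation), and the OpenCL qualifiers are stubbed out with empty macros purely so the fragment also compiles as plain host C; in a real .cl file the compiler provides those qualifiers.

```c
/* Stand-in macros so this device-code sketch compiles as plain C on the
 * host as well; a real OpenCL .cl file defines these qualifiers itself. */
#define __kernel
#define __global
#define __local

/* Hypothetical size: small enough that the offline compiler can place
 * the buffer in registers instead of block RAM. */
#define BUF_SIZE 8

__kernel void scale_by_two(__global const float *in, __global float *out,
                           unsigned n)  /* n assumed a multiple of BUF_SIZE */
{
    __local float buf[BUF_SIZE];

    for (unsigned i = 0; i < n; i += BUF_SIZE) {
        /* refresh the on-chip copy */
        for (unsigned j = 0; j < BUF_SIZE; ++j)
            buf[j] = in[i + j];

        /* reuse the on-chip copy; because the buffer is tiny and fully
         * indexable, it can live in registers rather than block RAM */
        for (unsigned j = 0; j < BUF_SIZE; ++j)
            out[i + j] = buf[j] * 2.0f;
    }
}
```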
My apologies, I thought I could simplify my code. Anyway, here's my actual code:
#define Max 2046   // maximum number of elements data_buffer can store

__kernel void kernel1(global *restrict data, uint col, uint size, uint row) {
    __local float data_buffer;
    float data;
    for (int i = 0; i < size; ++i) {
        // preload data into data_buffer so it can be reused;
        // switch to a new dataset every time size++
        for (int g; g < col; ++g) {
            data_buffer = data;
        }
        for (int h = 0; h < row; ++h) {
            for (int f = 0; f < col; ++f) {
                data = data_buffer;                // reuse the same data stored in data_buffer
                writeintelchannel(data_ch, data);  // send to another kernel to compute
            }
        }
    }
}
Here, I wanted to put the data on-chip because of its smaller access latency compared to global memory, since the kernel repeatedly reads the same data. The data preloader should run every time size increments, and it has no effect on the row/col iterations.
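The intended preload-then-reuse pattern can be checked with a host-side C model: fill a local buffer once, then read it row*col times. Everything below (the function name, the summation in place of channel writes) is invented for illustration and is not from the original kernel; only MAX_COL mirrors the Max constant above.

```c
#define MAX_COL 2046   /* mirrors the poster's Max; value is illustrative */

/* Host-side emulation of preload-then-reuse: load the buffer once, then
 * read the same values back row times, summing them as a stand-in for
 * the real per-element work. */
float reuse_sum(const float *data, unsigned col, unsigned row)
{
    float buffer[MAX_COL];
    float sum = 0.0f;

    for (unsigned g = 0; g < col; ++g)   /* preload once */
        buffer[g] = data[g];

    for (unsigned h = 0; h < row; ++h)   /* reuse row times */
        for (unsigned f = 0; f < col; ++f)
            sum += buffer[f];

    return sum;
}
```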
Please post a code snippet that compiles; your new code snippet does not compile either, due to multiple coding errors.
Alternatively, archive and attach the "report" folder so that I can check the dependency report.
Some of your __local buffers seem unnecessary to me. For the first kernel, I think you can remove the dependency by reordering the i and j loops, converting the weight_ocr buffer to a single scoped variable, and moving the load from external memory between the j and i loops, as follows:
for (a = 0; a < depth; ++a) {
    for (j = 0; j < col; ++j) {
        lane_data weight_ocr = weights;
        for (i = 0; i < row; ++i) {
            #pragma unroll
            for (k = 0; k < LANE_NUM; ++k) {
                // lane_num*col can be passed as a parameter port because they are constant
                // here uses 8 DSPs; lc = lane_num*col
                data_ch_vec.lane = input;
                //printf("Lane:%d %f %f %f \n", k, data_ch_vec.lane.data, data_ch_vec.lane.data, data_ch_vec.lane.data);
            }
            // load weights
            weight_buffer = weight_ocr; // 0,1,2,3,4,5,6 repeat until new filter 7,8,9,10,11,12,13
            write_channel_altera(weight_ch, weight_buffer);
            write_channel_altera(data_ch, data_ch_vec);
        }
    }
}
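The effect of converting the buffer to a scoped variable and hoisting the load between the two loops can be sanity-checked with a small host-side C model. Both functions below are invented stand-ins (a multiply-accumulate in place of the channel writes), not the poster's kernels; they only demonstrate that the hoisted form computes the same result.

```c
enum { ROW = 3, COL = 4 };   /* illustrative sizes */

/* Buffer version: the weight is re-stored into a local array on every
 * inner iteration, creating the redundant read-after-write pattern. */
float with_buffer(const float *weights, const float *input)
{
    float weight_ocr[COL];
    float acc = 0.0f;
    for (int j = 0; j < COL; ++j)
        for (int i = 0; i < ROW; ++i) {
            weight_ocr[j] = weights[j];  /* redundant store each iteration */
            acc += weight_ocr[j] * input[i * COL + j];
        }
    return acc;
}

/* Hoisted version: one scoped scalar, loaded once between the j and i
 * loops; no on-chip buffer and no read-after-write dependency. */
float with_scalar(const float *weights, const float *input)
{
    float acc = 0.0f;
    for (int j = 0; j < COL; ++j) {
        float w = weights[j];            /* loaded once per j */
        for (int i = 0; i < ROW; ++i)
            acc += w * input[i * COL + j];
    }
    return acc;
}
```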
This removes the memory dependency; however, it might break your function, so make sure that it works correctly before using it. For the second kernel, a similar thing can be done. The conv_out buffer does not need to be a __local buffer; you can just replace it with a single scoped variable as follows:
#pragma unroll
for (unsigned char ll = 0; ll < LANE_NUM; ll++) {
    float conv_out = 0;
    #pragma unroll
    for (unsigned i = 0; i < PIPE_DEPTH; i++) {
        conv_out += accum_piped;
    }
    conv_ch_in.data = conv_out;
}
This removes the dependency in your second kernel.
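A host-side C rendering of the same reduction can confirm the idea: conv_out lives in a per-lane scalar rather than a __local buffer. The surrounding kernel and channels are omitted, accum_piped is modeled as a 2-D array, and conv_ch_in is replaced by a plain output array, all purely for illustration.

```c
enum { LANE_NUM = 2, PIPE_DEPTH = 4 };   /* illustrative sizes */

/* Per-lane reduction with a scoped scalar accumulator: each lane sums
 * its PIPE_DEPTH partial results into conv_out, which can live in a
 * register instead of on-chip block RAM. */
void reduce_lanes(float accum_piped[LANE_NUM][PIPE_DEPTH],
                  float out[LANE_NUM])
{
    for (unsigned ll = 0; ll < LANE_NUM; ++ll) {
        float conv_out = 0.0f;           /* single scoped variable */
        for (unsigned i = 0; i < PIPE_DEPTH; ++i)
            conv_out += accum_piped[ll][i];
        out[ll] = conv_out;              /* stand-in for conv_ch_in.data */
    }
}
```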
