Double Buffering in OpenCL

SBioo · 05-01-2019

Hi All,

I'm trying to adopt the "double buffering" technique in one of my OpenCL codes. I have seen in a few research papers that double buffering can help boost performance. However, they were all using the SDAccel toolchain by Xilinx. Now I want to do the same on Arria 10 FPGAs (Nallatech P385A), using OpenCL.

Here is the kind of code I have:

__local lane_data win_buffer[2][WIN_BUF_SIZE];

for(unsigned int out_idx_xyz=0; out_idx_xyz<(weight_dim4_div_lane*group_num_y*group_num_x); out_idx_xyz++){

   flag = out_idx_xyz & 0x01; //ping-pong flag

   #pragma ivdep array(win_buffer)
   for(unsigned int win_itm_xyz = 0; win_itm_xyz < item_loop_bound; win_itm_xyz++) {
      ....
      if(win_itm_z<weight_dim3/VEC_SIZE){
         .....
         win_buffer[(~flag)&0x01][win_itm_z*win_size_y*win_size_x + win_itm_y*win_size_x + win_itm_x] = data_vec;
         .....
      }

      if(gp_num_x*CONV_GP_SIZE_X+gp_item_idx_x<conv_x){
         ......
         data_vec = win_buffer[flag][output_idx_dim3*win_size_y*win_size_x + output_idx_dim2*win_size_x + (output_idx_dim1+gp_item_idx_x*stride)];
         ......
      }
      ......
   }
}

As you can see, we have a `win_buffer` which should act as a double buffer. Unfortunately, the compiler reports the load and store to this buffer as a dependency carried by the outer loop. I'm really not sure how to instruct the compiler to infer a double buffer for `win_buffer`.

Does anyone have any specific advice on this issue?

If the OpenCL compiler is not mature enough for this, should I split my kernel into two kernels and implement the double buffering manually?

Thanks

HRZ · 05-01-2019

Unless the bound of the inner loop is known at compile time, the compiler cannot be expected to resolve the dependency for the outer loop. Note that the compiler is not falsely detecting a dependency between accesses to win_buffer[0][x] and win_buffer[1][y] here. Rather, since the inner loop could be long enough for more than two iterations of the outer loop to be in flight at the same time, it is detecting a true dependency between accesses to the same half, win_buffer[0][x], made in iterations i and i+2 of the outer loop.
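
To make the i versus i+2 point concrete, here is a minimal host-side C model of the ping-pong indexing from the original post. It is not kernel code, and the function names (read_half, write_half, same_half_written) are hypothetical helpers introduced only for illustration. It just shows which half of win_buffer each outer iteration touches under flag = out_idx_xyz & 0x01.

```c
#include <stdbool.h>

/* Half of win_buffer that outer iteration `iter` reads: win_buffer[flag][...] */
static unsigned read_half(unsigned iter)
{
    return iter & 0x01u;
}

/* Half that the same iteration writes: win_buffer[(~flag) & 0x01][...] */
static unsigned write_half(unsigned iter)
{
    return (~(iter & 0x01u)) & 0x01u;
}

/* True when outer iterations a and b store into the same half, so they
   cannot safely overlap unless one has fully finished. */
static bool same_half_written(unsigned a, unsigned b)
{
    return write_half(a) == write_half(b);
}
```

Under this model, iteration 0 writes half 1, iteration 1 reads half 1 and writes half 0, and iteration 2 writes half 1 again. So same_half_written(0, 2) is true, which is the i and i+2 dependency described above, while same_half_written(0, 1) is false.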

SBioo · 05-16-2019

Interesting. Thanks for the help.

ausz · 05-16-2019

You've set up double buffering correctly on the inner loop, which should achieve II=1. But the compiler is right to say the outer loop cannot be pipelined. To execute two or more outer loop iterations in (pipeline) parallel would mean you were reading and writing to the same half of your double buffer simultaneously.

Is it bad for performance that the outer loop isn't pipelined? Maybe. The answer depends on the latency (let's say L) and iteration count (let's say N) of your inner loop. If you picture the occupancy of your loop as a function of time this is easy to see. If N >> L, your pipeline will be saturated most of the time. But if L >> N, your pipeline occupancy will never exceed N. In the first situation, outer loop pipelining would give you only an incremental performance increase. In the latter situation, outer loop pipelining would be essential to get satisfactory performance.
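
As a rough way to put numbers on this, here is a first-order cycle model in plain C. It is only a sketch under stated assumptions: the inner loop runs at II=1, a non-pipelined outer loop lets the inner pipeline drain (L cycles) after each of its M iterations, and a pipelined outer loop pays the latency only once. M and the function names are hypothetical and not taken from any compiler report.

```c
/* Approximate cycles for M outer iterations when the inner pipeline drains
   between outer iterations (outer loop NOT pipelined). Assumes II = 1. */
static unsigned long cycles_outer_serial(unsigned long M, unsigned long N,
                                         unsigned long L)
{
    return M * (N + L);
}

/* Approximate cycles when outer iterations overlap back to back
   (outer loop pipelined): the latency L is paid only once. */
static unsigned long cycles_outer_pipelined(unsigned long M, unsigned long N,
                                            unsigned long L)
{
    return M * N + L;
}

/* Ratio of the two estimates: how much outer-loop pipelining could help. */
static double outer_pipelining_speedup(unsigned long M, unsigned long N,
                                       unsigned long L)
{
    return (double)cycles_outer_serial(M, N, L) /
           (double)cycles_outer_pipelined(M, N, L);
}
```

With M = 100, N = 10000 and L = 100, the model gives 1,010,000 versus 1,000,100 cycles, a difference of about 1%. With M = 100, N = 10 and L = 1000, it gives 101,000 versus 2,000 cycles, roughly a 50x difference. Real designs will deviate from this, but it matches the N >> L and L >> N cases above.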

Is there a better way to write it? Yes. I made an assumption that you can refactor your code as below, though I can't be 100% sure from your snippet that it works in your situation. The key is that OpenCL can handle double- or multi-buffering for you automatically. Here's a code sketch:

for( ... ){  // The outer loop will be pipelined
   lane_data win_buffer[WIN_BUF_SIZE];  // This array will be automatically multi-buffered to support concurrent outer loop iterations
   // This shows up in the reports as "private copies"
 
   for(unsigned int win_itm_xyz = 0; win_itm_xyz < item_loop_bound; win_itm_xyz++) {
      ....
      if(win_itm_z<weight_dim3/VEC_SIZE){
         .....
         win_buffer[win_itm_z*win_size_y*win_size_x + win_itm_y*win_size_x + win_itm_x] = data_vec;
         .....
      }
   }
   for(unsigned int win_itm_xyz = 0; win_itm_xyz < item_loop_bound; win_itm_xyz++) {
      .... 
      if(gp_num_x*CONV_GP_SIZE_X+gp_item_idx_x<conv_x){
         ......
         data_vec = win_buffer[output_idx_dim3*win_size_y*win_size_x + output_idx_dim2*win_size_x + (output_idx_dim1+gp_item_idx_x*stride)];  // no [flag] index: win_buffer is now a one-dimensional, per-iteration array
         ......
      }
      ......
   }
}
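
For checking the refactoring functionally before going to hardware, the same fill-then-consume structure can be modeled in ordinary C. This is a minimal sketch, not the kernel itself: process_tiles, the int element type, the tile layout of the input and the small WIN_BUF_SIZE value are all hypothetical stand-ins. It only illustrates that each outer iteration reads nothing but what it wrote itself, which is the property that removes the dependency between outer iterations.

```c
#include <stddef.h>

#define WIN_BUF_SIZE 8  /* hypothetical size, chosen only for illustration */

/* Processes `tiles` consecutive tiles of WIN_BUF_SIZE elements from `in`.
   Each outer iteration fills its own buffer, then consumes it. */
static long process_tiles(const int *in, size_t tiles)
{
    long total = 0;

    for (size_t t = 0; t < tiles; t++) {
        /* Declared inside the outer loop body: logically a fresh buffer
           per outer iteration, with no ping-pong flag needed. */
        int win_buffer[WIN_BUF_SIZE];

        /* Phase 1: fill the buffer for this iteration. */
        for (size_t i = 0; i < WIN_BUF_SIZE; i++)
            win_buffer[i] = in[t * WIN_BUF_SIZE + i];

        /* Phase 2: consume only this iteration's data. */
        for (size_t i = 0; i < WIN_BUF_SIZE; i++)
            total += win_buffer[i];
    }
    return total;
}
```

One difference from the original ping-pong code is worth keeping in mind: there, iteration i reads what iteration i-1 stored, whereas here each iteration reads its own data. Whether that reordering is acceptable depends on the parts of the kernel elided in the snippet.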