What is the expected behavior when the same kernel is called multiple times on an FPGA, and its input buffer is also its output buffer?

Suppose I have a vector increment kernel that I call K consecutive times. Each kernel call launches only a single work-group of some dimension N. How does the FPGA board behave?

A - Does it run the calls fully pipelined, i.e. will the first work-item of the (i+1)-th call be pipelined with the last work-item of the i-th call?

B - Will the i-th call completely finish before the (i+1)-th call starts?

This case is trivial: I can always add K to the vector instead of calling the increment kernel K times. But consider an FFT, where I must choose between unrolling all the stages in a single kernel, which means calling barrier(CLK_LOCAL_MEM_FENCE) several times and reduces kernel performance, and calling several radix-n kernels. If hypothesis B holds, the former strategy might be better, but if A holds, the latter should deliver greater performance. Which one should I expect?
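To make the increment scenario concrete, here is a minimal sketch in plain C rather than OpenCL. It only models the data dependency between calls, not any FPGA scheduling. All function names are hypothetical, and the loop over `gid` stands in for the N work-items of a single work-group. The K-call version runs each call to completion before the next one starts, which is the ordering assumed in hypothesis B.

```c
#include <stddef.h>

/* Hypothetical stand-in for one work-item of the increment kernel.
 * In an OpenCL kernel, gid would come from get_global_id(0). */
static void increment_work_item(int *buf, size_t gid)
{
    buf[gid] += 1; /* input and output are the same buffer */
}

/* Models one kernel call: a single work-group of n work-items. */
void launch_increment(int *buf, size_t n)
{
    for (size_t gid = 0; gid < n; ++gid)
        increment_work_item(buf, gid);
}

/* K consecutive calls on the same buffer. Call i+1 reads what call i
 * wrote, and here each call finishes before the next one begins
 * (the ordering of hypothesis B). */
void increment_k_calls(int *buf, size_t n, unsigned k)
{
    for (unsigned call = 0; call < k; ++call)
        launch_increment(buf, n);
}

/* Fused alternative for the trivial case: one call that adds K. */
void add_k_single_call(int *buf, size_t n, int k)
{
    for (size_t gid = 0; gid < n; ++gid)
        buf[gid] += k;
}
```

For the increment case both functions leave the buffer in the same state, which is why that case is trivial. The FFT case differs because a stage generally needs results written by other work-items in the previous stage, so the stages cannot be collapsed into a single arithmetic step in the same way.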