Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)

About Loop Pipelining

Altera_Forum
Honored Contributor II
1,934 Views

Hello everyone, 

 

I am wondering whether the AOC compiler automatically infers loop pipelining for non-task, multi-threaded (NDRange) kernels as well. In addition, does the "shift register inference" optimization for task kernels also work for multi-threaded kernels? 

 

Regarding inline functions in an OpenCL kernel: is it better to pass arguments by value or by reference? Will passing by value cause the compiler to generate extra registers to hold the values, and will passing by reference help save registers? 

 

Thanks!
0 Kudos
9 Replies
Altera_Forum
Honored Contributor II
1,065 Views

 

--- Quote Start ---  

Hello everyone, 

 

I am wondering whether the AOC compiler automatically infers loop pipelining for non-task, multi-threaded (NDRange) kernels as well. In addition, does the "shift register inference" optimization for task kernels also work for multi-threaded kernels? 

 

Regarding inline functions in an OpenCL kernel: is it better to pass arguments by value or by reference? Will passing by value cause the compiler to generate extra registers to hold the values, and will passing by reference help save registers? 

 

Thanks! 

--- Quote End ---  

 

 

Loop pipelining and shift register inference are for task kernels only.  

 

Passing arguments by value or by reference is expected to result in the same hardware.
Altera_Forum
Honored Contributor II
1,065 Views

Thank you for the reply! 

 

Is it possible to manually pipeline a loop, though? Or is this not worth it, since the work-items are already executed in a pipelined fashion?
Altera_Forum
Honored Contributor II
1,065 Views

Loops are pipelined in both task and NDRange kernels. The difference is that in a task kernel, the datapath of the loop will contain multiple "iterations" in flight, whereas in an NDRange kernel, multiple work-items will be in flight. So, if there are no stalls, the loop datapath should be fully utilized by the work-items. For example, if the loop body takes 100 cycles in total, ideally 100 work-items should be in flight inside the loop.

Altera_Forum
Honored Contributor II
1,065 Views

Thanks! So basically the "depth" of the pipeline does not depend on how many loop iterations there are or on the number of work-items in a work-group, but instead depends purely on how many cycles the loop body takes to execute?  

 

What if the loop body contains slow operations, such as floating-point division, alongside much faster operations such as addition/subtraction? Will the pipeline run at the speed of the slowest operation? Does AOC try to balance the pipeline stages through -fp-relaxed, or should the balancing be done manually?  

 

In addition, sorry to sidetrack this thread, but I am also wondering whether AOC can generate fused multiply-add units like GPUs do, or is floating-point addition always implemented with only LUTs and registers?
Altera_Forum
Honored Contributor II
1,065 Views

 

--- Quote Start ---  

Thanks! So basically the "depth" of the pipeline does not depend on how many loop iterations there are or on the number of work-items in a work-group, but instead depends purely on how many cycles the loop body takes to execute?  

 

What if the loop body contains slow operations, such as floating-point division, alongside much faster operations such as addition/subtraction? Will the pipeline run at the speed of the slowest operation? Does AOC try to balance the pipeline stages through -fp-relaxed, or should the balancing be done manually?  

 

In addition, sorry to sidetrack this thread, but I am also wondering whether AOC can generate fused multiply-add units like GPUs do, or is floating-point addition always implemented with only LUTs and registers? 

--- Quote End ---  

 

 

Yes, the depth of the pipeline does not depend on the number of iterations (unless you unroll the loop), but mostly on the latency of the instructions. However, the compiler sometimes adjusts the depth according to the number of work-items to further optimize the pipeline. 

 

Pipeline balancing is done automatically by the compiler, so "slow" operations can execute in parallel with "fast" operations. -fp-relaxed is just an additional flag that tells the compiler it can reorder floating-point operations for further balancing.  

 

With OpenCL, stalls are the main concern, because throughput is achieved via work-items. If the pipeline of your kernel is N cycles deep but there are no stalls, one work-item will enter and one will exit the pipeline every cycle. The latency of the operations is not a big concern unless they stall the pipeline. 

 

FPGA "instructions" are different from GPU instructions. A multiply and an add can be done in a single cycle on the FPGA, but this affects the frequency of your circuit. The compiler tries to optimize for maximum frequency, so it may or may not choose to break up the multiply and the add, depending on these optimizations. You do not need to worry about the efficiency of your multiplies and adds.
Altera_Forum
Honored Contributor II
1,065 Views

Thank you for the clarification!

Altera_Forum
Honored Contributor II
1,065 Views

--- Quote Start ---  

Loops are pipelined in both task and NDRange kernels. The difference is that in a task kernel, the datapath of the loop will contain multiple "iterations" in flight, whereas in an NDRange kernel, multiple work-items will be in flight. So, if there are no stalls, the loop datapath should be fully utilized by the work-items. For example, if the loop body takes 100 cycles in total, ideally 100 work-items should be in flight inside the loop. 

--- Quote End ---  

 

Thanks for the explanation. What if an NDRange kernel contains a loop with an unknown loop bound? In that case, how is the loop pipelined to keep multiple iterations in flight? Is there a default loop unroll factor used to generate the loop pipeline? 

 

Altera_Forum
Honored Contributor II
1,065 Views

The same question also applies to an NDRange kernel that contains a loop with a high loop bound. In such a case, how is the kernel synthesized into hardware? Thanks 

 

 

--- Quote Start ---  

Loops are pipelined in both task and NDRange kernels. The difference is that in a task kernel, the datapath of the loop will contain multiple "iterations" in flight, whereas in an NDRange kernel, multiple work-items will be in flight. So, if there are no stalls, the loop datapath should be fully utilized by the work-items. For example, if the loop body takes 100 cycles in total, ideally 100 work-items should be in flight inside the loop. 

 

Thanks for the explanation. What if an NDRange kernel contains a loop with an unknown loop bound? In that case, how is the loop pipelined to keep multiple iterations in flight? Is there a default loop unroll factor used to generate the loop pipeline? 

 

 

--- Quote End ---  

Altera_Forum
Honored Contributor II
1,065 Views

I think it's not based on unroll factors but instead on how many operations are in the loop body. For example, if you have 5 operations in the loop body, the pipeline will have 5 stages. When you unroll the loop, other optimizations may apply; for example, you may get a reduction tree for things like sum += b[i];, but if there are dependencies, a very long and inefficient pipeline may be generated. 

 

Disclaimer: I do not work for Altera, so this answer is compiled from my understanding of the reply from Outku and the AOCL Optimization Guide.