Hello,
I noticed something strange with the offline compiler. In the two code versions below, the same operations are performed (since the #pragma directive fully unrolls the loops), except that in the second version the accumulations are "packed" into a single expression (from a syntactic point of view). I found that the second version is more efficient in terms of estimated throughput and logic blocks, according to aoc. In the context of my kernel, I get +5% on throughput and -2% of logic blocks used; it's a complex kernel, so I'm guessing the difference could be even more significant on smaller kernels. I would have thought the compiler was able to unroll the MACs in the most efficient way... Am I missing something? Thanks
#pragma unroll 7
for (int j = -3; j < 4; j++) {
    #pragma unroll 7
    for (int i = -3; i < 4; i++) {
        temp += L * coeffs;
    }
}
temp = L * coeffs + L * coeffs + L * coeffs
+ L * coeffs + L * coeffs + L * coeffs
+ L * coeffs + L * coeffs + L * coeffs
+ L * coeffs + L * coeffs + L * coeffs
+ L * coeffs + L * coeffs + L * coeffs
+ L * coeffs + L * coeffs + L * coeffs
+ L * coeffs + L * coeffs + L * coeffs
+ L * coeffs + L * coeffs + L * coeffs
+ L * coeffs + L * coeffs + L * coeffs
+ L * coeffs + L * coeffs + L * coeffs
+ L * coeffs + L * coeffs + L * coeffs
+ L * coeffs + L * coeffs + L * coeffs
+ L * coeffs + L * coeffs + L * coeffs
+ L * coeffs + L * coeffs + L * coeffs
+ L * coeffs + L * coeffs + L * coeffs
+ L * coeffs + L * coeffs + L * coeffs
+ L * coeffs;
btw, badOmen, if you read this: I can't reply to your PM about my other topic because I need 10 posts. Getting close to that :)
I suspect that in the case of the automatically unrolled loop, the additions become a cascaded chain, where each product is added to the running total from the previous line. When you manually unrolled it, you probably ended up with more of a binary-tree structure for how all the subterms were added together, which might account for the smaller footprint. As for the throughput speedup, I'm not sure why that is without seeing the surrounding code, but it's possible the pipeline depth is shorter in the manually unrolled case and that's contributing to it.
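If that is what's happening, one way to test it would be to force the tree shape yourself and compare the aoc reports. A minimal sketch, assuming the 7x7 window is flattened into hypothetical 49-element arrays L[] and coeffs[] of type float (the real indexing and data type in your kernel may differ):

// Multiply first, then reduce the 49 products pairwise so the additions
// form a balanced tree (about 6 adder levels) instead of a 49-deep chain.
float products[49];
#pragma unroll
for (int k = 0; k < 49; k++) {
    products[k] = L[k] * coeffs[k];
}
#pragma unroll
for (int stride = 1; stride < 49; stride *= 2) {
    #pragma unroll
    for (int k = 0; k + stride < 49; k += 2 * stride) {
        products[k] += products[k + stride];
    }
}
temp = products[0];

If this version comes out close to your hand-packed expression in the report, that would support the chain-vs-tree explanation. Keep in mind that reassociating floating-point additions like this can change the result slightly compared with the original chained sum.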
I bumped your post count to 15, so hopefully that lets you use the forum like everyone else.