I want to know how the branches in loop body affect the performance when unrolling. So I do some tests. Code is in the attachment.
Compiler: icc 15.0.3; options: -O3; platform: Ivy Bridge I5 3337U and Sandy Bridge E5-2670.
I use the #pragma unroll(n) to unroll the innermost loop with different unroll facts, such as 2, 4 and 8.
In both platforms, when unroll(8), the execute time increases nearly 100% than unroll(2)!
And I analysis the assembler code of the innermost loop. when unroll(8), the innermost loop body size is 288B, less than L1 I-cache (32KB), and has 58 non-branch instructions+17 condition branches. And when unroll(2), its body size is 71B, has 16 non-branch instructions+ 5 condition branches.
So when unroll(8), it decreases more loop overhead. And after unrolling, loop body is lesser than L1 I-cache, and the number of condition branches is also less than BTB(branch target buffer) capacities which has 4K entries. But why its execute time is more than unroll(2)?
And I also do some tests with gcc 4.9.2, compile with "-O3 -funroll-loops --param max-unrolled-insns=8000 --param max-average-unrolled-insns=8000 --param max-unroll-times=n". max-unroll-times=8 is like #pragma unroll(8) in icc, can control compiler unroll the innermost loop 8 times. With gcc, when unroll 8 times, its execute time still increase nearly 20% than unroll 2 times on Ivy Bridge.
Then I remove the if statements from the innermost loop. Then unroll(8), execute time decreases nearly 10% than unroll(2) with icc on Ivy Bridge, and decrease 4% with gcc on Ivy Bridge.
I think the branches in loop body affect the performance in some way when unrolling, but don't know the reasons.
Any ideas as to how the branches in loop body affect the performance when unrolling would be much appreciated, And please forgive me of my poor English expressions.
Have you overlooked Loop Stream Detector and micro-op cacheing, the two methods offered by the hardware to accelerate loops by avoiding repeated instruction decoding? There are limits on the number of branches which Loop Stream Detector can accommodate, as well as on the size of code either of those methods can handle. Loop Stream Detector may be helped by the code alignment instructions which gcc sets for gnu assembler, as well as by gcc's optimization of generated jumps.
In my experience, unroll by 4 is generally useful for floating point applications, while unrolling beyond that frequently reduces performance. The vectorized remainder feature of current Intel compilers removed many of the performance deficits incurred by unrolling moderate length loops.
icc relies more on vectorization for loop optimization than gcc does, so it may not be surprising when the latter gains more from limited unrolling of loops with branches.
In this case, Intel compiler predicts a huge loss in performance with vectorization (which gcc doesn't attempt).
Intel publicity often considers the optimizations in gcc irrelevant since they require such a verbose command line (I don't see a need for the unrolled-insns limit when setting a sane max-unroll-times).
Besides overflowing LSD and uop cache, and (as you mentioned) BTB and possibly I-cache, excessive unrolling increases loop startup overhead. It's used primarily to help keep pipelines full when using longer latency operations, which aren't present (except for memory operations) in your case.
Thanks a lot for your advices -;)
I use PAPI tools to measure uops issued by LSD(Loop Stream Detector). I find than when unroll the inner-most loop 2 times, the LSD works well. but after unroll 8 times, the LSD doesn't work anymore. The decreased loop overhead can not remove the performance deficits incurred by LSD out of work when unroll(8) compared with unroll(2). I think this is the reason why performance degrades while unrolling 8 times. Is this right??
When use icc to compile this code, icc do not perform vectorization on this code. And strangely, icc remove the loop overhead (such as loop branches) only when unroll(2), unroll(4), unroll(8). In other case, such as unroll(5) and unroll(16), icc just copy the original loop body, and never remove the useless loop branches of loop body copies. The original loop body have 3 branches , one of them is loop branch. When unroll 8 times use icc, the unrolled loop body have 17 branches. But when unroll 16 times , the unrolled loop body have 49 branches, icc do not remove extra loop branches.
Compare gcc with icc, when unroll times <= 4, the code compiled by icc has higher performance than code compiled gcc. But when unroll times more than 4, gcc does between than icc. I thank this is because the loop body that generated by icc contains more instructions than gcc, means that icc does not do a good optimization on the unrolled loop body. It does not relate to vectorization. Maybe I have misunderstood you opinions?
I wasn't aware that PAPI would be effective in verifying the action of LSD, but this seems to confirm what I expected.
I see from current Intel documentation that LSD can work only when a loop fits in micro-op cache, but there are several additional restrictive conditions. My experience seems to indicate that unrolling a loop beyond where LSD works isn't as bad as it was before the micro-op cache was introduced, with the net result that unroll by 4 often works well regardless of whether it is within or beyond the limits of LSD.
I think we agree that gcc is making more effort than icc to fit loops with branches into LSD, by minimizing the number of branches and setting conditional code alignments. These measures are not as critical on the current CPUs as they used to be, presumably because the LSD capacity has increased, and the micro-op cache avoids abrupt loss of performance when missing LSD.
The case you posted will vectorize with icc if #pragma simd and AVX2 are set, but this does not appear to be useful. If icc considers vectorization, it will usually avoid it on the ground of "seems inefficient" if static analysis shows it is too slow. #pragma simd or #pragma vector, or CEAN syntax, will cause icc to suspend the efficiency analysis. Only in the most recent compilers did CEAN consistently suspend efficiency analysis, although I got a response from Intel that it was always intended to do so.
As you're probably aware, vectorization of cases with branches is generally done by speculative evaluation of all branches, followed by blend merging of the various results. Sometimes this involves masked store, which is a relatively slow instruction. As the blend instructions aren't in the default instruction set, the compiler has to be set to a newer instruction set.
I haven't seen any case with branches where unroll by more than 4 is useful with either icc or gcc on current CPUs. The aggressiveness of gcc -funroll-loops without max-unroll-times seems to be a relic of CPUs before Nehalem where Loop Stream Detector worked with a much more limited number of cases.