In 64-ia-32-architectures-optimization-manual.pdf, at the tail end of the loop stream detector section, it says
" loop unrolling is generally preferable for performance even when it overflows the LSD capability"
My experience doesn't support this statement consistently, but it looks like some efforts have been made to help out in recent compilers.
Recent revisions of the Intel compilers make more use of unconditional 16-byte alignment of loop bodies, which seems to allow more aggressive use of the unroll4 compile option without ruining the performance of loops which barely fit within loop stream detection when unrolled. Unroll4 seems particularly likely to help loops containing vgather instructions. Greater unrolling has also been helped by the recent introduction of vectorized remainder loops, and by fixes for the cases where the remainder loop was always used even for large loop counts. Note that the new Advisor warns about cases where significant time is spent in the remainder loop.
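To make the unroll-by-4 and remainder-loop terminology above concrete, here is a hand-written sketch in C. It is not compiler output; the function name saxpy_unroll4 and the saxpy kernel are just assumptions chosen for illustration. A compiler would normally vectorize the main loop and, in recent versions, possibly the remainder loop as well.

```c
#include <stddef.h>

/* y[i] += a * x[i], with the main loop unrolled by 4 by hand.
   Hypothetical example; a compiler generates the equivalent internally. */
void saxpy_unroll4(size_t n, float a, const float *x, float *y)
{
    size_t i = 0;
    size_t n4 = n - (n % 4);   /* largest multiple of 4 not exceeding n */

    /* Main loop: four independent updates per trip. */
    for (; i < n4; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }

    /* Remainder loop: at most 3 iterations. If a build spends much of its
       time here, the trip counts are small relative to the unroll factor. */
    for (; i < n; ++i)
        y[i] += a * x[i];
}
```

With n = 7 the main loop runs once (4 elements) and the remainder loop handles the last 3.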
gcc takes a different approach, using .p2align 4,,10 directives. The p2 in the directive name means the alignment is given as a power of 2; the conditional padding follows the original Intel recommendation of conditional alignment for Pentium 2. 4 means 16-byte alignment (log base 2 of 16). 10 is what gcc calls the max_skip factor, the limit on how many bytes of padding will be inserted to align the top of a loop body; if more padding than that would be needed, no alignment is done. As a result, gcc doesn't use up as much space for loop padding as current Intel compilers do, but unrolling by more than 2 frequently hurts performance, in part because the code size expands more with additional unrolling, particularly if max_skip is increased by rebuilding the compiler. I suppose there is a trade-off between code size and icache miss rate on one side and peak loop performance on the other. Of course, vectorization per se is a much bigger factor there than this alignment question.
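As a rough illustration of the max_skip rule described above, the following C sketch computes how many padding bytes a directive of the form .p2align log2_align,,max_skip would insert at a given address. The helper name p2align_padding is hypothetical; it only models the assembler's documented behavior and is not part of gcc or binutils.

```c
#include <stdint.h>

/* Padding bytes inserted by ".p2align log2_align,,max_skip" at address addr.
   Returns 0 if addr is already aligned, or if reaching the next boundary
   would need more than max_skip bytes (the alignment is then skipped). */
unsigned p2align_padding(uintptr_t addr, unsigned log2_align, unsigned max_skip)
{
    uintptr_t align = (uintptr_t)1 << log2_align;   /* 4 -> 16 bytes */
    unsigned needed = (unsigned)((align - (addr & (align - 1))) & (align - 1));
    return needed <= max_skip ? needed : 0;
}
```

For example, with .p2align 4,,10 a loop top at offset 6 within a 16-byte block gets 10 bytes of padding, while one at offset 5 would need 11 bytes and is therefore left unaligned.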
gfortran doesn't have the degree of support for 32-byte data alignment which ifort has, and this affects these performance comparisons. It's not a problem for loops which assign to a single data stream, where loop peeling takes care of it. Still there are strange cases where Intel compilers work better with loops initially split and later fused during compilation (as well as cases where fusion incurs misalignment).
After some more testing on a Haswell laptop, I concluded that raising the gfortran max_skip factor to 12 (by editing i386.c and rebuilding gcc/gfortran/g++) is marginally better than 10, even at max-unroll-times=2. I don't know whether this means alignment may be required to engage loop stream detection.
>>>but unrolling by more than 2 frequently hurts performance, in part because the code size expands more >>>
Yes, that's true. Moreover, I think that unrolling more aggressively than 2x will cause Port 0 and Port 1 contention, for example when unrolling floating-point arithmetic code.
Another case where I see performance peak at unroll(2) has a sequential dependency preventing vectorization in a first 2D parallel loop, for which Intel compilers don't set an alignment, while the second (1D) loop is set to run in parallel with the last parallel chunk of the 2D one by means of omp nowait and omp single.
Superficially, it might seem that the 3-operand form of AVX scalar instructions would facilitate handling a loop-carried dependency without unrolling. Unroll(1) sometimes seems to be the same as unroll(2). Unroll(0) is the setting which avoids unrolling; it relies entirely on hardware bypass to shorten the latency of going all the way to memory and back with the loop-carried dependency, even though the dependency is declared as a local scalar.
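For readers unfamiliar with the pattern, here is a minimal C sketch of a loop-carried dependency held in a local scalar, first without unrolling and then unrolled by 2. The recurrence and the function names are assumptions for illustration only, not the loop from my test case. In both versions each new value of carry depends on the previous one, so unrolling cannot shorten the dependency chain; it can only reduce loop overhead and, ideally, keep carry in a register between the two steps.

```c
#include <stddef.h>

/* x[i] = a[i] + b * x[i-1], with the carried value in a local scalar. */
void recurrence(size_t n, float b, const float *a, float *x)
{
    float carry = 0.0f;               /* loop-carried dependency */
    for (size_t i = 0; i < n; ++i) {
        carry = a[i] + b * carry;
        x[i] = carry;
    }
}

/* Same recurrence, unrolled by 2 by hand. */
void recurrence_unroll2(size_t n, float b, const float *a, float *x)
{
    float carry = 0.0f;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        carry = a[i] + b * carry;     /* step i */
        x[i] = carry;
        carry = a[i + 1] + b * carry; /* step i+1 still waits on step i */
        x[i + 1] = carry;
    }
    if (i < n)                        /* odd trip count: one leftover step */
        x[i] = a[i] + b * carry;
}
```

Comparing the generated assembly of the two versions shows whether the compiler keeps carry in a register or stores and reloads it on every step.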