I just saw that my cases using _mm256_loadu_ps show better performance than _mm_loadu_ps on corei7-4, where the latter was faster on earlier AVX platforms (in part due to the ability of ICL/icc to compile the SSE intrinsic to AVX-128).
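For anyone wanting to try a similar comparison, here is a minimal sketch of the two load widths side by side. It is not one of the actual test cases: the function names `add_128` and `add_256` and the `TARGET_AVX` macro are made up for illustration, and the target attribute is only there so gcc-style compilers accept the 256-bit intrinsics without an AVX command-line option.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* Hypothetical helper macro: lets gcc-compatible compilers build the
   256-bit function even when AVX is not enabled on the command line. */
#if defined(__GNUC__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* c[i] = a[i] + b[i], 4 floats per iteration with 128-bit unaligned loads.
   Built with an AVX option, compilers may emit this as AVX-128. */
void add_128(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;
    assert(n == 0 || (a && b && c));
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
    for (; i < n; ++i)              /* scalar remainder */
        c[i] = a[i] + b[i];
}

/* Same operation, 8 floats per iteration with 256-bit unaligned loads. */
TARGET_AVX
void add_256(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;
    assert(n == 0 || (a && b && c));
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i)              /* scalar remainder */
        c[i] = a[i] + b[i];
}
```

Timing both on arrays that are deliberately not 32-byte aligned should show which load width a given CPU prefers. The sketch needs an x86 CPU with AVX to run the 256-bit version.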
Does this mean that advice to consider AVX-128 will soon be of only historical value? I'm ready to designate my Westmere and corei7 linux boxes as historic vehicles.
icc/ICL 14.0.1 apparently corrected the behavior (dating from the introduction of CEAN) in which run-time versioning based on vector alignment never took the (AVX-256) vector branch in certain cases where CEAN notation produced effective AVX-128 code. It now seems that C code can match the performance of CEAN if equivalent pragmas are applied.
A key to getting an advantage for AVX-256 on corei7-4 appears to be trying reduced unroll. In my observation, ICL/icc don't apply automatic unrolling to loops with intrinsics, while gcc does. When not using intrinsics with ICL, I found the option 'ICL -Qunroll2' helpful. ICL used to unroll insufficiently; now it tends to unroll excessively by default for corei7-4, though the default is probably OK for earlier CPUs.
gcc equivalent: '-funroll-loops --param max-unroll-times=2'
Hoping to use last year's "VecAnalysis Python Script..." to see differences between CEAN and C with pragmas:
icl -O3 -Qstd=c99 -Qopenmp -Qansi_alias -QxHost -Qunroll2 -Zi -Qvec-report7 -c loopsv.c 2>&1 | ../vecanalysis/vecanalysis.py --annotate
This reports, for one of the cases where CEAN vectorizes, 1 heavy-overhead vector operation [due to variable stride] and 4 lightweight ones, and reports the C code as not vectorized (yet the C code performs better, up to loop count 1000).
Using vecanalysis.py, I found cases where AVX2 compilation chooses not to peel for alignment, where SSE4 compilation performs peeling. So the Intel compiler is expecting efficient operation on the corei7-4 without alignment peeling.
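For readers unfamiliar with the term, "peeling for alignment" can be sketched by hand as below. This is only an illustration of the loop structure, not what the compiler actually emits. The names `peel_count` and `scale_peeled` are hypothetical, and the 32-byte boundary is an assumption matching the AVX-256 register width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of leading elements to handle before p reaches a 32-byte
   boundary (capped at n). */
size_t peel_count(const float *p, size_t n)
{
    uintptr_t addr = (uintptr_t)p;
    size_t misalign = (size_t)(addr % 32u);
    size_t peel;
    assert(addr % sizeof(float) == 0);   /* p must be float-aligned */
    peel = misalign ? (32u - misalign) / sizeof(float) : 0;
    return peel < n ? peel : n;
}

/* x[i] *= a, split into peel loop, aligned main loop, and remainder. */
void scale_peeled(float *x, float a, size_t n)
{
    size_t peel = peel_count(x, n);
    size_t i = 0;

    for (; i < peel; ++i)                /* peel loop: scalar, unaligned */
        x[i] *= a;

    for (; i + 8 <= n; i += 8)           /* main loop: x + i is 32-byte aligned */
        for (size_t j = 0; j < 8; ++j)
            x[i + j] *= a;

    for (; i < n; ++i)                   /* remainder loop */
        x[i] *= a;
}
```

When the compiler skips peeling, it in effect drops the first loop and relies on unaligned loads and stores being cheap in the main loop.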
I don't think this is enough for VS2012 vectorization to be considered competitive, but it may help those who still use that compiler or the Oracle ones.
In a recent case where I removed manual unrolling (which did not conflict with optimizations performed by Intel compilers), performance improved by 50% with gnu compilers, with no loss for Intel compilers, except for a small loss on Intel(r) Xeon Phi(tm).
In version 9.1, Intel compilers had a full re-roll optimization so as to remove manual unrolling which would conflict with the compiler's own optimization. I never heard why this was dropped. It looks like you have one of those cases where manual unrolling blocks the compiler's optimization.
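To make the distinction concrete, here is a small sketch of a manually unrolled loop next to its rolled form. It is not the code from the case above; `dot_unrolled4` and `dot_rolled` are hypothetical names. The unrolled version fixes the unroll factor and the order of summation in the source, which is the kind of thing that can get in the way of the compiler's own unrolling and vectorization choices.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product, manually unrolled by 4 with separate partial sums. */
float dot_unrolled4(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    assert(n == 0 || (a && b));
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i)                   /* leftover elements */
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}

/* Same dot product with the loop left rolled, so the compiler is free
   to pick its own unroll factor and vector width. */
float dot_rolled(const float *a, const float *b, size_t n)
{
    float s = 0.0f;
    assert(n == 0 || (a && b));
    for (size_t i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}
```

The two versions can round differently on general data because the additions happen in a different order, so compare timings rather than expecting bit-identical results.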
I've begun using the vecanalysis.py script http://software.intel.com/en-us/articles/vecanalysis-python-script-for-annotating-intelr-compiler-vectorization-report although I wish it could suppress useless stuff like the "loop hierarchy" titles. I suppose it's up to us to understand and edit the script. This makes it easier to find missing or misguided compiler optimizations.
There's little point in adding manual unrolling which duplicates the unrolling performed by /Qunroll. I do make a practice of including an unroll directive along with novector directives for those cases where Intel compilers have become too aggressive in vectorization or (as in the case of stride -1) the compiler doesn't find the efficient vector instruction sequence (as development gnu compilers now do).
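A minimal sketch of that practice on a stride -1 loop, assuming a reverse copy as the example (the function name `reverse_copy` is hypothetical). The directives are guarded so that only Intel compilers see them; the unroll factor of 2 is just an illustration, not a recommendation for any particular loop.

```c
#include <assert.h>
#include <stddef.h>

/* dst[i] = src[n-1-i]: the read side runs with stride -1. */
void reverse_copy(float *dst, const float *src, size_t n)
{
    assert(n == 0 || (dst && src));
#if defined(__INTEL_COMPILER)
    /* Intel-only: suppress vectorization of this loop and ask for a
       fixed unroll factor instead. */
#pragma novector
#pragma unroll(2)
#endif
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[n - 1 - i];
}
```

Other compilers skip the guarded block entirely and treat the loop as ordinary C.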
In case you think I'm too bullish on gnu compilers, I have a frequent problem with their failure to vectorize in parallel regions, even when using omp for simd.
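The pattern in question looks roughly like the sketch below (the function name `scale_add` is hypothetical). Whether the inner loop actually gets vectorized has to be checked in the compiler's vectorization report; the code only shows the construct, not a reproduction of the failure.

```c
#include <assert.h>

/* y[i] += a * x[i], work-shared across threads and marked for SIMD. */
void scale_add(float *restrict y, const float *restrict x, float a, int n)
{
    assert(n >= 0);
#pragma omp parallel
    {
#pragma omp for simd
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
    }
}
```

Built without an OpenMP option, the pragmas are ignored and the loop runs serially, which gives a convenient baseline for checking whether the parallel region is what prevents vectorization.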