unaligned loads avx-128 vs. -256

TimP · ‎01-04-2014

I just saw that my cases using _mm256_loadu_ps show better performance than _mm_loadu_ps on corei7-4, where the latter was faster on earlier AVX platforms (in part due to the ability of ICL/icc to compile the SSE intrinsic to AVX-128).
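A minimal sketch of the two load widths being compared, for readers who want to try it. The summation kernel and the names sum_loadu_128, sum_loadu_256 and USE_AVX are hypothetical and are not the poster's test code. The target attribute is a GCC/Clang convenience; with other compilers, build with their AVX option instead (for example /QxAVX for ICL).

```c
#include <immintrin.h>
#include <stddef.h>

/* GCC/Clang: enable AVX per function, so the SSE intrinsic below is also
   emitted in its VEX-encoded (AVX-128) form. Other compilers: assumption
   is that AVX is enabled on the command line. */
#if defined(__GNUC__)
#define USE_AVX __attribute__((target("avx")))
#else
#define USE_AVX
#endif

/* Sum n floats using 128-bit unaligned loads (4 floats per load). */
USE_AVX float sum_loadu_128(const float *x, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(x + i));

    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; ++i)              /* scalar remainder */
        sum += x[i];
    return sum;
}

/* Same kernel using 256-bit unaligned loads (8 floats per load). */
USE_AVX float sum_loadu_256(const float *x, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));

    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float sum = 0.0f;
    for (int k = 0; k < 8; ++k)
        sum += lanes[k];
    for (; i < n; ++i)              /* scalar remainder */
        sum += x[i];
    return sum;
}
```

Timing the two functions on a deliberately misaligned pointer (such as buffer + 1) is one way to check which width wins on a given CPU. The two versions add in a different order, so results can differ in the last bits for general data.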

Does this mean that advice to consider AVX-128 will soon be of only historical value? I'm ready to designate my Westmere and corei7 linux boxes as historic vehicles.

icc/ICL 14.0.1 apparently corrected the behavior (beginning with introduction of CEAN) where run-time versioning based on vector alignment never took the (AVX-256) vector branch in certain cases where CEAN notation produced effective AVX-128 code. It seems now that C code can match performance of CEAN, if equivalent pragmas are applied.

A key to getting an advantage for AVX-256 on corei7-4 appears to be to try reduced unroll. In my observation, ICL/icc don't apply automatic unrolling to loops with intrinsics, while gcc does. When not using intrinsics with ICL, I found the option 'ICL -Qunroll2' helpful. ICL used to unroll insufficiently; now it tends to unroll excessively by default for corei7-4, although the default is probably OK for earlier CPUs.

The gcc equivalent is '-funroll-loops --param max-unroll-times=2'.

Hoping to use last year's "VecAnalysis Python Script..." to see differences between CEAN and C with pragmas:

icl -O3 -Qstd=c99 -Qopenmp -Qansi_alias -QxHost -Qunroll2 -Zi -Qvec-report7 -c loopsv.c 2>&1 | ../vecanalysis/vecanalysis.py --annotate

reports one of the cases where CEAN vectorizes as including 1 heavy-overhead [due to variable stride] and 4 lightweight vector operations, and the C code as not vectorized (but performing better, up to loop count 1000).

TimP · ‎01-06-2014

Using vecanalysis.py, I found cases where AVX2 compilation chooses not to peel for alignment, where SSE4 compilation performs peeling. So the Intel compiler is expecting efficient operation on the corei7-4 without alignment peeling.
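For readers unfamiliar with the term, peeling for alignment can be sketched by hand as below. This is only an illustration of the loop structure, not the code the Intel compiler generates; the names peel_count and scale_peeled, the 32-byte target and the scaling kernel are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of leading floats to handle one at a time so that p + count is
   aligned to `align` bytes. Assumes align is a power of two and p is at
   least float-aligned. Never returns more than n. */
size_t peel_count(const float *p, size_t n, size_t align)
{
    size_t misalign = (size_t)((uintptr_t)p & (align - 1));
    size_t count = misalign ? (align - misalign) / sizeof(float) : 0;
    return count < n ? count : n;
}

/* Multiply n floats by s, split into peel loop, aligned body, remainder. */
void scale_peeled(float *a, size_t n, float s)
{
    size_t i = 0;
    size_t peel = peel_count(a, n, 32);

    for (; i < peel; ++i)           /* peel loop: runs until a + i is aligned */
        a[i] *= s;

    for (; i + 8 <= n; i += 8)      /* body: each block of 8 starts on a 32-byte boundary */
        for (size_t k = 0; k < 8; ++k)
            a[i + k] *= s;

    for (; i < n; ++i)              /* remainder loop */
        a[i] *= s;
}
```

When the compiler skips peeling, as reported for the AVX2 build, the first and second loops collapse into a single body that starts at a[0] and relies on unaligned accesses being cheap.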

I don't think this is enough for VS2012 vectorization to be considered competitive, but it may help those who still use that compiler or the Oracle ones.

SergeyKostrov · ‎01-10-2014

>>...When not using intrinsics with ICL, I found the option 'ICL -Qunroll2' helpful...

I also use automatic unrolling, with the -Qunroll8 and, in some cases, -Qunroll16 options. The C/C++ codes I'm working with are too generic, and Intel intrinsics are not allowed because of software portability constraints.

SergeyKostrov · ‎01-10-2014

>>...it may help those who still use that compiler or the Oracle ones...

What do you mean? Does Oracle have a C/C++ compiler?

TimP · ‎01-10-2014

Oracle compilers for ia linux and sparc are free.

SergeyKostrov · ‎01-13-2014

>>...When not using intrinsics with ICL, I found the option 'ICL -Qunroll2' helpful...

I checked my project settings: most of the time I use the /Qunroll:8 and /Qunroll-aggressive Intel C++ compiler options, and I never try to combine Auto-Unrolling with Manual-Unrolling. This is why:

[ Test-Case 1 - Only /Qunroll:8 and /Qunroll-aggressive options used ]
...
> _TMatrixSet Methods <
Matrix Size: 8192 x 8192
Processing... ( Add - 1D-based )
_TMatrixSetF::Add - Pass 01 - Completed: 550.75000 ticks
_TMatrixSetF::Add - Pass 02 - Completed: 554.75000 ticks
_TMatrixSetF::Add - Pass 03 - Completed: 554.50000 ticks
_TMatrixSetF::Add - Pass 04 - Completed: 551.00000 ticks
_TMatrixSetF::Add - Pass 05 - Completed: 550.75000 ticks
Add - 1D-based - Passed
...

[ Test-Case 2 - /Qunroll:8, /Qunroll-aggressive options and Manual-Unrolling used ]
...
> _TMatrixSet Methods <
Matrix Size: 8192 x 8192
Processing... ( Add - 1D-based )
_TMatrixSetF::Add - Pass 01 - Completed: 3386.75000 ticks
_TMatrixSetF::Add - Pass 02 - Completed: 3351.50000 ticks
_TMatrixSetF::Add - Pass 03 - Completed: 3347.75000 ticks
_TMatrixSetF::Add - Pass 04 - Completed: 3355.50000 ticks
_TMatrixSetF::Add - Pass 05 - Completed: 3339.75000 ticks
...

Performance drops significantly in Test-Case 2, where Auto-Unrolling is combined with Manual-Unrolling: from about 550 ticks to about 3350 ticks per pass, roughly a factor of six.
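The two test cases differ only in whether hand unrolling is layered on top of the compiler's own unrolling. The sketch below shows what the two source forms might look like for a 1D add; it is a hypothetical illustration (the names add_simple and add_unrolled4 and the unroll factor of 4 are assumptions), not the _TMatrixSetF::Add code that was measured.

```c
#include <stddef.h>

/* Plain loop: the compiler is free to vectorize and unroll it as directed
   by options such as /Qunroll:8. */
void add_simple(float *c, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

/* Same operation, manually unrolled by 4 with a remainder loop. Any
   automatic unrolling is then applied on top of this body. */
void add_unrolled4(float *c, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; ++i)              /* remainder */
        c[i] = a[i] + b[i];
}
```

Both functions compute the same result, so comparing them under identical compiler options isolates the effect of the manual unrolling.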

TimP · ‎01-14-2014

In a recent case where I removed manual unrolling (which did not conflict with optimizations performed by Intel compilers), performance improved by 50% with gnu compilers, with no loss for Intel compilers, except for a small loss on Intel(r) Xeon Phi(tm).

In version 9.1, Intel compilers had a full re-roll optimization so as to remove manual unrolling which would conflict with the compiler's own optimization. I never heard why this was dropped. It looks like you have one of those cases where manual unrolling blocks the compiler's optimization.

I've begun using the vecanalysis.py script http://software.intel.com/en-us/articles/vecanalysis-python-script-for-annotating-intelr-compiler-vectorization-report although I wish it could suppress useless stuff like the "loop hierarchy" titles. I suppose it's up to us to understand and edit the script. This makes it easier to find missing or misguided compiler optimizations.

There's little point in adding manual unrolling which duplicates the unrolling performed by /Qunroll. I do make a practice of including an unroll directive along with novector directives for those cases where Intel compilers have become too aggressive in vectorization or (as in the case of stride -1) the compiler doesn't find the efficient vector instruction sequence (as development gnu compilers now do).
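A sketch of that practice on a stride -1 loop, assuming the classic Intel compiler spellings #pragma novector and #pragma unroll(n). The function name reverse_copy and the unroll factor of 4 are made up for the example, and the pragmas are guarded so that other compilers simply see the plain loop.

```c
#include <stddef.h>

/* Copy src into dst in reverse order: the read side has stride -1. */
void reverse_copy(float *dst, const float *src, size_t n)
{
#if defined(__INTEL_COMPILER)
#pragma novector        /* ask the Intel compiler not to vectorize this loop */
#pragma unroll(4)       /* assumption: factor 4 chosen only for illustration */
#endif
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[n - 1 - i];
}
```

Whether the directives help depends on the compiler version and target, so the version without them is worth timing as well.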

In case you think I'm too bullish on gnu compilers, I have a frequent problem with their failure to vectorize in parallel regions, even when using omp for simd.

SergeyKostrov · ‎01-24-2014

>>...There's little point in adding manual unrolling which duplicates the unrolling performed by /Qunroll...

That applies only to recent versions of the Intel C++ compilers; manual unrolling always improves performance with older legacy C++ compilers. Even the Microsoft C++ compilers for embedded platforms, like Windows CE, Windows Mobile Phone, etc., benefit from manual unrolling.

SergeyKostrov · ‎01-24-2014

>>...There's little point in adding manual unrolling which duplicates the unrolling performed by /Qunroll...

Another example is classic matrix multiplication ( based on 3 nested for-loops ): manually unrolling the 2nd for-loop, instead of the 3rd for-loop, also improves performance.
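One possible reading of that, sketched under assumptions: with the loops ordered i, j, k, the 2nd loop (j) is unrolled by 2 and the two copies share one inner k loop. The function names, the unroll factor, the double element type and the row-major square layout are all hypothetical; the original code was not posted.

```c
#include <stddef.h>

/* Reference: c = a * b for n x n row-major matrices, loops ordered i, j, k. */
void matmul_naive(double *c, const double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            double s = 0.0;
            for (size_t k = 0; k < n; ++k)
                s += a[i * n + k] * b[k * n + j];
            c[i * n + j] = s;
        }
}

/* Same product with the 2nd loop (j) unrolled by 2: each a[i][k] is
   loaded once and used for two output columns. */
void matmul_unroll_j2(double *c, const double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        size_t j = 0;
        for (; j + 2 <= n; j += 2) {
            double s0 = 0.0, s1 = 0.0;
            for (size_t k = 0; k < n; ++k) {
                double aik = a[i * n + k];
                s0 += aik * b[k * n + j];
                s1 += aik * b[k * n + j + 1];
            }
            c[i * n + j]     = s0;
            c[i * n + j + 1] = s1;
        }
        for (; j < n; ++j) {        /* leftover column when n is odd */
            double s = 0.0;
            for (size_t k = 0; k < n; ++k)
                s += a[i * n + k] * b[k * n + j];
            c[i * n + j] = s;
        }
    }
}
```

Each output element is still accumulated in the same k order as in the reference, so the two functions produce identical results; only the reuse of a[i][k] changes.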