vectorization of variable stride for loop

TimP · 01-09-2016

I noticed that Intel C++ 16.0 and 16.0.1 improved scalar optimization of variable-stride loops to the point where it matches gcc and MSVC, which makes the default auto-vectorization of such loops unproductive for host AVX, e.g.

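/* parameters other than a and b are inferred from the loop below */
void foo(float * __restrict a, float * b, int *n1, int i2, int i3)
{
    int i;
#if _OPENMP >= 201307
#if ! __MIC__
#pragma omp simd safelen(1)   /* host: keep the loop scalar */
#else
#pragma omp simd              /* MIC: request vectorization */
#endif
#endif
    for (i = *n1; i <= i2; i += i3)
        a[i] += b[i];
}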

yet vectorization for MIC doesn't occur without the pragma.

For MIC, vectorization of this yields > 10x performance gain, while host vectorization increases run time by 50%, yet the compiler's default choices have it backwards.

This looks like a step back in the direction of advocating pragma-controlled vectorization, with the requirement for target specification. I put #if _OPENMP on so as to permit compilation by MSVC, which supports only OpenMP 2.0, in spite of supporting a fair amount of auto-vectorization in recent versions. __restrict becomes irrelevant if it's necessary to set pragmas for each target; maybe that's considered an advantage.

Fortunately, in this case, icc honors safelen(1), so the clause does suppress vectorization of the host loop.

Both icc and gcc exhibit cases where vectorization occurs in violation of safelen, as well as cases where safelen(1) is a convenient portable replacement for #pragma novector. __restrict can be used to control auto-vectorization sometimes, but not always. #pragma novector doesn't appeal to me in cases like this, where MIC needs #pragma omp simd and where non-Intel compilers run into similar issues.
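
To keep the per-target choice in one place, the pragma selection could be hidden behind a macro. What follows is only a rough sketch, not something taken from any compiler documentation: the names STRIDED_LOOP_HINT and add_strided are made up for illustration, and the bounds are passed by value for simplicity. It relies on the C99 _Pragma operator, which is only reached when _OPENMP reports 4.0 or later, so a compiler limited to OpenMP 2.0 should see an empty macro.

```c
/* STRIDED_LOOP_HINT is a hypothetical macro name, not a compiler feature.
 * It expands to a target-specific OpenMP 4.0 simd hint, or to nothing
 * when OpenMP 4.0 is not enabled (e.g. MSVC, or builds without OpenMP). */
#if defined(_OPENMP) && _OPENMP >= 201307
#  if defined(__MIC__)
#    define STRIDED_LOOP_HINT _Pragma("omp simd")            /* MIC: ask for vectorization */
#  else
#    define STRIDED_LOOP_HINT _Pragma("omp simd safelen(1)") /* host: ask to stay scalar */
#  endif
#else
#  define STRIDED_LOOP_HINT
#endif

/* Hypothetical example function: adds b[i] to a[i] for
 * i = first, first + stride, ... while i <= last.
 * Assumes stride > 0 and that a and b do not overlap. */
void add_strided(float * __restrict a, const float *b,
                 int first, int last, int stride)
{
    int i;
    STRIDED_LOOP_HINT
    for (i = first; i <= last; i += stride)
        a[i] += b[i];
}
```

Whether a given compiler version actually respects the hint still has to be checked in the vectorization report; the macro only centralizes the request.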

TimP · 12-19-2016

The Intel C++ 17.1 release has corrected some cases where the safelen(1) clause was ignored. This makes it more suitable as a portable replacement for #pragma novector.