Aliasing concern with variable stride creating temporary and slow vectorization

TimP · 03-08-2014

I found that ifort (and gfortran) create a temporary for the following array assignment:

a(1:n-inc:inc)= a(inc+1:n:inc)+b(1:n-inc:inc)

presumably because of the possibility that inc is less than zero. The result is stored in a stride-1 temporary and then copied to the destination, with all of the resulting loops reported as vectorized.
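A rough sketch of what that evaluate-then-copy behaviour amounts to, written out by hand. The module name stride_temp_sketch, the helper update_with_temp, and the tmp buffer are illustrative assumptions, not what either compiler literally emits, and the sketch assumes inc > 0.

```fortran
module stride_temp_sketch
  implicit none
contains
  ! Hypothetical helper: mimics evaluate-then-copy semantics of
  !   a(1:n-inc:inc) = a(inc+1:n:inc) + b(1:n-inc:inc)
  ! Assumes inc > 0 and that b is at least as long as a.
  subroutine update_with_temp(a, b, inc)
    real, intent(inout) :: a(:)
    real, intent(in)    :: b(:)
    integer, intent(in) :: inc
    real, allocatable   :: tmp(:)
    integer :: n, m, k

    n = size(a)
    m = size(a(1:n-inc:inc))   ! number of elements being assigned
    allocate(tmp(m))

    ! Pass 1: strided loads, unit-stride stores into the temporary.
    do k = 1, m
      tmp(k) = a(1 + k*inc) + b(1 + (k-1)*inc)
    end do

    ! Pass 2: unit-stride loads, strided stores to the destination.
    do k = 1, m
      a(1 + (k-1)*inc) = tmp(k)
    end do
  end subroutine update_with_temp
end module stride_temp_sketch
```

Each pass can be vectorized on its own, but the data makes an extra round trip through memory, which is one plausible reason the reported vectorization does not translate into speed.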

If I write

do i= 1,n-inc,inc
a(i)= a(i+inc)+b(i)
enddo

ifort decides not to vectorize with /QxAVX2. Apparently that's a good decision: adding !dir$ simd to force vectorization with simulated gather-scatter makes it slower, even in the case inc==1 (though not as slow as the array assignment with the temporary).
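For anyone wanting to check that the two forms agree before timing them, here is a minimal sketch that wraps each in a subroutine. The module and subroutine names are made up for illustration, and inc > 0 is assumed.

```fortran
module stride_loop_sketch
  implicit none
contains
  ! Array-section form from the post (the compilers may use a temporary here).
  subroutine update_array_syntax(a, b, n, inc)
    integer, intent(in) :: n, inc
    real, intent(inout) :: a(n)
    real, intent(in)    :: b(n)
    a(1:n-inc:inc) = a(inc+1:n:inc) + b(1:n-inc:inc)
  end subroutine update_array_syntax

  ! Explicit DO loop form from the post. For inc > 0 each iteration reads
  ! a(i+inc), which is only overwritten by a later iteration, so the result
  ! matches the array assignment without needing a temporary.
  subroutine update_loop(a, b, n, inc)
    integer, intent(in) :: n, inc
    real, intent(inout) :: a(n)
    real, intent(in)    :: b(n)
    integer :: i
    do i = 1, n - inc, inc
      a(i) = a(i + inc) + b(i)
    end do
  end subroutine update_loop
end module stride_loop_sketch
```

Calling both routines on copies of the same data and comparing the results element by element is a quick sanity check; the timing differences described above come from how each form is compiled, not from any difference in results.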

Intel's vecanalysis script:

http://software.intel.com/en-us/articles/vecanalysis-python-script-for-annotating-intelr-compiler-vectorization-report

reports heavy-overhead vectorization.

Just one more data point in the continuing question about marginal vectorization.

Steven_L_Intel1 · 03-08-2014

Hi, Tim. That's interesting - is there some action needed on our part, or was this just an observation you wanted to get out there?

TimP · 03-09-2014

I wanted to give an example of the importance of digging up the detailed compiler reports and reading between the lines to see when vectorization isn't the whole answer. It was new to me to see the same temporary array problem in gfortran as in ifort; I used to find such temporaries in ifort by noting when gfortran performed better.

By the way, the Fortran 77 version (the DO loop) with !dir$ simd is the one needed for performance on MIC. Having to put #ifdef __MIC__ around the simd and novector directives detracts somewhat from ifort's value in being able to run the same source code efficiently on host and coprocessor.
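A sketch of what that guard might look like. The module and subroutine names are hypothetical, and pairing novector with the host build is an assumption based on the host timings described earlier in the thread. The file has to go through the preprocessor (for example a .F90 suffix, or ifort's -fpp / gfortran's -cpp option). Compilers that do not recognize the !dir$ lines treat them as ordinary comments.

```fortran
module mic_guard_sketch
  implicit none
contains
  ! Hypothetical wrapper around the strided loop; assumes inc > 0.
  subroutine strided_update(a, b, n, inc)
    integer, intent(in) :: n, inc
    real, intent(inout) :: a(n)
    real, intent(in)    :: b(n)
    integer :: i
#ifdef __MIC__
    ! Coprocessor build: ask the compiler to vectorize the strided loop.
!dir$ simd
#else
    ! Host build (assumption): keep the loop scalar.
!dir$ novector
#endif
    do i = 1, n - inc, inc
      a(i) = a(i + inc) + b(i)
    end do
  end subroutine strided_update
end module mic_guard_sketch
```

The loop body is identical for both targets; only the directive changes, which is exactly the source-level clutter being complained about.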