Intel® Fortran Compiler

Data alignment and vectorization question

mriedman
Novice
The following code example (kernel of a Red-Black SOR solver) is partially vectorized by ifort 11.1.056. While the arithmetic is done with true SIMD instructions (mulpd, subpd), the loads are still single-operand (movsd, movhpd).

Obviously the access pattern for all arrays is perfectly sequential. That should allow vectorization of the loads too.

What could be done to get fully vectorized execution? I assume it's an alignment issue. For the given example, even array copies may be acceptable, as the data will be reused many times (hundreds of iterations). Thanks for any insight into this matter.


[fortran]subroutine rb_row_a(nx, ny, iy, sps, spe, x_in, x_out, a, b, omega, errff)

  implicit none

  integer, intent(in) :: nx
  integer, intent(in) :: ny
  integer, intent(in) :: iy
  integer, intent(in) :: sps
  integer, intent(in) :: spe
  real, dimension(    0:nx,0:ny), intent(in)    :: x_in
  real, dimension(    0:nx,0:ny), intent(inout) :: x_out
  real, dimension(2:5,0:nx,0:ny), intent(in)    :: a
  real, dimension(    0:nx,0:ny), intent(in)    :: b
  real,                           intent(in)    :: omega
  real,                           intent(inout) :: errff

  integer  :: ix
  real     :: c
  real     :: xr
  real     :: xx
  real,    dimension(sps:spe) :: errs

!dec$ ivdep
  do ix = sps, spe
    xx = x_out(ix,iy)
    xr = 1.0 / xx
    c = b(ix,iy) - xx                &
      - a(2,ix,iy) * x_in(ix  ,iy-1) &
      - a(3,ix,iy) * x_in(ix-1,iy  ) &
      - a(4,ix,iy) * x_in(ix  ,iy  ) &
      - a(5,ix,iy) * x_in(ix  ,iy+1)
    c = omega * c
    errs(ix) = abs(c * xr)
    x_out(ix,iy) = xx + c
  end do  
  errff=max(maxval(errs), errff)

  return
end
[/fortran]
1 Solution (TimP's reply below)
12 Replies
mecej4
Honored Contributor III
Have you allowed the compiler to generate vectorization reports? There may be useful information in those reports.

> Obviously the access pattern for all arrays is perfectly sequential.

The vectorizer may have difficulty seeing that statement as applicable to x_in when it sees the second index taking the values iy-1, iy, and iy+1 in the loop.

> I assume it's an alignment issue.

That may be possible. I remember reading that later CPU releases were tweaked to make the penalty for unaligned access less significant. The penalty is quite high on IA64 (at least on the CPU versions I have tried).

Please state your OS version, the host and target CPUs, and specify the compiler flags in effect. I am curious about your questions.
mriedman
Novice
Host and target OS is RH5 on Harpertown CPUs; compiler flags and output:

> ifort -openmp -r8 -O3 -axT -g -S -vec-report rb.f90

rb.f90(32): (col. 5) remark: LOOP WAS VECTORIZED.
rb.f90(36): (col. 13) remark: PARTIAL LOOP WAS VECTORIZED.

So the compiler does consider the loop vectorized even if the loads are not.

The loop did not vectorize at all until I moved the max() operation out of the loop, which certainly costs extra memory accesses to the errs array. Apparently reductions generally inhibit vectorization.
TimP
Honored Contributor III
If you want code specifically targeted to HTN, you will need -xSSE4.1. You asked for code paths targeted to Woodcrest and pre-SSE2 CPUs, so the compiler is correct in splitting the unaligned vectorized loads, as the two 64-bit loads would be faster. In practice, I have found the split unaligned loads faster even on the current 6-core CPUs. You may see an advantage for movups on Nehalem or Istanbul when cache locality is high.
Your chances of vectorizing the max() might be better if you didn't write an explicit inversion; as you have promoted your data to double precision, a vector divide would be faster anyway. You could experiment with !dir$ vector always, which eliminates the compiler's cost analysis.
If you need the IVDEP for vectorization, you should check why that is so.
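For what it's worth, the directive goes immediately before the DO statement; a minimal, self-contained illustration of the placement (not your kernel):

[fortran]subroutine scale(n, x, s)
  implicit none
  integer, intent(in)    :: n
  real,    intent(inout) :: x(n)
  real,    intent(in)    :: s
  integer :: i
!dir$ vector always   ! skip the vectorizer's cost-model check for this loop
  do i = 1, n
    x(i) = s * x(i)
  end do
end subroutine scale[/fortran]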
Steven_L_Intel1
Employee
Version 12.0 says "insufficient work" to vectorize the second loop, and I could not find a way to convince it to do so. An update later this year will vectorize it.
jimdempseyatthecove
Honored Contributor III
Can you experiment with

[fortran]! note: x_in needs the TARGET (or POINTER) attribute for these assignments
real, pointer :: x_in_iy_m1(:), x_in_m1_iy(:), x_in_iy(:), x_in_iy_p1(:)
! F2003 lower-bound remapping keeps the 0-based indexing used below
x_in_iy_m1(0:) => x_in(sps:spe,     iy-1)
x_in_m1_iy(0:) => x_in(sps-1:spe-1, iy  )
x_in_iy(0:)    => x_in(sps:spe,     iy  )
x_in_iy_p1(0:) => x_in(sps:spe,     iy+1)
!dec$ ivdep
do ix = 0, spe-sps
  xx = x_out(ix+sps, iy)
  ...
    - a(2,ix+sps,iy) * x_in_iy_m1(ix) &
    - a(3,ix+sps,iy) * x_in_m1_iy(ix) &
    - a(4,ix+sps,iy) * x_in_iy(ix)    &
    - a(5,ix+sps,iy) * x_in_iy_p1(ix)
  ...
  x_out(ix+sps,iy) = ...
end do[/fortran]

If you do not like using pointers, then construct a subroutine that passes in the array-slice references (1-D slices of the 2-D array). There should be little overhead in constructing the array descriptor. Hopefully no copying of the slice occurs; verify this, as you cannot always tell whether IVF will make a copy of an array slice or not. A rough sketch of that variant follows.
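(row_kernel and its dummy names are hypothetical; explicit-shape dummies receive contiguous 1-D slices, so the compiler sees unit-stride operands:)

[fortran]! hypothetical helper: explicit-shape dummies receive contiguous
! 1-D slices of the 2-D array, so the compiler sees unit-stride,
! independent operands
subroutine row_kernel(n, x_s, x_w, x_c, x_n, out)
  implicit none
  integer, intent(in)  :: n
  real,    intent(in)  :: x_s(n), x_w(n), x_c(n), x_n(n)
  real,    intent(out) :: out(n)
  integer :: i
  do i = 1, n                   ! stand-in for the real stencil body
    out(i) = x_s(i) + x_w(i) + x_c(i) + x_n(i)
  end do
end subroutine row_kernel[/fortran]

The caller would pass, e.g., x_in(sps:spe,iy-1) for x_s; since the slices are contiguous in the first dimension, no copy-in/copy-out should occur (worth verifying, as noted above).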

Jim Dempsey
Steven_L_Intel1
Employee
It's the assignment to errff that is not vectorized here. The main loop is vectorized.
jimdempseyatthecove
Honored Contributor III
Then (at least for the Release build) remove the local

real, dimension(sps:spe) :: errs

replace

errs(ix) = abs(c * xr)

with

errff = max(abs(c * xr), errff)

and remove the

errff = max(maxval(errs), errff)

If you need the errs array for debugging purposes, conditionalize the code to provide it in debug mode.
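Putting those changes together, the loop would read roughly like this (a sketch of the merged form, based on the original code above):

[fortran]!dec$ ivdep
do ix = sps, spe
  xx = x_out(ix,iy)
  xr = 1.0 / xx
  c = b(ix,iy) - xx                &
    - a(2,ix,iy) * x_in(ix  ,iy-1) &
    - a(3,ix,iy) * x_in(ix-1,iy  ) &
    - a(4,ix,iy) * x_in(ix  ,iy  ) &
    - a(5,ix,iy) * x_in(ix  ,iy+1)
  c = omega * c
  errff = max(abs(c * xr), errff)  ! max-reduction replaces the errs array
  x_out(ix,iy) = xx + c
end do[/fortran]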

Jim Dempsey

TimP
Honored Contributor III
[fortran]
...
  real, dimension(0:nx,2:5,0:ny), intent(in) :: a
...

  do ix = sps, spe
!dir$ distribute point
    xx = x_out(ix,iy)
    c = b(ix,iy) - xx                &
      - a(ix,2,iy) * x_in(ix  ,iy-1) &
      - a(ix,3,iy) * x_in(ix-1,iy  ) &
      - a(ix,4,iy) * x_in(ix  ,iy  ) &
      - a(ix,5,iy) * x_in(ix  ,iy+1)
    c = omega * c
    errs(ix) = abs(c / xx)
    x_out(ix,iy) = xx + c
    errff = max(errs(ix), errff)
  end do
[/fortran]


If the original question was meant to be how to persuade ifort to vectorize without splitting the loop, it appears you would require the DISTRIBUTE POINT directive. The single vectorized loop does appear to perform better than the split loops.
When you see multiple PARTIAL LOOP WAS VECTORIZED diagnostics (in the absence of DISTRIBUTE POINT to prevent splitting) in a case such as this, where there is a partial vector loop for each array assignment, you can be assured that all the partial loops are vectorized, even without looking at the generated code.
If you wish to avoid scalar loads for the a() elements, you must swap the subscript order and compile for Nehalem (-xSSE4.2), and hope that no Nehalem-only instructions creep in. Clearly, the compiler is aware that the split unaligned loads are faster on Harpertown. In this case, you could take advantage of that when running on Westmere by specifying the SSE4.1 (HTN) or SSE2 optimization.
mriedman
Novice
Thanks a lot, Tim and Jim,

indeed I can merge the 2 loops and discard the intermediate errs array if the explicit inversion is removed. That gets me cleaner and slightly faster (~5%) code. The divide and the max() operations are then vectorized as they should be.
The scalar loads remain even if I compile with -xSSE4.1; this option just changes the ordering of instructions, but it does not use any new instructions. Assuming that there is no big penalty for scalar loads, this solution is good enough for now.
The ivdep directive is not needed, nor is the vector directive. I had to introduce ivdep in the original code version, where the red and black data (x_in, x_out) is in the same array, just separated by an extra dimension.

I'm going to make further tests with ifort 12 and AVX. Shouldn't vectorized loads be desirable with AVX? Does anyone have experience with this?
TimP
Honored Contributor III
I saw the same: the compiler chose not to use movups for unaligned loads under -xSSE4.1. As I said, the unaligned load is expected to take longer than the split scalar loads on an HTN CPU (and, in my experience, on WSM also).
You can try the AVX options simply to generate the .s file, without attempting to run. As expected, the compiler uses avx-128 unaligned load pairs for my modified version of your source code (one avx-128 unaligned store pair and one avx-256 aligned store). Only data which resides in L1 cache and works with avx-256 aligned loads could be expected to benefit from 256-bit loads on Sandy Bridge.
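For example, something along these lines should be enough to inspect the generated code (assuming the 12.0 compiler; -xAVX is the relevant switch there):

> ifort -r8 -O3 -xAVX -S -vec-report rb.f90

which writes rb.s without producing an executable.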
mriedman
Novice
The AVX assembly code looks really weird with the original dimensioning of array a. Lots of operations are needed to pack things into the 256-bit registers. But at least the loop is unrolled by 4 instead of 2, and the ymm registers are really used.
The assembly of your modified version with swapped dimensions looks much more straightforward.

My final question is: what can a programmer do to get aligned instead of unaligned loads? How does the compiler distinguish data accesses and generate the different load types? Would it help to copy data into a static array? Or is this finally something that cannot be controlled from Fortran source?
TimP
Honored Contributor III
As your code explicitly accesses array sections whose origins differ by 1, at least one of them has to be misaligned. Since alignment matters more for stored arrays, the compiler builds in a remainder loop to adjust one of the stored arrays for alignment, hence the use of avx-256 aligned access for one array. Among the ways to let the compiler see the relative alignment of arrays would be COMMON or, in principle, MODULE arrays.
In a situation where you use the same array repeatedly in an inner loop, and can copy it to an aligned stride-1 array in the outer loop, that can easily improve efficiency. If you use a local array, the compiler should attempt to make it aligned.
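As an illustration, one way to make alignment visible to the compiler (the module, array name, and size here are hypothetical; ALIGN 16 suits SSE, 32 would be the AVX analogue):

[fortran]module work_arrays
  implicit none
  integer, parameter :: nmax = 1024    ! hypothetical upper bound on nx
  ! statically allocated copy buffer whose alignment the compiler
  ! can see at every point of use
  real :: xcopy(0:nmax)
  !dec$ attributes align : 16 :: xcopy
end module work_arrays[/fortran]

Copying the reused row into xcopy in the outer loop then lets the inner loop use aligned loads.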