Problem vectorizing FORTRAN code

Alberto_R_1 · 04-18-2016

Hi all,

My original code had some problems getting good vectorization. Following suggestions from different places (including several Intel articles), I decided that restructuring the data layout was the most promising approach.

Instead of defining an array of complex matrices:

  type su3
     complex (kind=8) :: comp(3,3)
  end type

  type field
     type (su3), allocatable :: v(:)
  end type

which makes vectorization complicated (how do you vectorize the multiplication of a single 3x3 matrix?), I have decided to invert the order of the layout:

  type grid_complex
     complex (kind=8), allocatable :: v(:)
  end type

  type field
     type (grid_complex) :: comp(3,3)
  end type

(i.e. AoS versus SoA data structures). This approach works for me because I *always* have to multiply O(1000) matrices at the same time. The idea was that this typical operation can now be coded in a routine like the following:

  subroutine su3xsu3(g,a,b)
    type (field), intent (in) :: a,b
    type (field), intent (inout) :: g
    integer :: i, j

    ! Each assignment acts on the whole arrays v(:), one entry per matrix
    do concurrent (i=1:3,j=1:3)
       g%comp(i,j)%v = &
            a%comp(i,1)%v*b%comp(1,j)%v + &
            a%comp(i,2)%v*b%comp(2,j)%v + &
            a%comp(i,3)%v*b%comp(3,j)%v
    end do

  end subroutine su3xsu3
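
For anyone who wants to reproduce this, here is a minimal self-contained sketch of the SoA multiplication, checked site by site against the intrinsic matmul. It is not my production code: the module name su3_soa, the helper alloc_field, the program check_su3 and the tolerance 1.0e-12 are illustrative assumptions, and the routine is written as a plain module subroutine.

```fortran
! Minimal sketch (illustrative names): SoA product of nsites 3x3 complex
! matrices, verified against matmul for every site.
module su3_soa
  use, intrinsic :: iso_fortran_env, only: dp => real64
  implicit none

  type grid_complex
     complex(dp), allocatable :: v(:)
  end type grid_complex

  type field
     type(grid_complex) :: comp(3,3)
  end type field

contains

  ! Hypothetical helper: allocate all nine component arrays
  subroutine alloc_field(f, nsites)
    type(field), intent(out) :: f
    integer, intent(in) :: nsites
    integer :: i, j
    do j = 1, 3
       do i = 1, 3
          allocate(f%comp(i,j)%v(nsites))
       end do
    end do
  end subroutine alloc_field

  ! g = a*b for all sites at once; each statement is a whole-array operation
  subroutine su3xsu3(g, a, b)
    type(field), intent(in) :: a, b
    type(field), intent(inout) :: g
    integer :: i, j
    do concurrent (i=1:3, j=1:3)
       g%comp(i,j)%v = a%comp(i,1)%v*b%comp(1,j)%v + &
                       a%comp(i,2)%v*b%comp(2,j)%v + &
                       a%comp(i,3)%v*b%comp(3,j)%v
    end do
  end subroutine su3xsu3

end module su3_soa

program check_su3
  use su3_soa
  implicit none
  integer, parameter :: nsites = 1000
  type(field) :: a, b, g
  complex(dp) :: ma(3,3), mb(3,3), mg(3,3)
  real(dp) :: re(nsites), im(nsites), err
  integer :: i, j, s

  call alloc_field(a, nsites)
  call alloc_field(b, nsites)
  call alloc_field(g, nsites)

  ! Fill a and b with random complex entries
  do j = 1, 3
     do i = 1, 3
        call random_number(re)
        call random_number(im)
        a%comp(i,j)%v = cmplx(re, im, kind=dp)
        call random_number(re)
        call random_number(im)
        b%comp(i,j)%v = cmplx(re, im, kind=dp)
     end do
  end do

  call su3xsu3(g, a, b)

  ! Gather each site back into ordinary 3x3 matrices and compare
  err = 0.0_dp
  do s = 1, nsites
     do j = 1, 3
        do i = 1, 3
           ma(i,j) = a%comp(i,j)%v(s)
           mb(i,j) = b%comp(i,j)%v(s)
           mg(i,j) = g%comp(i,j)%v(s)
        end do
     end do
     err = max(err, maxval(abs(mg - matmul(ma, mb))))
  end do

  print *, 'max abs deviation from matmul:', err
  if (err > 1.0e-12_dp) error stop 'SoA product disagrees with matmul'
end program check_su3
```

Compiling this single file with the same options as above should give an optimization report for the array assignment inside su3xsu3 that can be compared with the ones quoted here.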

I was assuming that the data layout is now perfectly aligned, and that vectorization would therefore be straightforward. The surprise is that the Intel Fortran v16 optimization report says (ifort -opt-report5 -O3 -xavx -align array128byte -opt-assume-safe-padding):

LOOP BEGIN at src/group_su3.f90(56,5)
   remark #25101: Loop Interchange not done due to: Original Order seems proper
   remark #25452: Original Order found to be proper, but by a close margin
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at src/group_su3.f90(56,5)
      remark #15542: loop was not vectorized: inner loop was already vectorized

      LOOP BEGIN at src/group_su3.f90(57,8)
         remark #15389: vectorization support: reference g has unaligned access
         remark #15389: vectorization support: reference a has unaligned access
         remark #15389: vectorization support: reference b has unaligned access
         remark #15389: vectorization support: reference a has unaligned access
         remark #15389: vectorization support: reference b has unaligned access
         remark #15389: vectorization support: reference a has unaligned access
         remark #15389: vectorization support: reference b has unaligned access
         remark #15381: vectorization support: unaligned access used inside loop body
         remark #15305: vectorization support: vector length 2
         remark #15399: vectorization support: unroll factor set to 2
         remark #15309: vectorization support: normalized vectorization overhead 0.239
         remark #15300: LOOP WAS VECTORIZED
         remark #15450: unmasked unaligned unit stride loads: 6 
         remark #15451: unmasked unaligned unit stride stores: 1 
         remark #15475: --- begin vector loop cost summary ---
         remark #15476: scalar loop cost: 33 
         remark #15477: vector loop cost: 22.000 
         remark #15478: estimated potential speedup: 1.490 
         remark #15488: --- end vector loop cost summary ---
         remark #25015: Estimate of max trip count of loop=128
      LOOP END
   LOOP END
LOOP END

This means that, for some reason beyond my understanding, the compiler claims that the references to g, a and b do not have the correct alignment in memory. Therefore, although it vectorizes the loop, it does not expect much improvement.
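
The remarks only say what the compiler can prove at compile time, not where the arrays actually end up. A quick run-time check of the addresses might help separate the two. This is a minimal sketch under a few assumptions: 32 bytes is the boundary of interest (what aligned 256-bit AVX loads require), transferring a c_ptr to integer(c_intptr_t) yields the address (true on common platforms), and the program name check_alignment is only illustrative.

```fortran
! Minimal sketch: print the address of each component array modulo 32 bytes.
! A result of 0 means that array starts on a 32-byte boundary.
program check_alignment
  use, intrinsic :: iso_c_binding, only: c_loc, c_intptr_t
  use, intrinsic :: iso_fortran_env, only: dp => real64
  implicit none

  type grid_complex
     complex(dp), allocatable :: v(:)
  end type grid_complex

  type field
     type(grid_complex) :: comp(3,3)
  end type field

  integer, parameter :: nsites = 1000
  type(field), target :: a          ! target is needed for c_loc
  integer(c_intptr_t) :: addr
  integer :: i, j

  do j = 1, 3
     do i = 1, 3
        allocate(a%comp(i,j)%v(nsites))
        ! Assumption: the bit pattern of the c_ptr is the memory address
        addr = transfer(c_loc(a%comp(i,j)%v), addr)
        print '(a,i0,a,i0,a,i0)', 'comp(', i, ',', j, &
             ') address mod 32 = ', mod(addr, 32_c_intptr_t)
     end do
  end do
end program check_alignment
```

If every line prints 0 when built with the options above, the data itself is aligned and the "unaligned access" remarks would only reflect what the compiler can see inside the non-inlined routine.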

Since I am asking for AVX instructions (SIMD width of 256 bits), which can
process 2 double precision complex numbers at a time, I would naively expect
a speedup of 2x.

Now if I add the -ipo option, I get (ifort -ipo -opt-report5 -O3 -xavx -align array128byte -opt-assume-safe-padding):

LOOP BEGIN at src/group_su3.f90(56,5) inlined into src/main.f90(10,8)
   remark #25101: Loop Interchange not done due to: Original Order seems proper
   remark #25452: Original Order found to be proper, but by a close margin
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at src/group_su3.f90(56,5) inlined into src/main.f90(10,8)
      remark #15542: loop was not vectorized: inner loop was already vectorized

      LOOP BEGIN at src/group_su3.f90(57,8) inlined into src/main.f90(10,8)
         remark #15388: vectorization support: reference a.COMP.V has aligned access
         remark #15388: vectorization support: reference c.COMP.V has aligned access
         remark #15388: vectorization support: reference b.COMP.V has aligned access
         remark #15388: vectorization support: reference c.COMP.V has aligned access
         remark #15388: vectorization support: reference b.COMP.V has aligned access
         remark #15388: vectorization support: reference c.COMP.V has aligned access
         remark #15388: vectorization support: reference b.COMP.V has aligned access
         remark #15305: vectorization support: vector length 2
         remark #15399: vectorization support: unroll factor set to 2
         remark #15300: LOOP WAS VECTORIZED
         remark #15448: unmasked aligned unit stride loads: 6 
         remark #15449: unmasked aligned unit stride stores: 1 
         remark #15475: --- begin vector loop cost summary ---
         remark #15476: scalar loop cost: 33 
         remark #15477: vector loop cost: 14.000 
         remark #15478: estimated potential speedup: 2.350 
         remark #15488: --- end vector loop cost summary ---
         remark #25015: Estimate of max trip count of loop=128
      LOOP END
   LOOP END
LOOP END

The compiler correctly identifies the components v(:) to be aligned and vectorizes the loop.

Now the questions:

1) What am I missing? Why does the original loop, without inlining the function, not vectorize with aligned accesses?

2) A more general question: Does all this shuffling of the data layout make
sense to you? Can I expect an improvement from a SoA layout if my code
spends most of its time in small matrix multiplications? Any other hints
for writing efficient code for modern architectures while remaining
standard conforming?

Many thanks!

A.