
## Problem vectorizing FORTRAN code

Hi all,

My original code had some problems vectorizing well. Following suggestions from different places (including several Intel articles), I decided that restructuring the data layout was the most promising approach.

Instead of defining an array of complex matrices:

```
type su3
   complex (kind=8) :: comp(3,3)
end type

type field
   type (su3), allocatable :: v(:)
end type
```

which makes vectorization complicated (how do you vectorize the multiplication of a single 3x3 matrix?), I have decided to invert the order of the layout:

```
type grid_complex
   complex (kind=8), allocatable :: v(:)
end type

type field
   type (grid_complex) :: comp(3,3)
end type
```

(i.e. AoS versus SoA data structures). This approach works for me because I *always* have to multiply O(1000) matrices at the same time. The idea was that this typical operation can now be coded in a routine like the following:

```
module procedure su3xsu3(g,a,b)
  type (field), intent (in) :: a,b
  type (field), intent (inout) :: g
  integer :: i, j

  do concurrent (i=1:3,j=1:3)
     g%comp(i,j)%v = &
          a%comp(i,1)%v*b%comp(1,j)%v + &
          a%comp(i,2)%v*b%comp(2,j)%v + &
          a%comp(i,3)%v*b%comp(3,j)%v
  end do

  return
end procedure su3xsu3
```
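
For anyone who wants to reproduce this, here is a minimal self-contained sketch of the SoA layout and the product routine, with a check against `matmul` applied site by site. The module name `su3_soa`, the helper `field_alloc`, the kind parameter `dp` and the site count of 1000 are my own choices for illustration and are not taken from the real code; the routine is written as a plain subroutine so that the example compiles on its own.

```fortran
module su3_soa
  implicit none
  integer, parameter :: dp = kind(1.0d0)   ! assumed equivalent to kind=8

  type grid_complex
     complex(dp), allocatable :: v(:)      ! one matrix entry, all sites
  end type grid_complex

  type field
     type(grid_complex) :: comp(3,3)
  end type field

contains

  ! Hypothetical helper: allocate the nine component arrays with n sites.
  subroutine field_alloc(f, n)
    type(field), intent(out) :: f
    integer, intent(in) :: n
    integer :: i, j
    do j = 1, 3
       do i = 1, 3
          allocate(f%comp(i,j)%v(n))
       end do
    end do
  end subroutine field_alloc

  ! g = a*b at every site; each array assignment runs over all sites.
  subroutine su3xsu3(g, a, b)
    type(field), intent(in)    :: a, b
    type(field), intent(inout) :: g
    integer :: i, j
    do concurrent (i=1:3, j=1:3)
       g%comp(i,j)%v = &
            a%comp(i,1)%v*b%comp(1,j)%v + &
            a%comp(i,2)%v*b%comp(2,j)%v + &
            a%comp(i,3)%v*b%comp(3,j)%v
    end do
  end subroutine su3xsu3

end module su3_soa

program demo
  use su3_soa
  implicit none
  integer, parameter :: n = 1000
  type(field) :: a, b, g
  complex(dp) :: ma(3,3), mb(3,3), mg(3,3)
  real(dp) :: re(n), im(n), err
  integer :: i, j, k

  call field_alloc(a, n)
  call field_alloc(b, n)
  call field_alloc(g, n)

  ! Fill a and b with random complex entries.
  do j = 1, 3
     do i = 1, 3
        call random_number(re)
        call random_number(im)
        a%comp(i,j)%v = cmplx(re, im, kind=dp)
        call random_number(re)
        call random_number(im)
        b%comp(i,j)%v = cmplx(re, im, kind=dp)
     end do
  end do

  call su3xsu3(g, a, b)

  ! Gather each site back into 3x3 matrices and compare with matmul.
  err = 0.0_dp
  do k = 1, n
     do j = 1, 3
        do i = 1, 3
           ma(i,j) = a%comp(i,j)%v(k)
           mb(i,j) = b%comp(i,j)%v(k)
           mg(i,j) = g%comp(i,j)%v(k)
        end do
     end do
     err = max(err, maxval(abs(mg - matmul(ma, mb))))
  end do

  print '(a,es10.3)', 'max deviation from matmul: ', err
  if (err > 1.0e-12_dp) error stop 'SoA product disagrees with matmul'
end program demo
```

The check only confirms that the SoA routine computes the same product as the AoS formulation; it says nothing about how well either version vectorizes.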

I was assuming that the data layout is now perfectly aligned, and that vectorization is therefore straightforward. The surprise is that the Intel Fortran v16 optimization report says (ifort -opt-report5 -O3 -xavx -align array128byte -opt-assume-safe-padding):

```
LOOP BEGIN at src/group_su3.f90(56,5)
remark #25101: Loop Interchange not done due to: Original Order seems proper
remark #25452: Original Order found to be proper, but by a close margin
remark #15542: loop was not vectorized: inner loop was already vectorized

LOOP BEGIN at src/group_su3.f90(56,5)
remark #15542: loop was not vectorized: inner loop was already vectorized

LOOP BEGIN at src/group_su3.f90(57,8)
remark #15389: vectorization support: reference g has unaligned access
remark #15389: vectorization support: reference a has unaligned access
remark #15389: vectorization support: reference b has unaligned access
remark #15389: vectorization support: reference a has unaligned access
remark #15389: vectorization support: reference b has unaligned access
remark #15389: vectorization support: reference a has unaligned access
remark #15389: vectorization support: reference b has unaligned access
remark #15381: vectorization support: unaligned access used inside loop body
remark #15305: vectorization support: vector length 2
remark #15399: vectorization support: unroll factor set to 2
remark #15309: vectorization support: normalized vectorization overhead 0.239
remark #15300: LOOP WAS VECTORIZED
remark #15451: unmasked unaligned unit stride stores: 1
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 33
remark #15477: vector loop cost: 22.000
remark #15478: estimated potential speedup: 1.490
remark #15488: --- end vector loop cost summary ---
remark #25015: Estimate of max trip count of loop=128
LOOP END
LOOP END
LOOP END
```

This means that, for some reason beyond my understanding, the compiler claims that the variables g, a and b are not correctly aligned in memory, and therefore, although it vectorizes the loop, it does not expect much improvement.
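
As a side check, the alignment that an allocated array actually gets at run time can be inspected directly. The following is only a sketch: the program name and the array size are made up, it uses a standalone array rather than a component of `field`, and it assumes that `c_double_complex` corresponds to `kind=8` (as it does with ifort and gfortran). It prints the start address modulo 32 bytes, the alignment needed for aligned 256-bit AVX loads and stores.

```fortran
program check_alignment
  use iso_c_binding, only: c_loc, c_intptr_t, c_double_complex
  implicit none
  ! "target" is required so that c_loc may take the address of v.
  complex(c_double_complex), allocatable, target :: v(:)
  integer(c_intptr_t) :: addr

  allocate(v(1000))

  ! Reinterpret the C address of the first element as an integer.
  addr = transfer(c_loc(v), addr)

  ! A result of 0 means the array starts on a 32-byte boundary.
  print '(a,i0)', 'address mod 32 = ', mod(addr, 32_c_intptr_t)
end program check_alignment
```

The result depends on the compiler, the alignment options and the allocator, so this reports what happened in one run. It does not show what the compiler is able to assume at compile time inside a separately compiled routine.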

Since I am asking for AVX instructions (SIMD width of 256 bits), which can process 2 double precision complex numbers at a time, I would expect a naive speedup of 2x.
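
The arithmetic behind that expectation can be written out as a tiny program. This is just an illustrative sketch; the program and variable names are mine, and it assumes that `kind=8` selects double precision complex, as it does with ifort and gfortran.

```fortran
program simd_lanes
  implicit none
  integer, parameter :: avx_bits = 256   ! width of one AVX register
  complex(kind=8) :: z
  integer :: elem_bits, lanes

  ! storage_size returns the size of one element in bits.
  elem_bits = storage_size(z)
  lanes = avx_bits / elem_bits

  print '(a,i0)', 'bits per complex(kind=8): ', elem_bits
  print '(a,i0)', 'elements per AVX register: ', lanes
end program simd_lanes
```

With two 64-bit reals per complex number, one element takes 128 bits, so two elements fit in a register. That is consistent with the "vector length 2" line in the report.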

Now if I add the -ipo option, I get (ifort -ipo -opt-report5 -O3 -xavx -align array128byte -opt-assume-safe-padding):

```
LOOP BEGIN at src/group_su3.f90(56,5) inlined into src/main.f90(10,8)
remark #25101: Loop Interchange not done due to: Original Order seems proper
remark #25452: Original Order found to be proper, but by a close margin
remark #15542: loop was not vectorized: inner loop was already vectorized

LOOP BEGIN at src/group_su3.f90(56,5) inlined into src/main.f90(10,8)
remark #15542: loop was not vectorized: inner loop was already vectorized

LOOP BEGIN at src/group_su3.f90(57,8) inlined into src/main.f90(10,8)
remark #15388: vectorization support: reference a.COMP.V has aligned access
remark #15388: vectorization support: reference c.COMP.V has aligned access
remark #15388: vectorization support: reference b.COMP.V has aligned access
remark #15388: vectorization support: reference c.COMP.V has aligned access
remark #15388: vectorization support: reference b.COMP.V has aligned access
remark #15388: vectorization support: reference c.COMP.V has aligned access
remark #15388: vectorization support: reference b.COMP.V has aligned access
remark #15305: vectorization support: vector length 2
remark #15399: vectorization support: unroll factor set to 2
remark #15300: LOOP WAS VECTORIZED
remark #15449: unmasked aligned unit stride stores: 1
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 33
remark #15477: vector loop cost: 14.000
remark #15478: estimated potential speedup: 2.350
remark #15488: --- end vector loop cost summary ---
remark #25015: Estimate of max trip count of loop=128
LOOP END
LOOP END
LOOP END
```

The compiler correctly identifies the components v(:) to be aligned and vectorizes the loop.

Now the questions:

1) What am I missing? Why does the original loop, when the function is not inlined, not vectorize properly?

2) A more general question: does all this shuffling of the data layout make sense to you? Can I expect an improvement from an SoA layout if my code spends most of its time in small matrix multiplications? Any other hints for writing efficient code for modern architectures while remaining standard conforming?

Many thanks!

A.
