Alberto_R_1

04-18-2016
01:45 AM

Problem vectorizing FORTRAN code

Hi all,

My original code had some problems vectorizing well. Following suggestions from different places (including several Intel articles), I decided that restructuring the data layout was the most promising approach.

Instead of defining an array of complex matrices:

    type su3
       complex (kind=8) :: comp(3,3)
    end type

    type field
       type (su3), allocatable :: v(:)
    end type

which makes vectorization complicated (how do you vectorize the multiplication of a single 3x3 matrix?), I have decided to invert the order of the layout:

    type grid_complex
       complex (kind=8), allocatable :: v(:)
    end type

    type field
       type (grid_complex) :: comp(3,3)
    end type

(i.e. AoS versus SoA data structures). This approach works for me because I *always* have to multiply O(1000) matrices at the same time. The idea was that this typical operation can now be coded in a routine like the following:

    module procedure su3xsu3(g,a,b)
      type (field), intent (in) :: a,b
      type (field), intent (inout) :: g
      integer :: i, j

      do concurrent (i=1:3,j=1:3)
         g%comp(i,j)%v = &
              a%comp(i,1)%v*b%comp(1,j)%v + &
              a%comp(i,2)%v*b%comp(2,j)%v + &
              a%comp(i,3)%v*b%comp(3,j)%v
      end do

      return
    end procedure su3xsu3
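For readers who want to try this, here is a minimal self-contained sketch of the SoA layout and the routine above. It is not my production code: the module name `su3_soa`, the kind parameter `dp` and the helper `field_alloc` are made up for the example, the routine is written as an ordinary module subroutine, and plain nested `do` loops replace the `do concurrent`. It assumes that `g` is already allocated with the same length as `a` and `b`, and that `g` does not alias either of them.

```fortran
module su3_soa
  implicit none
  private
  public :: dp, grid_complex, field, field_alloc, su3xsu3

  ! Double precision kind (kind=8 with ifort and gfortran).
  integer, parameter :: dp = kind(1.0d0)

  ! One contiguous array over lattice sites per matrix entry (SoA).
  type grid_complex
     complex(dp), allocatable :: v(:)
  end type grid_complex

  type field
     type(grid_complex) :: comp(3,3)
  end type field

contains

  ! Hypothetical helper: allocate all nine component arrays with n sites
  ! each and set them to zero.
  subroutine field_alloc(f, n)
    type(field), intent(inout) :: f
    integer, intent(in) :: n
    integer :: i, j
    do j = 1, 3
       do i = 1, 3
          if (allocated(f%comp(i,j)%v)) deallocate(f%comp(i,j)%v)
          allocate(f%comp(i,j)%v(n))
          f%comp(i,j)%v = (0.0_dp, 0.0_dp)
       end do
    end do
  end subroutine field_alloc

  ! g = a*b for every site at once: each statement is a unit-stride
  ! whole-array operation over the site index.
  subroutine su3xsu3(g, a, b)
    type(field), intent(inout) :: g
    type(field), intent(in) :: a, b
    integer :: i, j
    do j = 1, 3
       do i = 1, 3
          g%comp(i,j)%v = &
               a%comp(i,1)%v*b%comp(1,j)%v + &
               a%comp(i,2)%v*b%comp(2,j)%v + &
               a%comp(i,3)%v*b%comp(3,j)%v
       end do
    end do
  end subroutine su3xsu3

end module su3_soa
```

The innermost (implicit) loop runs over the site index, so it is the one the compiler is expected to vectorize; the 3x3 loops only select which component arrays are combined.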

I was assuming that the data layout is now perfectly aligned, and that vectorization is therefore straightforward. The surprise is what the Intel Fortran v16 optimization report says (ifort -opt-report5 -O3 -xavx -align array128byte -opt-assume-safe-padding):

    LOOP BEGIN at src/group_su3.f90(56,5)
       remark #25101: Loop Interchange not done due to: Original Order seems proper
       remark #25452: Original Order found to be proper, but by a close margin
       remark #15542: loop was not vectorized: inner loop was already vectorized

       LOOP BEGIN at src/group_su3.f90(56,5)
          remark #15542: loop was not vectorized: inner loop was already vectorized

          LOOP BEGIN at src/group_su3.f90(57,8)
             remark #15389: vectorization support: reference g has unaligned access
             remark #15389: vectorization support: reference a has unaligned access
             remark #15389: vectorization support: reference b has unaligned access
             remark #15389: vectorization support: reference a has unaligned access
             remark #15389: vectorization support: reference b has unaligned access
             remark #15389: vectorization support: reference a has unaligned access
             remark #15389: vectorization support: reference b has unaligned access
             remark #15381: vectorization support: unaligned access used inside loop body
             remark #15305: vectorization support: vector length 2
             remark #15399: vectorization support: unroll factor set to 2
             remark #15309: vectorization support: normalized vectorization overhead 0.239
             remark #15300: LOOP WAS VECTORIZED
             remark #15450: unmasked unaligned unit stride loads: 6
             remark #15451: unmasked unaligned unit stride stores: 1
             remark #15475: --- begin vector loop cost summary ---
             remark #15476: scalar loop cost: 33
             remark #15477: vector loop cost: 22.000
             remark #15478: estimated potential speedup: 1.490
             remark #15488: --- end vector loop cost summary ---
             remark #25015: Estimate of max trip count of loop=128
          LOOP END
       LOOP END
    LOOP END

This means that, for some reason beyond my understanding, the compiler claims that the variables a and b have incorrect alignment in memory. Therefore, although it vectorizes the loop, it does not expect much improvement (estimated potential speedup of 1.49).

Since I am asking for AVX instructions (SIMD width of 256 bits), which can process 2 double precision complex numbers at a time, I would expect a naive speedup of 2x.
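The arithmetic behind that expectation: a double precision complex number takes 2 x 64 = 128 bits, so a 256-bit AVX register holds 256/128 = 2 of them, which matches the "vector length 2" in the report. A small sketch of the same calculation follows; the module name `simd_lanes` and the function `lanes_per_register` are made up for the illustration, and the estimate ignores any load/store or shuffle overhead.

```fortran
module simd_lanes
  implicit none
  private
  public :: lanes_per_register

contains

  ! Number of double precision complex values that fit in a SIMD
  ! register of the given width in bits (256 for AVX).
  pure integer function lanes_per_register(register_bits) result(lanes)
    integer, intent(in) :: register_bits
    complex(kind(1.0d0)) :: z   ! used only to query its storage size
    lanes = register_bits / storage_size(z)
  end function lanes_per_register

end module simd_lanes
```

With compilers that store a double precision complex in 128 bits (ifort and gfortran do), `lanes_per_register(256)` evaluates to 2, hence the naive 2x figure.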

Now, if I add the -ipo option, I get (ifort -ipo -opt-report5 -O3 -xavx -align array128byte -opt-assume-safe-padding):

    LOOP BEGIN at src/group_su3.f90(56,5) inlined into src/main.f90(10,8)
       remark #25101: Loop Interchange not done due to: Original Order seems proper
       remark #25452: Original Order found to be proper, but by a close margin
       remark #15542: loop was not vectorized: inner loop was already vectorized

       LOOP BEGIN at src/group_su3.f90(56,5) inlined into src/main.f90(10,8)
          remark #15542: loop was not vectorized: inner loop was already vectorized

          LOOP BEGIN at src/group_su3.f90(57,8) inlined into src/main.f90(10,8)
             remark #15388: vectorization support: reference a.COMP.V has aligned access
             remark #15388: vectorization support: reference c.COMP.V has aligned access
             remark #15388: vectorization support: reference b.COMP.V has aligned access
             remark #15388: vectorization support: reference c.COMP.V has aligned access
             remark #15388: vectorization support: reference b.COMP.V has aligned access
             remark #15388: vectorization support: reference c.COMP.V has aligned access
             remark #15388: vectorization support: reference b.COMP.V has aligned access
             remark #15305: vectorization support: vector length 2
             remark #15399: vectorization support: unroll factor set to 2
             remark #15300: LOOP WAS VECTORIZED
             remark #15448: unmasked aligned unit stride loads: 6
             remark #15449: unmasked aligned unit stride stores: 1
             remark #15475: --- begin vector loop cost summary ---
             remark #15476: scalar loop cost: 33
             remark #15477: vector loop cost: 14.000
             remark #15478: estimated potential speedup: 2.350
             remark #15488: --- end vector loop cost summary ---
             remark #25015: Estimate of max trip count of loop=128
          LOOP END
       LOOP END
    LOOP END

The compiler correctly identifies the components v(:) to be aligned and vectorizes the loop.

Now the questions:

1) What am I missing? Why does the original loop, without inlining the function, not vectorize correctly?

2) A more general question: does all this shuffling of the data layout make sense to you? Could I expect an improvement from using an SoA layout if my code spends most of its time in small matrix multiplications? Any other hints for writing efficient code for modern architectures while remaining standard conforming?

Many thanks!

A.