Hi all,
My original code had some problems vectorizing well. Following suggestions from different places (including several Intel articles), I decided that restructuring the data layout was the most promising approach.
Instead of defining an array of complex matrices:
    type su3
       complex (kind=8) :: comp(3,3)
    end type

    type field
       type (su3), allocatable :: v(:)
    end type
which makes vectorization complicated (how does one vectorize the multiplication of a single 3x3 matrix?), I have decided to invert the order of the layout:
    type grid_complex
       complex (kind=8), allocatable :: v(:)
    end type

    type field
       type (grid_complex) :: comp(3,3)
    end type
(i.e. going from an AoS to a SoA data structure). This approach works for me because I *always* have to multiply O(1000) matrices at the same time. The idea was that this typical operation can now be coded in a routine like the following:
    module procedure su3xsu3(g,a,b)
      type (field), intent (in) :: a,b
      type (field), intent (inout) :: g
      integer :: i, j

      do concurrent (i=1:3,j=1:3)
         g%comp(i,j)%v = &
              a%comp(i,1)%v*b%comp(1,j)%v + &
              a%comp(i,2)%v*b%comp(2,j)%v + &
              a%comp(i,3)%v*b%comp(3,j)%v
      end do

      return
    end procedure su3xsu3
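For anyone who wants to reproduce this, here is a minimal self-contained sketch of the same SoA product, checked against the intrinsic matmul at a single site. The program name, the size n, the fill values and the checked site s are all assumptions made for illustration; they are not taken from my real code.

```fortran
program soa_su3_check
  use, intrinsic :: iso_fortran_env, only: dp => real64
  implicit none

  type grid_complex
     complex(kind=dp), allocatable :: v(:)
  end type grid_complex

  type field
     type(grid_complex) :: comp(3,3)
  end type field

  integer, parameter :: n = 1000   ! assumed number of matrices
  type(field) :: a, b, g
  complex(kind=dp) :: am(3,3), bm(3,3), gm(3,3)
  real(kind=dp) :: maxerr
  integer :: i, j, s

  ! Allocate every component and fill a and b with arbitrary test values.
  do j = 1, 3
     do i = 1, 3
        allocate(a%comp(i,j)%v(n), b%comp(i,j)%v(n), g%comp(i,j)%v(n))
        do s = 1, n
           a%comp(i,j)%v(s) = cmplx(i + s, j, kind=dp)
           b%comp(i,j)%v(s) = cmplx(j, i - s, kind=dp)
        end do
     end do
  end do

  ! SoA product: each statement works on whole arrays of length n.
  do concurrent (i=1:3, j=1:3)
     g%comp(i,j)%v = a%comp(i,1)%v*b%comp(1,j)%v + &
                     a%comp(i,2)%v*b%comp(2,j)%v + &
                     a%comp(i,3)%v*b%comp(3,j)%v
  end do

  ! Gather one site into ordinary 3x3 matrices and compare with matmul.
  s = 17                            ! arbitrary site to check
  do j = 1, 3
     do i = 1, 3
        am(i,j) = a%comp(i,j)%v(s)
        bm(i,j) = b%comp(i,j)%v(s)
     end do
  end do
  gm = matmul(am, bm)

  maxerr = 0.0_dp
  do j = 1, 3
     do i = 1, 3
        maxerr = max(maxerr, abs(gm(i,j) - g%comp(i,j)%v(s)))
     end do
  end do

  if (maxerr > 1.0e-9_dp) error stop 'SoA product differs from matmul'
  print *, 'SoA product agrees with matmul at site', s
end program soa_su3_check
```

This only checks correctness of the layout; it says nothing about alignment or speed, which is what the reports below are about.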
I was assuming that the data layout is now perfectly aligned, and that vectorization would therefore be straightforward. The surprise is what the Intel Fortran v16 optimization report says (ifort -opt-report5 -O3 -xavx -align array128byte -opt-assume-safe-padding):
    LOOP BEGIN at src/group_su3.f90(56,5)
       remark #25101: Loop Interchange not done due to: Original Order seems proper
       remark #25452: Original Order found to be proper, but by a close margin
       remark #15542: loop was not vectorized: inner loop was already vectorized

       LOOP BEGIN at src/group_su3.f90(56,5)
          remark #15542: loop was not vectorized: inner loop was already vectorized

          LOOP BEGIN at src/group_su3.f90(57,8)
             remark #15389: vectorization support: reference g has unaligned access
             remark #15389: vectorization support: reference a has unaligned access
             remark #15389: vectorization support: reference b has unaligned access
             remark #15389: vectorization support: reference a has unaligned access
             remark #15389: vectorization support: reference b has unaligned access
             remark #15389: vectorization support: reference a has unaligned access
             remark #15389: vectorization support: reference b has unaligned access
             remark #15381: vectorization support: unaligned access used inside loop body
             remark #15305: vectorization support: vector length 2
             remark #15399: vectorization support: unroll factor set to 2
             remark #15309: vectorization support: normalized vectorization overhead 0.239
             remark #15300: LOOP WAS VECTORIZED
             remark #15450: unmasked unaligned unit stride loads: 6
             remark #15451: unmasked unaligned unit stride stores: 1
             remark #15475: --- begin vector loop cost summary ---
             remark #15476: scalar loop cost: 33
             remark #15477: vector loop cost: 22.000
             remark #15478: estimated potential speedup: 1.490
             remark #15488: --- end vector loop cost summary ---
             remark #25015: Estimate of max trip count of loop=128
          LOOP END
       LOOP END
    LOOP END
This means that, for some reason beyond my understanding, the compiler claims that the arguments (g, a and b) have the wrong alignment in memory, and therefore, although it vectorizes the loop, it does not expect much improvement.
Since I am asking for AVX instructions (SIMD width of 256 bits), which can process 2 double-precision complex numbers at a time, I would expect a naive speedup of 2x.
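The arithmetic behind that estimate can be spelled out in a few lines. This is only a sketch: the 256-bit register width is hard-coded as an assumption about AVX, and the program and variable names are made up for illustration.

```fortran
program simd_lanes
  use, intrinsic :: iso_fortran_env, only: dp => real64
  implicit none

  integer, parameter :: avx_bits = 256   ! assumed AVX register width in bits
  complex(kind=dp) :: z
  integer :: lanes

  ! storage_size returns the size of one element in bits.
  lanes = avx_bits / storage_size(z)

  print *, 'bits per double-precision complex :', storage_size(z)
  print *, 'complex numbers per AVX register  :', lanes
end program simd_lanes
```

A double-precision complex number normally occupies 128 bits, so this gives 2 elements per register, which matches the "vector length 2" remark in the report and is where my naive 2x comes from.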
Now, if I add the -ipo option, I get (ifort -ipo -opt-report5 -O3 -xavx -align array128byte -opt-assume-safe-padding):
    LOOP BEGIN at src/group_su3.f90(56,5) inlined into src/main.f90(10,8)
       remark #25101: Loop Interchange not done due to: Original Order seems proper
       remark #25452: Original Order found to be proper, but by a close margin
       remark #15542: loop was not vectorized: inner loop was already vectorized

       LOOP BEGIN at src/group_su3.f90(56,5) inlined into src/main.f90(10,8)
          remark #15542: loop was not vectorized: inner loop was already vectorized

          LOOP BEGIN at src/group_su3.f90(57,8) inlined into src/main.f90(10,8)
             remark #15388: vectorization support: reference a.COMP.V has aligned access
             remark #15388: vectorization support: reference c.COMP.V has aligned access
             remark #15388: vectorization support: reference b.COMP.V has aligned access
             remark #15388: vectorization support: reference c.COMP.V has aligned access
             remark #15388: vectorization support: reference b.COMP.V has aligned access
             remark #15388: vectorization support: reference c.COMP.V has aligned access
             remark #15388: vectorization support: reference b.COMP.V has aligned access
             remark #15305: vectorization support: vector length 2
             remark #15399: vectorization support: unroll factor set to 2
             remark #15300: LOOP WAS VECTORIZED
             remark #15448: unmasked aligned unit stride loads: 6
             remark #15449: unmasked aligned unit stride stores: 1
             remark #15475: --- begin vector loop cost summary ---
             remark #15476: scalar loop cost: 33
             remark #15477: vector loop cost: 14.000
             remark #15478: estimated potential speedup: 2.350
             remark #15488: --- end vector loop cost summary ---
             remark #25015: Estimate of max trip count of loop=128
          LOOP END
       LOOP END
    LOOP END
The compiler correctly identifies the components v(:) to be aligned and vectorizes the loop.
Now the questions:
1) What am I missing? Why does the original loop, without inlining the function, not vectorize correctly?
2) A more general question: does all this shuffling of the data layout make sense to you? Could I expect an improvement from a SoA layout if my code spends most of its time in small matrix multiplications? Any other hints for writing efficient code for modern architectures while remaining standard conforming?
Many thanks!
A.