Vectorisation issues with allocatable array

dhenty · ‎04-15-2021

I have the following kernel (all arrays are integers):

where(old(1:M,1:N) /= 0) &
     new(1:M,1:N) = max(old(1:M,1:N), old(0:M-1,1:N), &
                                      old(2:M+1,1:N), &
                                      old(1:M,0:N-1), &
                                      old(1:M,2:N+1) )

and it is about 3 times slower if I use allocatable arrays rather than declaring them statically (all with dimensions fixed at compile time).

I have an equivalent loop-based C version which also shows the same effect - 3 times slower with malloc'd arrays vs static arrays. However, in C there is a genuine potential pointer-aliasing issue between new and old and this can be fixed with an "ivdep" on the inner loop. In Fortran there is surely no potential aliasing issue even with allocatables so why is the compiler not vectorising? Can I apply "ivdep" to array syntax expressions like the above?

Steve_Lionel · ‎04-15-2021

Please provide a small but compilable test case. I would be interested to see what the optimization report has to say about it. The use of WHERE may also be an issue.

dhenty · ‎04-16-2021

With the appended code I get about 1.3 seconds with static arrays and 2.7 with allocatables:

dsh@laptop$ ifort --version
ifort (IFORT) 2021.2.0 20210228
Copyright (C) 1985-2021 Intel Corporation. All rights reserved.

dsh@laptop$ ifort -O3 -o wheretest wheretest.f90 # static
dsh@laptop$ time ./wheretest
new(1,1) = 575

real 0m1.265s
user 0m1.254s
sys 0m0.008s
dsh@laptop$ ifort -O3 -o wheretest wheretest.f90 # allocatables
dsh@laptop$ time ./wheretest
new(1,1) = 575

real 0m2.727s
user 0m2.722s
sys 0m0.005s

program wheretest

  implicit none

  integer, parameter :: M = 576, N = 576
  integer :: i

  integer, dimension(0:M+1,0:N+1) :: old, new

!  integer, dimension(:,:), allocatable :: old, new
!  allocate(old(0:M+1,0:N+1), new(0:M+1,0:N+1) )

  old(:,:) =  reshape( [ (mod(i,M), i=1,(M+2)*(N+2)) ], shape(old) )

  do i = 1, 4000

     where(old(1:M,1:N) /= 0) &

          new(1:M,1:N) = max(old(1:M,1:N), old(0:M-1,1:N), &
                                           old(2:M+1,1:N), &
                                           old(1:M,0:N-1), &
                                           old(1:M,2:N+1)    )

     old(1:M,1:N) = new(1:M,1:N)
     
  end do

  write(*,*) "new(1,1) = ", new(1,1)
  
end program wheretest

andrew_4619 · ‎04-16-2021

Are you timing the whole program? Is the time taken to allocate significant? Maybe a timing around the work might be more interesting.

dhenty · ‎04-16-2021

Initialisation is insignificant compared to the 4000 iterations of the "do" loop - doubling the trip count to 8000 doubles the elapsed time.
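As a minimal sketch of the suggestion to time only the work, the loop can be bracketed with the standard SYSTEM_CLOCK intrinsic so that allocation and initialisation fall outside the measurement. The program name and the variables t_start, t_end and rate are hypothetical names introduced for illustration. The sketch also zeroes new before the loop, which the posted test case does not do.

```fortran
program wheretest_timed

  use, intrinsic :: iso_fortran_env, only : int64

  implicit none

  integer, parameter :: M = 576, N = 576
  integer :: i
  integer(int64) :: t_start, t_end, rate   ! hypothetical timing variables

  integer, dimension(:,:), allocatable :: old, new

  ! Allocation and initialisation are kept outside the timed region
  allocate(old(0:M+1,0:N+1), new(0:M+1,0:N+1))

  old(:,:) = reshape( [ (mod(i,M), i=1,(M+2)*(N+2)) ], shape(old) )
  new(:,:) = 0   ! assumption: defined start values, not in the posted code

  ! Read the clock and its tick rate just before the work starts
  call system_clock(t_start, rate)

  do i = 1, 4000

     where(old(1:M,1:N) /= 0) &
          new(1:M,1:N) = max(old(1:M,1:N), old(0:M-1,1:N), &
                                           old(2:M+1,1:N), &
                                           old(1:M,0:N-1), &
                                           old(1:M,2:N+1)    )

     old(1:M,1:N) = new(1:M,1:N)

  end do

  ! Read the clock again once the work has finished
  call system_clock(t_end)

  write(*,*) "new(1,1) = ", new(1,1)
  write(*,*) "loop time (s) = ", real(t_end - t_start) / real(rate)

end program wheretest_timed
```

Printing new(1,1) after the timed region keeps the result live, so the compiler cannot discard the loop as dead code.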

jimdempseyatthecove · ‎04-18-2021

Your program has a bug in it.

Line 24 (old(1:M,1:N) = new(1:M,1:N)) copies undefined values of new into old at every index where old contained 0, because the WHERE never assigns new at those indices.

I suggest you use:

...
  do i = 1, 4000

     new(1:M,1:N) = max(old(1:M,1:N), old(0:M-1,1:N), &
                                           old(2:M+1,1:N), &
                                           old(1:M,0:N-1), &
                                           old(1:M,2:N+1)    )

     where(old(1:M,1:N) /= 0) old(1:M,1:N) = new(1:M,1:N)
     
  end do
...

Jim Dempsey

dhenty · ‎04-19-2021

When I hastily ripped this kernel from the main program I forgot the initialisation of new, which should be set to zero outside the main loop. However, this doesn't significantly affect the result: the loop is still almost twice as fast with static arrays as with allocatables.
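The two formulations discussed here (the original kernel with new zeroed before the loop, and the reordered version that computes new everywhere and masks the copy back into old) can be compared directly. The following is only a sketch: the program name and the _a/_b array names are hypothetical, and the iteration count is reduced to 100 to keep the run short.

```fortran
program where_variants

  implicit none

  integer, parameter :: M = 576, N = 576
  integer :: i

  ! Hypothetical names: _a is the original kernel, _b the reordered one
  integer, dimension(:,:), allocatable :: old_a, new_a, old_b, new_b

  allocate(old_a(0:M+1,0:N+1), new_a(0:M+1,0:N+1))
  allocate(old_b(0:M+1,0:N+1), new_b(0:M+1,0:N+1))

  old_a(:,:) = reshape( [ (mod(i,M), i=1,(M+2)*(N+2)) ], shape(old_a) )
  old_b(:,:) = old_a(:,:)

  ! Zero both work arrays outside the main loop
  new_a(:,:) = 0
  new_b(:,:) = 0

  do i = 1, 100

     ! Variant A: masked stencil update, then unconditional copy
     where(old_a(1:M,1:N) /= 0) &
          new_a(1:M,1:N) = max(old_a(1:M,1:N), old_a(0:M-1,1:N), &
                                               old_a(2:M+1,1:N), &
                                               old_a(1:M,0:N-1), &
                                               old_a(1:M,2:N+1)    )

     old_a(1:M,1:N) = new_a(1:M,1:N)

     ! Variant B: unconditional stencil update, then masked copy
     new_b(1:M,1:N) = max(old_b(1:M,1:N), old_b(0:M-1,1:N), &
                                          old_b(2:M+1,1:N), &
                                          old_b(1:M,0:N-1), &
                                          old_b(1:M,2:N+1)    )

     where(old_b(1:M,1:N) /= 0) old_b(1:M,1:N) = new_b(1:M,1:N)

  end do

  ! Compare the evolved fields; the new arrays are not compared because
  ! variant B also writes new where old is zero
  write(*,*) "old arrays identical: ", all(old_a == old_b)

end program where_variants
```

Only old is compared because the two variants intentionally leave different values in new at points where old is zero.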

JohnNichols · ‎04-19-2021

Could you explain what you are trying to achieve? There are reasons to prefer each alternative, but the best choice depends on what else the program does.

dhenty · ‎04-20-2021

My question is: why does identical code run twice as fast with static arrays vs allocatables? What the code does isn't really that relevant - it's just representative of simple stencil operations. It appears to be due to vectorisation because, in the equivalent C code, adding #pragma ivdep fixes the issue for malloc'd arrays.

Barbara_P_Intel · ‎04-20-2021

Did you look at the optimization reports? The static version was vectorized.

dhenty · ‎04-22-2021

The report confirms that the static version is being vectorised:

LOOP BEGIN at wheretest.f90(20,11)
<Peeled loop for vectorization>
LOOP END

LOOP BEGIN at wheretest.f90(20,11)
remark #15300: LOOP WAS VECTORIZED
LOOP END

LOOP BEGIN at wheretest.f90(20,11)
<Remainder loop for vectorization>
LOOP END

but with allocatables it isn't:

LOOP BEGIN at wheretest.f90(20,11)
remark #25460: No loop optimizations reported

LOOP BEGIN at wheretest.f90(20,11)
remark #25460: No loop optimizations reported
LOOP END
LOOP END

I'd still like to understand why, and whether there is a directive I could use here to force vectorisation, as I was able to do using #pragma ivdep in the C version.

Steve_Lionel · ‎04-22-2021

Intel Fortran supports:

!DIR$ IVDEP

See IVDEP (intel.com)

This doesn't "force" vectorization, and even the name is somewhat misleading. There are other directives you can specify that will help the compiler vectorize (Rules for General Directives that Affect DO Loops (intel.com)). In particular, look at VECTOR and NOVECTOR (intel.com).
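One way to try the directive is to rewrite the WHERE construct as explicit DO loops and place !DIR$ IVDEP immediately before the inner loop, mirroring the C version. This is a sketch only: the program name and the loop indices iter and j are hypothetical, new is zeroed as discussed earlier in the thread, and whether the directive actually restores vectorisation for the allocatable case has not been verified here, so the optimization report should be checked.

```fortran
program wheretest_loops

  implicit none

  integer, parameter :: M = 576, N = 576
  integer :: i, j, iter   ! j and iter are hypothetical names

  integer, dimension(:,:), allocatable :: old, new

  allocate(old(0:M+1,0:N+1), new(0:M+1,0:N+1))

  old(:,:) = reshape( [ (mod(i,M), i=1,(M+2)*(N+2)) ], shape(old) )
  new(:,:) = 0

  do iter = 1, 4000

     do j = 1, N

        ! Intel-specific hint that assumed dependences in the next loop
        ! can be ignored; other compilers treat this line as a comment
        !DIR$ IVDEP
        do i = 1, M
           ! Same mask as the WHERE construct: update non-zero points only
           if (old(i,j) /= 0) then
              new(i,j) = max(old(i,j), old(i-1,j), old(i+1,j), &
                                       old(i,j-1), old(i,j+1))
           end if
        end do

     end do

     old(1:M,1:N) = new(1:M,1:N)

  end do

  write(*,*) "new(1,1) = ", new(1,1)

end program wheretest_loops
```

The inner loop runs over the first index so that memory accesses are contiguous in Fortran's column-major layout, which is the access pattern a vectoriser generally prefers.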