Intel® Fortran Compiler

Vectorisation issues with allocatable array

dhenty
Beginner
977 Views

I have the following kernel (all arrays are integers):

    where(old(1:M,1:N) /= 0) &
         new(1:M,1:N) = max(old(1:M,1:N), old(0:M-1,1:N), &
                                          old(2:M+1,1:N), &
                                          old(1:M,0:N-1), &
                                          old(1:M,2:N+1) )

and it is about 3 times slower if I use allocatable arrays rather than statically declared ones (in both cases the dimensions are fixed at compile time).

I have an equivalent loop-based C version which also shows the same effect - 3 times slower with malloc'd arrays vs static arrays. However, in C there is a genuine potential pointer-aliasing issue between new and old and this can be fixed with an "ivdep" on the inner loop. In Fortran there is surely no potential aliasing issue even with allocatables so why is the compiler not vectorising? Can I apply "ivdep" to array syntax expressions like the above?

 

11 Replies
Steve_Lionel
Honored Contributor III
953 Views

Please provide a small but compilable test case. I would be interested to see what the optimization report has to say about it. The use of WHERE may also be an issue.

dhenty
Beginner
933 Views

With the appended code I get about 1.3 seconds with static arrays and 2.7 with allocatables:

dsh@laptop$ ifort --version
ifort (IFORT) 2021.2.0 20210228
Copyright (C) 1985-2021 Intel Corporation. All rights reserved.

dsh@laptop$ ifort -O3 -o wheretest wheretest.f90  # static
dsh@laptop$ time ./wheretest
new(1,1) = 575

real 0m1.265s
user 0m1.254s
sys 0m0.008s
dsh@laptop$ ifort -O3 -o wheretest wheretest.f90  # allocatables
dsh@laptop$ time ./wheretest
new(1,1) = 575

real 0m2.727s
user 0m2.722s
sys 0m0.005s

program wheretest

  implicit none

  integer, parameter :: M = 576, N = 576
  integer :: i

  integer, dimension(0:M+1,0:N+1) :: old, new

!  integer, dimension(:,:), allocatable :: old, new
!  allocate(old(0:M+1,0:N+1), new(0:M+1,0:N+1) )

  old(:,:) =  reshape( [ (mod(i,M), i=1,(M+2)*(N+2)) ], shape(old) )

  do i = 1, 4000

     where(old(1:M,1:N) /= 0) &

          new(1:M,1:N) = max(old(1:M,1:N), old(0:M-1,1:N), &
                                           old(2:M+1,1:N), &
                                           old(1:M,0:N-1), &
                                           old(1:M,2:N+1)    )

     old(1:M,1:N) = new(1:M,1:N)
     
  end do

  write(*,*) "new(1,1) = ", new(1,1)
  
end program wheretest

 

andrew_4619
Honored Contributor II
925 Views

Are you timing the whole program? Is the time taken to allocate significant? Timing just the work loop might be more informative.
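One way to time only the work loop is sketched below, using the standard SYSTEM_CLOCK intrinsic. This is an illustration built on the posted test case, not a measured result: the program name wheretest_timed and the variables t_start, t_end and rate are made up for the example, and zeroing new before the loop is an added assumption so that the masked assignment never leaves elements undefined.

```fortran
program wheretest_timed

  use, intrinsic :: iso_fortran_env, only: int64
  implicit none

  integer, parameter :: M = 576, N = 576
  integer :: i
  integer(int64) :: t_start, t_end, rate   ! hypothetical names for this sketch

  integer, dimension(:,:), allocatable :: old, new

  ! Allocation and initialisation happen before the timer starts
  allocate(old(0:M+1,0:N+1), new(0:M+1,0:N+1))
  old(:,:) = reshape( [ (mod(i,M), i=1,(M+2)*(N+2)) ], shape(old) )
  new(:,:) = 0   ! assumption: zero new so masked-out elements are defined

  call system_clock(t_start, rate)   ! rate = clock ticks per second

  do i = 1, 4000
     where(old(1:M,1:N) /= 0) &
          new(1:M,1:N) = max(old(1:M,1:N), old(0:M-1,1:N), &
                                           old(2:M+1,1:N), &
                                           old(1:M,0:N-1), &
                                           old(1:M,2:N+1)    )
     old(1:M,1:N) = new(1:M,1:N)
  end do

  call system_clock(t_end)

  write(*,*) "loop time (s) = ", real(t_end - t_start) / real(rate)
  write(*,*) "new(1,1) = ", new(1,1)

end program wheretest_timed
```

Swapping the allocatable declaration and the allocate statement for the static declaration gives the comparison case with the same timer placement.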

dhenty
Beginner
922 Views

Initialisation is insignificant compared to the 4000 iterations of the "do" loop - doubling the trip count to 8000 doubles the elapsed time.

jimdempseyatthecove
Honored Contributor III
879 Views

Your program has a bug in it.

Line 24 (old(1:M,1:N) = new(1:M,1:N)) copies undefined values of new into old wherever old contained 0, because the WHERE mask never assigns those elements of new.

I suggest you use:

...
  do i = 1, 4000

     new(1:M,1:N) = max(old(1:M,1:N), old(0:M-1,1:N), &
                                           old(2:M+1,1:N), &
                                           old(1:M,0:N-1), &
                                           old(1:M,2:N+1)    )

     where(old(1:M,1:N) /= 0) old(1:M,1:N) = new(1:M,1:N)
     
  end do
...

Jim Dempsey

dhenty
Beginner
871 Views

When I hastily ripped this kernel from the main program I forgot the initialisation of new, which should be set to zero outside the main loop. However, fixing this doesn't significantly affect the result: the loop is still almost twice as fast with static arrays as with allocatables.

 

 

JohnNichols
Valued Contributor III
855 Views

Could you explain what you are trying to achieve? There are reasons to prefer each alternative, but the best choice depends on the rest of the program.

dhenty
Beginner
827 Views

My question is: why does identical code run twice as fast with static arrays as with allocatables? What the code does isn't really relevant; it's just representative of simple stencil operations. The difference appears to be due to vectorisation because, in the equivalent C code, adding #pragma ivdep fixes the issue for malloc'd arrays.

Barbara_P_Intel
Moderator
814 Views

Did you look at the optimization reports?  The static version was vectorized.

 

dhenty
Beginner
786 Views

The report confirms that the static version is being vectorised:

LOOP BEGIN at wheretest.f90(20,11)
<Peeled loop for vectorization>
LOOP END

LOOP BEGIN at wheretest.f90(20,11)
remark #15300: LOOP WAS VECTORIZED
LOOP END

LOOP BEGIN at wheretest.f90(20,11)
<Remainder loop for vectorization>
LOOP END

but with allocatables it isn't:

LOOP BEGIN at wheretest.f90(20,11)
remark #25460: No loop optimizations reported

LOOP BEGIN at wheretest.f90(20,11)
remark #25460: No loop optimizations reported
LOOP END
LOOP END

I'd still like to understand why, and whether there is a directive I could use here to force vectorisation, as I was able to do with #pragma ivdep in the C version.

Steve_Lionel
Honored Contributor III
774 Views

Intel Fortran supports:

!DIR$ IVDEP

See IVDEP (intel.com)

This doesn't "force" vectorization, and even the name is somewhat misleading. There are other directives you can specify that will help the compiler vectorize; see Rules for General Directives that Affect DO Loops (intel.com). In particular, look at VECTOR and NOVECTOR (intel.com).
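As a rough sketch of where the directive would go, here is the kernel rewritten with explicit DO loops and !DIR$ IVDEP placed immediately before the inner loop. The loop rewrite, the program name ivdep_sketch, the index names i, j and iter, and the zeroing of new are all assumptions made for illustration. Whether this actually restores vectorisation for the allocatable case was not reported in this thread, so check the optimization report after trying it.

```fortran
program ivdep_sketch

  implicit none

  integer, parameter :: M = 576, N = 576
  integer :: i, j, iter

  integer, dimension(:,:), allocatable :: old, new

  allocate(old(0:M+1,0:N+1), new(0:M+1,0:N+1))
  old(:,:) = reshape( [ (mod(i,M), i=1,(M+2)*(N+2)) ], shape(old) )
  new(:,:) = 0   ! assumption: zero new so skipped elements are defined

  do iter = 1, 4000

     do j = 1, N
        ! Hint to the compiler that the inner loop has no assumed
        ! loop-carried dependences; compilers that do not recognise
        ! the directive treat this line as an ordinary comment.
        !DIR$ IVDEP
        do i = 1, M
           if (old(i,j) /= 0) then
              new(i,j) = max(old(i,j), old(i-1,j), old(i+1,j), &
                                       old(i,j-1), old(i,j+1))
           end if
        end do
     end do

     old(1:M,1:N) = new(1:M,1:N)

  end do

  write(*,*) "new(1,1) = ", new(1,1)

end program ivdep_sketch
```

The inner loop runs over the first index so that memory access is contiguous, which is the usual candidate for vectorisation. The VECTOR directive mentioned above could be tried in the same position as an alternative hint.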
