dhenty
Beginner
251 Views

Vectorisation issues with allocatable array

I have the following kernel (all arrays are integers):

where(old(1:M,1:N) /= 0) &
     new(1:M,1:N) = max(old(1:M,1:N), old(0:M-1,1:N), &
                                      old(2:M+1,1:N), &
                                      old(1:M,0:N-1), &
                                      old(1:M,2:N+1) )

and it is about 3 times slower if I use allocatable arrays rather than declaring them statically (in both cases the dimensions are fixed at compile time).

I have an equivalent loop-based C version which also shows the same effect - 3 times slower with malloc'd arrays vs static arrays. However, in C there is a genuine potential pointer-aliasing issue between new and old and this can be fixed with an "ivdep" on the inner loop. In Fortran there is surely no potential aliasing issue even with allocatables so why is the compiler not vectorising? Can I apply "ivdep" to array syntax expressions like the above?

 

11 Replies
Steve_Lionel
Black Belt Retired Employee
227 Views

Please provide a small but compilable test case. I would be interested to see what the optimization report has to say about it. The use of WHERE may also be an issue.

dhenty
Beginner
207 Views

With the appended code I get about 1.3 seconds with static arrays and 2.7 with allocatables:

dsh@laptop$ ifort --version
ifort (IFORT) 2021.2.0 20210228
Copyright (C) 1985-2021 Intel Corporation. All rights reserved.

dsh@laptop$ ifort -O3 -o wheretest wheretest.f90  # static
dsh@laptop$ time ./wheretest
new(1,1) = 575

real 0m1.265s
user 0m1.254s
sys 0m0.008s
dsh@laptop$ ifort -O3 -o wheretest wheretest.f90  # allocatables
dsh@laptop$ time ./wheretest
new(1,1) = 575

real 0m2.727s
user 0m2.722s
sys 0m0.005s

program wheretest

  implicit none

  integer, parameter :: M = 576, N = 576
  integer :: i

  integer, dimension(0:M+1,0:N+1) :: old, new

!  integer, dimension(:,:), allocatable :: old, new
!  allocate(old(0:M+1,0:N+1), new(0:M+1,0:N+1) )

  old(:,:) =  reshape( [ (mod(i,M), i=1,(M+2)*(N+2)) ], shape(old) )

  do i = 1, 4000

     where(old(1:M,1:N) /= 0) &

          new(1:M,1:N) = max(old(1:M,1:N), old(0:M-1,1:N), &
                                           old(2:M+1,1:N), &
                                           old(1:M,0:N-1), &
                                           old(1:M,2:N+1)    )

     old(1:M,1:N) = new(1:M,1:N)
     
  end do

  write(*,*) "new(1,1) = ", new(1,1)
  
end program wheretest

 

andrew_4619
Valued Contributor III
199 Views

Are you timing the whole program? Is the time taken to allocate significant? Timing just the work loop might be more informative.
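As a minimal sketch of timing only the work loop, the test program could wrap the loop in calls to the standard SYSTEM_CLOCK intrinsic. The variable names t_start, t_end and rate are illustrative assumptions rather than part of the original code, and new is zeroed before the loop so that masked-out elements are defined.

```fortran
program wheretest_timed

  use, intrinsic :: iso_fortran_env, only: int64
  implicit none

  integer, parameter :: M = 576, N = 576
  integer :: i
  integer(int64) :: t_start, t_end, rate   ! hypothetical timing variables

  integer, dimension(:,:), allocatable :: old, new

  allocate(old(0:M+1,0:N+1), new(0:M+1,0:N+1))

  old(:,:) = reshape( [ (mod(i,M), i=1,(M+2)*(N+2)) ], shape(old) )
  new(:,:) = 0   ! define elements that the WHERE mask never assigns

  ! Start the clock after allocation and initialisation
  call system_clock(t_start, rate)

  do i = 1, 4000
     where(old(1:M,1:N) /= 0) &
          new(1:M,1:N) = max(old(1:M,1:N), old(0:M-1,1:N), &
                                           old(2:M+1,1:N), &
                                           old(1:M,0:N-1), &
                                           old(1:M,2:N+1) )
     old(1:M,1:N) = new(1:M,1:N)
  end do

  call system_clock(t_end)

  write(*,*) "new(1,1) = ", new(1,1)
  write(*,*) "loop time (s) = ", real(t_end - t_start) / real(rate)

end program wheretest_timed
```

Swapping the allocatable declaration and the allocate statement for the static declaration gives the comparison figure for the loop alone.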

dhenty
Beginner
196 Views

Initialisation is insignificant compared to the 4000 iterations of the "do" loop - doubling the trip count to 8000 doubles the elapsed time.

jimdempseyatthecove
Black Belt
153 Views

Your program has a bug in it.

Line 24 copies undefined values of new into old at the indices where old contained 0.

I suggest you use:

...
  do i = 1, 4000

     new(1:M,1:N) = max(old(1:M,1:N), old(0:M-1,1:N), &
                                           old(2:M+1,1:N), &
                                           old(1:M,0:N-1), &
                                           old(1:M,2:N+1)    )

     where(old(1:M,1:N) /= 0) old(1:M,1:N) = new(1:M,1:N)
     
  end do
...

Jim Dempsey

dhenty
Beginner
145 Views

When I hastily ripped this kernel from the main program I forgot the initialisation of new, which should be set to zero outside the main loop. However, this doesn't significantly affect the result: the loop is still almost twice as fast with static arrays as with allocatables.
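To illustrate why the initialisation matters, here is a small stand-alone sketch (the five-element arrays and the program name are made up for the example). A masked WHERE assignment only touches elements where the mask is true, so the other elements of new keep whatever they held before.

```fortran
program where_mask_demo

  implicit none

  integer :: old(5), new(5)

  old = [3, 0, 7, 0, 2]
  new = 0   ! without this, new(2) and new(4) would be undefined below

  ! Only elements 1, 3 and 5 are assigned; elements 2 and 4 are skipped
  where (old /= 0) new = old + 1

  ! Elements 2 and 4 still hold their initial value of 0
  write(*,*) new

end program where_mask_demo
```

With the initialisation in place the values written are 4, 0, 8, 0, 3; the spacing of list-directed output depends on the compiler.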

 

 

JohnNichols
Valued Contributor I
129 Views

Could you explain what you are trying to achieve? There are reasons for each of the alternatives, but the best choice depends on the rest of the program.

dhenty
Beginner
101 Views

My question is: why does identical code run twice as fast with static arrays as with allocatables? What the code does isn't really relevant - it's just representative of simple stencil operations. The difference appears to be due to vectorisation because, in an equivalent C code, adding #pragma ivdep fixes the issue for malloc'd arrays.

Barbara_P_Intel
Moderator
88 Views

Did you look at the optimization reports?  The static version was vectorized.

 

dhenty
Beginner
60 Views

The report confirms that the static version is being vectorised:

LOOP BEGIN at wheretest.f90(20,11)
<Peeled loop for vectorization>
LOOP END

LOOP BEGIN at wheretest.f90(20,11)
remark #15300: LOOP WAS VECTORIZED
LOOP END

LOOP BEGIN at wheretest.f90(20,11)
<Remainder loop for vectorization>
LOOP END

but with allocatables it isn't:

LOOP BEGIN at wheretest.f90(20,11)
remark #25460: No loop optimizations reported

LOOP BEGIN at wheretest.f90(20,11)
remark #25460: No loop optimizations reported
LOOP END
LOOP END

I'd still like to understand why, and whether there is a directive I could use here to force vectorisation, as I was able to do using #pragma ivdep in the C version.

Steve_Lionel
Black Belt Retired Employee
48 Views

Intel Fortran supports:

!DIR$ IVDEP

See IVDEP (intel.com)

This doesn't "force" vectorization, and even the name is somewhat misleading. There are other directives you can specify that will help the compiler vectorize; see Rules for General Directives that Affect DO Loops (intel.com). In particular, look at VECTOR and NOVECTOR (intel.com).
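A rough sketch of how the directive might be applied follows. It assumes the kernel is rewritten as explicit DO loops, since the directive is documented as applying to a DO loop and this thread does not establish whether it has any effect on array-syntax WHERE statements. The loop variables j and iter and the program name are additions for the example. Whether this restores vectorisation for the allocatable case is untested here; the optimization report would need to confirm it. Compilers that do not recognise !DIR$ treat the line as an ordinary comment.

```fortran
program looptest

  implicit none

  integer, parameter :: M = 576, N = 576
  integer :: i, j, iter

  integer, dimension(:,:), allocatable :: old, new

  allocate(old(0:M+1,0:N+1), new(0:M+1,0:N+1))

  old(:,:) = reshape( [ (mod(i,M), i=1,(M+2)*(N+2)) ], shape(old) )
  new(:,:) = 0   ! define elements that are never assigned in the loop

  do iter = 1, 4000

     do j = 1, N
        ! Asks the compiler to ignore assumed (unproven) dependences
        ! in the loop that immediately follows
        !DIR$ IVDEP
        do i = 1, M
           if (old(i,j) /= 0) then
              new(i,j) = max(old(i,j), old(i-1,j), old(i+1,j), &
                                       old(i,j-1), old(i,j+1))
           end if
        end do
     end do

     old(1:M,1:N) = new(1:M,1:N)

  end do

  write(*,*) "new(1,1) = ", new(1,1)

end program looptest
```

The inner loop runs over the first index so that memory accesses are contiguous, which matches Fortran's column-major storage.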
