I have the following kernel (all arrays are integers):
where(old(1:M,1:N) /= 0) &
   new(1:M,1:N) = max(old(1:M,1:N), old(0:M-1,1:N), &
                      old(2:M+1,1:N), &
                      old(1:M,0:N-1), &
                      old(1:M,2:N+1) )
and it is about 3 times slower if I use allocatable arrays rather than declaring them statically (in both cases the dimensions are fixed at compile time).
I have an equivalent loop-based C version which also shows the same effect - 3 times slower with malloc'd arrays vs static arrays. However, in C there is a genuine potential pointer-aliasing issue between new and old and this can be fixed with an "ivdep" on the inner loop. In Fortran there is surely no potential aliasing issue even with allocatables so why is the compiler not vectorising? Can I apply "ivdep" to array syntax expressions like the above?
Please provide a small but compilable test case. I would be interested to see what the optimization report has to say about it. The use of WHERE may also be an issue.
With the appended code I get about 1.3 seconds with static arrays and 2.7 with allocatables:
dsh@laptop$ ifort --version
ifort (IFORT) 2021.2.0 20210228
Copyright (C) 1985-2021 Intel Corporation. All rights reserved.
dsh@laptop$ ifort -O3 -o wheretest wheretest.f90 # static
dsh@laptop$ time ./wheretest
new(1,1) = 575
real 0m1.265s
user 0m1.254s
sys 0m0.008s
dsh@laptop$ ifort -O3 -o wheretest wheretest.f90 # allocatables
dsh@laptop$ time ./wheretest
new(1,1) = 575
real 0m2.727s
user 0m2.722s
sys 0m0.005s
program wheretest
  implicit none
  integer, parameter :: M = 576, N = 576
  integer :: i
  integer, dimension(0:M+1,0:N+1) :: old, new
  ! integer, dimension(:,:), allocatable :: old, new
  ! allocate(old(0:M+1,0:N+1), new(0:M+1,0:N+1) )
  old(:,:) = reshape( [ (mod(i,M), i=1,(M+2)*(N+2)) ], shape(old) )
  do i = 1, 4000
     where(old(1:M,1:N) /= 0) &
        new(1:M,1:N) = max(old(1:M,1:N), old(0:M-1,1:N), &
                           old(2:M+1,1:N), &
                           old(1:M,0:N-1), &
                           old(1:M,2:N+1) )
     old(1:M,1:N) = new(1:M,1:N)
  end do
  write(*,*) "new(1,1) = ", new(1,1)
end program wheretest
Are you timing the whole program? Is the time taken to allocate significant? Maybe a timing around the work might be more interesting.
Initialisation is insignificant compared to the 4000 iterations of the "do" loop - doubling the trip count to 8000 doubles the elapsed time.
Your program has a bug in it.
Line 24 (the copy from new back to old) copies undefined values of new into old at every index where old contained 0, because the where statement never assigns new at those indices.
I suggest you use:
...
do i = 1, 4000
   new(1:M,1:N) = max(old(1:M,1:N), old(0:M-1,1:N), &
                      old(2:M+1,1:N), &
                      old(1:M,0:N-1), &
                      old(1:M,2:N+1) )
   where(old(1:M,1:N) /= 0) old(1:M,1:N) = new(1:M,1:N)
end do
...
Jim Dempsey
When I hastily ripped this kernel from the main program I forgot the initialisation of new which should be set to zero outside of the main loop. However, this doesn't significantly affect the result where the loop is almost twice as fast for static arrays vs allocatables.
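For reference, here is a minimal sketch of the test case with the two points raised above folded in: new is set to zero before the main loop, and the timing is taken around the kernel only rather than the whole program. The program name wheretest_timed and the timing variables t0, t1 and rate are illustrative additions, not part of the original code; this is the allocatable variant.

```fortran
program wheretest_timed
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer, parameter :: M = 576, N = 576
  integer :: i
  integer(int64) :: t0, t1, rate        ! illustrative timing variables
  integer, dimension(:,:), allocatable :: old, new

  allocate(old(0:M+1,0:N+1), new(0:M+1,0:N+1))

  old(:,:) = reshape( [ (mod(i,M), i=1,(M+2)*(N+2)) ], shape(old) )
  new(:,:) = 0   ! define new everywhere so the copy-back never reads undefined values

  ! Time only the kernel, excluding allocation and initialisation.
  call system_clock(t0, rate)
  do i = 1, 4000
     where(old(1:M,1:N) /= 0) &
        new(1:M,1:N) = max(old(1:M,1:N), old(0:M-1,1:N), &
                           old(2:M+1,1:N), &
                           old(1:M,0:N-1), &
                           old(1:M,2:N+1) )
     old(1:M,1:N) = new(1:M,1:N)
  end do
  call system_clock(t1)

  write(*,*) "new(1,1) = ", new(1,1)
  write(*,*) "kernel time (s) = ", real(t1 - t0) / real(rate)
end program wheretest_timed
```

To compare against the static case, swap the allocatable declaration and the allocate statement for the fixed-size declaration, as in the original test.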
Could you explain what you are trying to achieve? There are reasons for preferring each alternative, but the best choice depends on the rest of the program.
My question is: why does identical code run twice as fast with static arrays as with allocatables? What the code does isn't really that relevant - it's just representative of simple stencil operations. The difference appears to be due to vectorisation because, in an equivalent C code, adding #pragma ivdep fixes the issue for malloc'd arrays.
Did you look at the optimization reports? The static version was vectorized.
The report confirms that the static version is being vectorised:
LOOP BEGIN at wheretest.f90(20,11)
<Peeled loop for vectorization>
LOOP END
LOOP BEGIN at wheretest.f90(20,11)
remark #15300: LOOP WAS VECTORIZED
LOOP END
LOOP BEGIN at wheretest.f90(20,11)
<Remainder loop for vectorization>
LOOP END
but with allocatables it isn't:
LOOP BEGIN at wheretest.f90(20,11)
remark #25460: No loop optimizations reported
LOOP BEGIN at wheretest.f90(20,11)
remark #25460: No loop optimizations reported
LOOP END
LOOP END
but I'd still like to understand why, and whether there is a directive I could use here to force vectorisation as I was able to do using #pragma ivdep in the C version.
Intel Fortran supports:
!DIR$ IVDEP
This doesn't "force" vectorization, and even the name is somewhat misleading. There are other directives you can specify that will help the compiler vectorize; see "Rules for General Directives that Affect DO Loops" (intel.com). In particular, look at "VECTOR and NOVECTOR" (intel.com).
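As a rough sketch of how the directive could be tried on this kernel: the stencil is written below as explicit DO loops, with !DIR$ IVDEP placed immediately before the inner loop, and the result is checked against the array-syntax form. The program name stencil_loops, the reduced sizes and the ref array are illustrative assumptions. Whether this actually gets the allocatable case vectorised has not been verified in this thread, so check the optimization report; compilers that do not recognise !DIR$ treat it as an ordinary comment.

```fortran
program stencil_loops
  implicit none
  integer, parameter :: M = 64, N = 64   ! small illustrative sizes (assumption)
  integer :: i, j
  integer, dimension(:,:), allocatable :: old, new, ref

  allocate(old(0:M+1,0:N+1), new(0:M+1,0:N+1), ref(0:M+1,0:N+1))
  old(:,:) = reshape( [ (mod(i,M), i=1,(M+2)*(N+2)) ], shape(old) )
  new(:,:) = 0
  ref(:,:) = 0

  ! Reference result using the array syntax from the thread.
  where(old(1:M,1:N) /= 0) &
     ref(1:M,1:N) = max(old(1:M,1:N), old(0:M-1,1:N), &
                        old(2:M+1,1:N), &
                        old(1:M,0:N-1), &
                        old(1:M,2:N+1) )

  ! Same kernel as explicit loops. new and old are distinct arrays, so the
  ! inner loop carries no dependence and the IVDEP assertion is safe here.
  do j = 1, N
     !DIR$ IVDEP
     do i = 1, M
        if (old(i,j) /= 0) then
           new(i,j) = max(old(i,j), old(i-1,j), old(i+1,j), &
                          old(i,j-1), old(i,j+1))
        end if
     end do
  end do

  ! Stop with an error if the two formulations disagree.
  if (any(new /= ref)) error stop "loop and array-syntax results differ"
  write(*,*) "loop and array-syntax results match"
end program stencil_loops
```

The inner loop runs over the first index so that memory accesses are contiguous, which is the layout a vectoriser generally prefers for Fortran's column-major arrays.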