Intel® Fortran Compiler

Slowdown when using pointer instead of allocatable

eoseret
Hello,
When I compile the following loop with ifort 14.0.1, I get a ~15-20% speedup if the val field of i1_t is declared allocatable instead of pointer. The culprit is extra address-calculation instructions (2x LEA + IMUL, with the loop unrolled twice). Is ifort too conservative here, or is this a "performance bug"?
Thank you in advance
type i1_t
  integer, pointer :: val(:) ! speedup when replacing pointer with allocatable
  (...)
end type i1_t

do i = 1, size_outer
  elt => elts(i)%p ! type(elt) = i1_t

  do j = 1, size_inner
    array (elt%val(j)) = 0.
  end do
end do

In the full application (which uses many i1_t or similar structures in hot routines), replacing the pointer attribute with allocatable gave an overall 30-40% performance improvement. Compiling with -O3 instead of -O2 does not help. gfortran 4.8.2 shows a similar speedup (the only difference being that it does not unroll the loop by default).
All required files are enclosed: kernel.f90 and driver.f90

$ ifort -fpp -g -O2 -c kernel.f90 -o kernel_pointer.o
$ ifort -fpp -D ALLOC -g -O2 -c kernel.f90 -o kernel_allocatable.o
$ ifort driver.f90 kernel_pointer.o -o pointer
$ ifort driver.f90 kernel_allocatable.o -o alloc
$ time ./pointer
user    0m0.422s
$ time ./alloc
user    0m0.346s
Sergio

The gurus here will be able to give you a much better answer, but one reason Fortran has both allocatable arrays and pointers is performance. Using allocatable arrays avoids pointer aliasing, which otherwise limits optimizations such as vectorization and parallelization.
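To make the aliasing point concrete, here is a small sketch (the module alias_demo and the routine copy_forward are hypothetical names, not from the thread). Two POINTER dummies may legally refer to overlapping storage, so the compiler has to keep the element-by-element order of the loop unless it can prove the arrays do not overlap.

```fortran
module alias_demo
  implicit none
contains
  ! Copy src(i) into dst(i) one element at a time, in increasing i.
  ! dst and src are POINTER dummies, so they may legally refer to
  ! overlapping storage; the compiler must then preserve this exact
  ! order (or check for overlap at run time) before it can vectorize
  ! or otherwise reorder the loop.
  subroutine copy_forward(dst, src)
    ! intent(in) on a pointer: the association is fixed, the target is writable
    integer, pointer, intent(in) :: dst(:), src(:)
    integer :: i
    do i = 1, min(size(dst), size(src))
       dst(i) = src(i)
    end do
  end subroutine copy_forward
end module alias_demo
```

With x = [1, 2, 3, 4, 5], d => x(2:5) and s => x(1:4), call copy_forward(d, s) leaves x = [1, 1, 1, 1, 1], because each store feeds the next load. A version that loaded all of src before storing would give [1, 1, 2, 3, 4] instead, which is exactly the transformation the compiler cannot assume is safe for pointers.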

eoseret

Sergio wrote:

The gurus here will be able to give you a much better answer, but the reason for both allocatable arrays and pointers existing in Fortran is performance. Using allocatable arrays avoids pointer aliasing, which limits parallelism.

Thanks, understood: it is harder for the compiler to generate optimal code.

My new questions are then:

  • Is elt%val really considered by the compiler to be a pointer to an array? In that case, val(j+1) is contiguous with val(j), and no extra address computation instructions (LEA, IMUL...) should be required in the inner loop: base and index registers in memory operands should suffice, as with allocatable.
  • In other words, must any compiler that optimizes safely assume aliasing in the inner loop of my example? elt is constant, so elt%val is constant; only j varies, from 1 to size_inner (stride 1).
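For reference, a simplified model of how an element address is formed from an array descriptor (the exact descriptor layout is compiler-specific, so treat this as an assumption for illustration), where b is the base address, \ell the lower bound and s the stride in bytes:

```latex
\mathrm{addr}\bigl(\mathtt{val}(j)\bigr) = b + (j - \ell)\, s
```

For a plain pointer component, s is a run-time value loaded from the descriptor, so each access needs a multiply (or equivalent arithmetic). When the array is known to be contiguous (allocatable, or a CONTIGUOUS pointer), s is simply the element size (typically 4 bytes for default integer), and the address fits directly into a scaled-index addressing mode.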
IanH

I'm not clear whether your new questions assume an allocatable or pointer component.

As already mentioned, it is easier for the compiler to determine that an allocatable component isn't aliased with something else in the statement. From your code example, though, I'm not sure that's relevant here.

An allocatable component is also always contiguous. A pointer component without the CONTIGUOUS attribute may not be: it could be associated with an array section whose stride is not one. I think this is the issue here.

Note that you can declare a pointer component to be CONTIGUOUS (F2008), in which case it is always ... contiguous!
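A minimal sketch of that point, assuming an F2008 compiler (the module contig_demo, the types plain_t/contig_t and the two functions are hypothetical names). A plain pointer component can be associated with a strided section, so the loop must scale j by a stride read at run time; the standard requires the target of a CONTIGUOUS pointer to be contiguous, so unit stride may be assumed.

```fortran
module contig_demo
  implicit none

  ! A plain pointer component may be associated with any array section,
  ! including one with a non-unit stride such as buf(1:10:2).
  type :: plain_t
     integer, pointer :: val(:) => null()
  end type plain_t

  ! F2008: the target of a CONTIGUOUS pointer must be contiguous, so the
  ! compiler may assume val(j+1) immediately follows val(j) in memory.
  type :: contig_t
     integer, contiguous, pointer :: val(:) => null()
  end type contig_t

contains

  ! Element j lives at base + (j-1)*stride; the stride comes from the
  ! descriptor at run time (it may be 1, 2, ... elements).
  integer function sum_plain(x) result(s)
    type(plain_t), intent(in) :: x
    integer :: j
    s = 0
    do j = 1, size(x%val)
       s = s + x%val(j)
    end do
  end function sum_plain

  ! Same loop; here the stride is known to be one element.
  integer function sum_contig(x) result(s)
    type(contig_t), intent(in) :: x
    integer :: j
    s = 0
    do j = 1, size(x%val)
       s = s + x%val(j)
    end do
  end function sum_contig

end module contig_demo
```

For example, with buf = [1, 2, ..., 10], p%val => buf(1:10:2) is legal for a plain_t and sum_plain(p) returns 25 (1+3+5+7+9), while c%val => buf gives sum_contig(c) = 55. Associating c%val with buf(1:10:2) would violate the CONTIGUOUS requirement.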

Separately from performance, allocatable components also have "code safety" benefits: their lifetime is managed automatically.
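A short sketch of those lifetime and copy semantics (the module lifetime_demo, the types alloc_t/ptr_t and the make_* functions are hypothetical names). Intrinsic assignment deep-copies an allocatable component, and its storage is freed automatically; assigning a derived type with a pointer component only copies the association, and the target has to be deallocated by hand, exactly once.

```fortran
module lifetime_demo
  implicit none

  type :: alloc_t
     integer, allocatable :: val(:)        ! owned: deep-copied on assignment
  end type alloc_t

  type :: ptr_t
     integer, pointer :: val(:) => null()  ! not owned: assignment copies the association
  end type ptr_t

contains

  ! Build an alloc_t holding 1..n. The component is deallocated
  ! automatically when the variable holding it goes out of scope.
  function make_alloc(n) result(x)
    integer, intent(in) :: n
    type(alloc_t) :: x
    integer :: i
    allocate(x%val(n))
    x%val = [(i, i = 1, n)]
  end function make_alloc

  ! Build a ptr_t holding 1..n. The caller must DEALLOCATE x%val itself
  ! (once, through any one of the pointers to it), or the memory leaks.
  function make_ptr(n) result(x)
    integer, intent(in) :: n
    type(ptr_t) :: x
    integer :: i
    allocate(x%val(n))
    x%val = [(i, i = 1, n)]
  end function make_ptr

end module lifetime_demo
```

After a = make_alloc(3) and b = a, changing b%val(1) leaves a%val(1) unchanged. After p = make_ptr(3) and q = p, q%val and p%val share one target, so a write through q is visible through p, and a single DEALLOCATE through either pointer frees the storage and leaves the other one dangling.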

As a general rule, if you don't need to point at something else, then don't use pointers.

jimdempseyatthecove

IanH,

I ran a variant of eoseret's program in which the val pointer has the CONTIGUOUS attribute. This made only a very slight difference in performance:

SINGLE_INDIR   USE_CONTIGUOUS   time (s, 5000 calls)
undef          undef            3.395
def            undef            3.392
undef          def              3.392
def            def              3.398

[fortran]

!  PtrVsArray.f90
!
!  Times 5000 calls of myroutine, which stores array(val(j)) = 0 through
!  pointer components. Preprocessor macros select the variant:
!    SINGLE_INDIR   - hoist elt%p into a local pointer before the inner loop
!    USE_CONTIGUOUS - give val (and the elts dummy) the CONTIGUOUS attribute
!  Build with the preprocessor (-fpp) and OpenMP enabled (for omp_get_wtime).
!
!****************************************************************************
module mymodule

  type i1_t
#ifdef USE_CONTIGUOUS
     integer, CONTIGUOUS, pointer :: val(:)
#else
     integer, pointer :: val(:)
#endif    
     type(i1_t), pointer :: foo
  end type i1_t

  type elt_t
     integer :: bar
     type (i1_t), pointer :: p
  end type elt_t

contains

  subroutine myroutine(size_outer, size_inner, elts, array)

    implicit none

    integer :: size_outer, size_inner, i, j
#ifdef USE_CONTIGUOUS
    type(elt_t), pointer :: elt
    type(elt_t), CONTIGUOUS, pointer :: elts(:)
#else
    type(elt_t), pointer :: elt, elts(:)
#endif   
    real, pointer :: array(:)
#ifdef SINGLE_INDIR
    type (i1_t), pointer :: elt_p
#endif

    do i = 1, size_outer
       elt => elts(i)
#ifdef SINGLE_INDIR
       elt_p => elt%p
#endif

       do j = 1, size_inner
#ifdef SINGLE_INDIR
          array (elt_p%val(j)) = 0.
#else
          array (elt%p%val(j)) = 0.
#endif
       end do
    end do

  end subroutine myroutine

end module mymodule


program PtrVsArray
  use omp_lib
  use mymodule

  implicit none
  real(8) :: t0
  integer i, j
  integer, target :: ind (1000)
  type(i1_t), target :: i1 (1000)
  type(elt_t), target :: elts (1000)
  real, target :: array (1000)
  type(elt_t), pointer :: elts_ptr(:)
  real, pointer :: array_ptr(:)

  elts_ptr => elts
  array_ptr => array

  do i = 1, 1000
     ind(i) = i
     i1(i)%val => ind
     elts(i)%p => i1(i)
  end do

  t0 = omp_get_wtime()
  do i = 1, 5000
     call myroutine (1000, 1000, elts_ptr, array_ptr)
  end do
  t0 = omp_get_wtime() - t0
  write(*,*) t0
end program PtrVsArray
[/fortran]

Jim Dempsey
