Expanding a dynamically allocated array of a derived type: is there any standard Fortran option to achieve better performance?

FortranFan · ‎11-02-2016

Say one has a dynamically allocated array of a derived type of size N and the array needs to be expanded in order to add a few more elements to the back of the array while retaining all the existing values: is there any standard Fortran option that can yield better performance than the one described in Modern Fortran Explained by Metcalf et al.?

Per Metcalf et al. (see section 15.5.3 on Transferring an allocation in Modern Fortran Explained), starting with Fortran 2003 and the introduction of MOVE_ALLOC intrinsic that reduces 2 copy steps to one, the recommended approach on changing the size of an dynamically allocated array appears to be to first allocate a temporary array with the desired new shape, copy the existing array onto the temporary, add desired new elements, and employ MOVE_ALLOC to transfer the allocation of temporary to the actual array. The canonical sequence is:

.. 
type(foo_t), allocatable :: foo(:)   !** foo_t mainly holds an array of real64 
type(foo_t), allocatable :: tmp(:)   !   intrinsic type 
.. 
! say foo is currently allocated with size N 
allocate (tmp(M))                    ! M on the order of 2*N 
tmp(1:N) = foo                       ! copy existing values of foo 
tmp(..) = ..                         ! add new elements 
call move_alloc( from=tmp, to=foo )  ! transfer allocation back to foo 
..

The above approach is straight-forward but it can become quite time-consuming as foo gets bigger and bigger. Is there a better option than that described above?

More specifically, is there any approach with standard Fortran that can provide a couple of orders of magnitude or better performance than that noticed using the above approach?

For a use case I have where foo gets appreciably large relative to system resources (i.e., N becomes large and so does the data held by each element of foo), the above shown sequence can take anywhere from 2 to 100 CPU seconds for a case being tried: the copy operation to transfer existing elements of foo to the temporay array as well as the MOVE_ALLOC step take up most of this time, as one would expect, with the copy step taking up most of the time. This then becomes a performance bottleneck for the task at hand.

It is desired to bring this time down to 0.01 CPU seconds and make it relatively independent of the size of foo array size! That is, what is being sought is two orders of magnitude or better in performance improvement for the operation of adding more elements to the end of a dynamically allocated array of a derived type! Is this even possible to achieve this using *standard Fortran* (i.e., no OpenMP/MPI or vendor-specific directives, etc. are to be considered at this stage) while retaining the basic design where foo remains an ALLOCATABLE rank-one array of foo_t derived type?

Thanks much in advance,

jimdempseyatthecove · ‎11-03-2016

type foo_t
  ...
end type foo_t

type p_foo_t
  type(foo_t), pointer :: p
end type p_foo_t

type(p_foo_t), allocatable :: p_foo(:) !** holds pointers to foo_t

You can reduce the amount of data copied and allocated by managing pointers to foo_t ...
... and it does not cost much to over allocate the pointer array and fill with NULL pointers thus reducing the number of times you extend the array.

Jim Dempsey

mecej4 · ‎11-03-2016

You need to answer a critical question first: Do the calculations involved in Line-8 depend on the elements being moved to their correct positions in Line-7? If not, keep the chunks distinct until you have finished forming all of them, and do a final gather operation in which you consolidate all the chunks into the final destination.

As you have it now, if there are D doubling steps, Line-7 involves copying N items D-1 times, 2N items D-2 times, 4N items D-3 times, ...2^mN items D-m-1 times, ... What I suggested would replace all this, plus the overhead of MOVE_ALLOC, with a final copy/move phase of 2^DN items.

The suggestion needs to be refined for modern computers with multiple levels of cache memory.

It is desired to bring this time down to 0.01 CPU seconds and make it relatively independent of the size of foo array size!

This seems to be an arbitrary wish with no basis, no relation to the computing facilities available or the complexity of the problem being solved. Why not wish for 0.01 milliseconds while you are at it?

jimdempseyatthecove · ‎11-03-2016

FWIW - mecej4's suggestion (chunking) is effectively performed by overallocating the pointer array.

If for stylistic reasons (you do not like using p_foo(I)%p%... or foo(I)%p%...) then after the working number of foo_t has been established, you can then allocate the required array of foo_t and perform the copy once.

An additional side effect of using the pointer arrays can be found on larger NUMA based systems...
... provided you code for the computational thread using the node is the same as the allocation thread.

Jim Dempsey

FortranFan · ‎11-04-2016

jimdempseyatthecove wrote:

.. You can reduce the amount of data copied and allocated by managing pointers to foo_t ...
... and it does not cost much to over allocate the pointer array and fill with NULL pointers thus reducing the number of times you extend the array. ..

Jim, mecej4;

Thanks much for your feedback.

Fyi, the situation described in the original post was something I have been wanting to check for a while, especially because there was an application prototype for a manufacturing plant data analysis that a team, which used to sit close to me, were developing some time ago and where they had quickly concluded C++ was the best option for them because of the high performance requirements of the application. I had listened to some of their talks about the 'move semantics' introduced with C++ 11 (you are probably quite familiar with it e.g., https://en.wikipedia.org/wiki/Move_assignment_operator) and the smart pointers with C++ 14 (which, by the way, look to me very much like ALLOCATABLEs in Fortran) and so on and how these 'modern C++' features were helping the application achieve very high performance in their various helper classes that needed to quickly construct and dispose of object instances which processed varying streams of data whose size would be appreciable relative to system resources. I had few details, but I had recalled seeing something about the importance of rapid 'push_back' operations with this application where a few additional chunks of data need to be added at the end of a fairly large data set. I was just trying to mimic such a scenario in the simplest of manner using Fortran and thus my original post; given this background, I wanted to avoid the use of pointers as much as possible. But note, it is just a thought exercise.

mecej4 wrote:

FortranFan wrote:

.. It is desired to bring this time down to 0.01 CPU seconds and make it relatively independent of the size of foo array size!..

.. This seems to be an arbitrary wish with no basis, no relation to the computing facilities available or the complexity of the problem being solved. Why not wish for 0.01 milliseconds while you are at it? ..

So yes, the above is indeed an arbitrary wish, one that is based on what I noticed to be the performance of a similar operation (during my simple-minded attempt) using C/C++.

What I was coming to was this: if the user-defined derived type (FOO_T in the original post) is designed appropriately with components that of ALLOCATABLE type and if suitable 'move' operations are made available, then a sequence container (an array being the simplest example) for the type can be resized without much overhead. It is nothing complicated, in fact it's quite trivial, but which coders might overlook thereby incurring significant overhead during such an operation. In the case of C++, it would appear the 'good practices' that coders should take with their class design has been established to some extent with the Rule of FIve but it's unclear if any such recommendations exist for Fortranners. For many Fortran coders who are considering the MOVE_ALLOC intrinsic might only notice the standard or the book by Metcalf et al. and not much else in terms of 'formal' guidance.

So to elaborate on this a bit further, shown below are some results with a rather simple code example which is listed further down in this post. The premise is this, instead of following the book example, if one employs the MOVE_ALLOC intrinsic for the component(s) of the FOO_T derived type in the original post, considerable improvements can be noticed.

In the code below, two sequences are attempted: 1) resize_copy, a 'contain'ed procedure that follows the steps shown by Metcalf et al. and which were described in the original post and 2) resize_move that in turn does the 'move' on the component(s) instead of copy. Readers will notice the performance differences between the two approaches.

Here're some execution results with x64 target on a Windows 7 workstation with 16 GB RAM:

Container       CPU Time        CPU Time   
Size            resize_copy     resize_move 
(approx. GB)    (seconds)       (seconds)   
-------------------------------------------   
1               0.33            <0.01     
4               2.8             <0.01       
8               160             <0.01

Here's the simple code for this:

module mykinds_m 

   use, intrinsic :: iso_fortran_env, only : I8 => int64, WP => real64 

   implicit none 

end module

module foo_m 

   use mykinds_m, only : WP 

   implicit none 

   private 

   type, public :: foo_t 
      real(WP), allocatable :: dat(:) 
   contains 
      procedure, pass(this), public :: set => set_foo 
   end type foo_t 

   public :: move_foo_dat 

contains 

   elemental subroutine move_foo_dat( rhs, lhs ) 

      type(foo_t), intent(inout) :: rhs 
      type(foo_t), intent(inout) :: lhs 

      call move_alloc( from=rhs%dat, to=lhs%dat ) 

      return 

   end subroutine move_foo_dat 

   impure elemental subroutine set_foo( this, n ) 

      class(foo_t), intent(inout) :: this 
      integer, intent(in)         :: n 

      if ( allocated(this%dat) ) then 
         deallocate( this%dat) 
      end if 

      allocate( this%dat(n) ) 
      call random_number( this%dat ) 

      return 

   end subroutine set_foo 

end module foo_m

program p 

   use mykinds_m, only : I8, WP 
   use foo_m, only : foo_t, move_foo_dat 

   implicit none 

   integer, parameter :: N = 100000 
   integer, parameter :: DATA_SIZE = 15000 
   type(foo_t), allocatable :: foo(:) 

   print *, "Initial array size, N = ", N 
   print *, "Size of data in each array element, DATA_SIZE = ", DATA_SIZE 

   ! Set up foo array 
   allocate( foo(N) ) 
   call foo%set( DATA_SIZE ) 

   print *, "Before reallocation:" 
   print *, "size(foo) = ", size(foo) 
   print *, "foo(N)%dat(1) = ", foo(N)%dat(1) 
   print * 

   !call resize_copy() 
   call resize_move() 

   print * 
   print *, "After reallocation:" 
   print *, "size(foo) = ", size(foo) 
   print *, "foo(N)%dat(1) = ", foo(N)%dat(1) 

   stop 

contains 

   subroutine resize_copy() 

      type(foo_t), allocatable :: tmp(:) 
      real(WP) :: start_time 
      real(WP) :: end_time 

      call my_cpu_time( start_time ) 

      ! Canonical sequence for array growth per Metcalf et al. 
      allocate( tmp(2*N) ) 
      tmp(1:N) = foo 
      call move_alloc( from=tmp, to=foo ) 

      call my_cpu_time( end_time ) 

      print "(*(g0.4))", "Reallocation sequence: CPU time = ", (end_time - start_time),             & 
                         " seconds." 

      return 

   end subroutine resize_copy 

   subroutine resize_move() 

      type(foo_t), allocatable :: tmp(:) 
      real(WP) :: start_time 
      real(WP) :: end_time 

      call my_cpu_time( start_time ) 

      ! Possible sequence for array growth for types with ALLOCATABLE components 
      allocate( tmp(2*N) ) 
      call move_foo_dat( foo, tmp(1:N) ) 
      call move_alloc( from=tmp, to=foo ) 

      call my_cpu_time( end_time ) 

      print "(*(g0.4))", "Reallocation sequence: CPU time = ", (end_time - start_time),             & 
                         " seconds." 

      return 

   end subroutine resize_move 

   subroutine my_cpu_time( time ) 

      !.. Argument list 
      real(WP), intent(inout) :: time 

      !.. Local variables 
      integer(I8) :: tick 
      integer(I8) :: rate 

      call system_clock (tick, rate) 

      time = real(tick, kind=kind(time) ) / real(rate, kind=kind(time) ) 

      return 

   end subroutine my_cpu_time 

end program p

andrew_4619 · ‎11-04-2016

Thanks for the really good example using 'modern methods'. Faced with that problem I would have done the resize_copy method a la Metcalf without a seconds further thought. I like the use of the elemental procedure, very neat. Thanks again for taking the time.