Say one has a dynamically allocated array of a derived type of size N and the array needs to be expanded in order to add a few more elements to the back of the array while retaining all the existing values: is there any standard Fortran option that can yield better performance than the one described in Modern Fortran Explained by Metcalf et al.?
Per Metcalf et al. (see section 15.5.3 on Transferring an allocation in Modern Fortran Explained), starting with Fortran 2003 and the introduction of MOVE_ALLOC intrinsic that reduces 2 copy steps to one, the recommended approach on changing the size of an dynamically allocated array appears to be to first allocate a temporary array with the desired new shape, copy the existing array onto the temporary, add desired new elements, and employ MOVE_ALLOC to transfer the allocation of temporary to the actual array. The canonical sequence is:
.. type(foo_t), allocatable :: foo(:) !** foo_t mainly holds an array of real64 type(foo_t), allocatable :: tmp(:) ! intrinsic type .. ! say foo is currently allocated with size N allocate (tmp(M)) ! M on the order of 2*N tmp(1:N) = foo ! copy existing values of foo tmp(..) = .. ! add new elements call move_alloc( from=tmp, to=foo ) ! transfer allocation back to foo ..
The above approach is straight-forward but it can become quite time-consuming as foo gets bigger and bigger. Is there a better option than that described above?
More specifically, is there any approach with standard Fortran that can provide a couple of orders of magnitude or better performance than that noticed using the above approach?
For a use case I have where foo gets appreciably large relative to system resources (i.e., N becomes large and so does the data held by each element of foo), the above shown sequence can take anywhere from 2 to 100 CPU seconds for a case being tried: the copy operation to transfer existing elements of foo to the temporay array as well as the MOVE_ALLOC step take up most of this time, as one would expect, with the copy step taking up most of the time. This then becomes a performance bottleneck for the task at hand.
It is desired to bring this time down to 0.01 CPU seconds and make it relatively independent of the size of foo array size! That is, what is being sought is two orders of magnitude or better in performance improvement for the operation of adding more elements to the end of a dynamically allocated array of a derived type! Is this even possible to achieve this using *standard Fortran* (i.e., no OpenMP/MPI or vendor-specific directives, etc. are to be considered at this stage) while retaining the basic design where foo remains an ALLOCATABLE rank-one array of foo_t derived type?
Thanks much in advance,
type foo_t ... end type foo_t type p_foo_t type(foo_t), pointer :: p end type p_foo_t type(p_foo_t), allocatable :: p_foo(:) !** holds pointers to foo_t
You can reduce the amount of data copied and allocated by managing pointers to foo_t ...
... and it does not cost much to over allocate the pointer array and fill with NULL pointers thus reducing the number of times you extend the array.
You need to answer a critical question first: Do the calculations involved in Line-8 depend on the elements being moved to their correct positions in Line-7? If not, keep the chunks distinct until you have finished forming all of them, and do a final gather operation in which you consolidate all the chunks into the final destination.
As you have it now, if there are D doubling steps, Line-7 involves copying N items D-1 times, 2N items D-2 times, 4N items D-3 times, ...2mN items D-m-1 times, ... What I suggested would replace all this, plus the overhead of MOVE_ALLOC, with a final copy/move phase of 2DN items.
The suggestion needs to be refined for modern computers with multiple levels of cache memory.
It is desired to bring this time down to 0.01 CPU seconds and make it relatively independent of the size of foo array size!This seems to be an arbitrary wish with no basis, no relation to the computing facilities available or the complexity of the problem being solved. Why not wish for 0.01 milliseconds while you are at it?
FWIW - mecej4's suggestion (chunking) is effectively performed by overallocating the pointer array.
If for stylistic reasons (you do not like using p_foo(I)%p%... or foo(I)%p%...) then after the working number of foo_t has been established, you can then allocate the required array of foo_t and perform the copy once.
An additional side effect of using the pointer arrays can be found on larger NUMA based systems...
... provided you code for the computational thread using the node is the same as the allocation thread.
.. You can reduce the amount of data copied and allocated by managing pointers to foo_t ...
... and it does not cost much to over allocate the pointer array and fill with NULL pointers thus reducing the number of times you extend the array. ..
Thanks much for your feedback.
Fyi, the situation described in the original post was something I have been wanting to check for a while, especially because there was an application prototype for a manufacturing plant data analysis that a team, which used to sit close to me, were developing some time ago and where they had quickly concluded C++ was the best option for them because of the high performance requirements of the application. I had listened to some of their talks about the 'move semantics' introduced with C++ 11 (you are probably quite familiar with it e.g., https://en.wikipedia.org/wiki/
.. It is desired to bring this time down to 0.01 CPU seconds and make it relatively independent of the size of foo array size!..
.. This seems to be an arbitrary wish with no basis, no relation to the computing facilities available or the complexity of the problem being solved. Why not wish for 0.01 milliseconds while you are at it? ..
So yes, the above is indeed an arbitrary wish, one that is based on what I noticed to be the performance of a similar operation (during my simple-minded attempt) using C/C++.
What I was coming to was this: if the user-defined derived type (FOO_T in the original post) is designed appropriately with components that of ALLOCATABLE type and if suitable 'move' operations are made available, then a sequence container (an array being the simplest example) for the type can be resized without much overhead. It is nothing complicated, in fact it's quite trivial, but which coders might overlook thereby incurring significant overhead during such an operation. In the case of C++, it would appear the 'good practices' that coders should take with their class design has been established to some extent with the Rule of FIve but it's unclear if any such recommendations exist for Fortranners. For many Fortran coders who are considering the MOVE_ALLOC intrinsic might only notice the standard or the book by Metcalf et al. and not much else in terms of 'formal' guidance.
So to elaborate on this a bit further, shown below are some results with a rather simple code example which is listed further down in this post. The premise is this, instead of following the book example, if one employs the MOVE_ALLOC intrinsic for the component(s) of the FOO_T derived type in the original post, considerable improvements can be noticed.
In the code below, two sequences are attempted: 1) resize_copy, a 'contain'ed procedure that follows the steps shown by Metcalf et al. and which were described in the original post and 2) resize_move that in turn does the 'move' on the component(s) instead of copy. Readers will notice the performance differences between the two approaches.
Here're some execution results with x64 target on a Windows 7 workstation with 16 GB RAM:
Container CPU Time CPU Time Size resize_copy resize_move (approx. GB) (seconds) (seconds) ------------------------------------------- 1 0.33 <0.01 4 2.8 <0.01 8 160 <0.01
Here's the simple code for this:
module mykinds_m use, intrinsic :: iso_fortran_env, only : I8 => int64, WP => real64 implicit none end module
module foo_m use mykinds_m, only : WP implicit none private type, public :: foo_t real(WP), allocatable :: dat(:) contains procedure, pass(this), public :: set => set_foo end type foo_t public :: move_foo_dat contains elemental subroutine move_foo_dat( rhs, lhs ) type(foo_t), intent(inout) :: rhs type(foo_t), intent(inout) :: lhs call move_alloc( from=rhs%dat, to=lhs%dat ) return end subroutine move_foo_dat impure elemental subroutine set_foo( this, n ) class(foo_t), intent(inout) :: this integer, intent(in) :: n if ( allocated(this%dat) ) then deallocate( this%dat) end if allocate( this%dat(n) ) call random_number( this%dat ) return end subroutine set_foo end module foo_m
program p use mykinds_m, only : I8, WP use foo_m, only : foo_t, move_foo_dat implicit none integer, parameter :: N = 100000 integer, parameter :: DATA_SIZE = 15000 type(foo_t), allocatable :: foo(:) print *, "Initial array size, N = ", N print *, "Size of data in each array element, DATA_SIZE = ", DATA_SIZE ! Set up foo array allocate( foo(N) ) call foo%set( DATA_SIZE ) print *, "Before reallocation:" print *, "size(foo) = ", size(foo) print *, "foo(N)%dat(1) = ", foo(N)%dat(1) print * !call resize_copy() call resize_move() print * print *, "After reallocation:" print *, "size(foo) = ", size(foo) print *, "foo(N)%dat(1) = ", foo(N)%dat(1) stop contains subroutine resize_copy() type(foo_t), allocatable :: tmp(:) real(WP) :: start_time real(WP) :: end_time call my_cpu_time( start_time ) ! Canonical sequence for array growth per Metcalf et al. allocate( tmp(2*N) ) tmp(1:N) = foo call move_alloc( from=tmp, to=foo ) call my_cpu_time( end_time ) print "(*(g0.4))", "Reallocation sequence: CPU time = ", (end_time - start_time), & " seconds." return end subroutine resize_copy subroutine resize_move() type(foo_t), allocatable :: tmp(:) real(WP) :: start_time real(WP) :: end_time call my_cpu_time( start_time ) ! Possible sequence for array growth for types with ALLOCATABLE components allocate( tmp(2*N) ) call move_foo_dat( foo, tmp(1:N) ) call move_alloc( from=tmp, to=foo ) call my_cpu_time( end_time ) print "(*(g0.4))", "Reallocation sequence: CPU time = ", (end_time - start_time), & " seconds." return end subroutine resize_move subroutine my_cpu_time( time ) !.. Argument list real(WP), intent(inout) :: time !.. Local variables integer(I8) :: tick integer(I8) :: rate call system_clock (tick, rate) time = real(tick, kind=kind(time) ) / real(rate, kind=kind(time) ) return end subroutine my_cpu_time end program p
Thanks for the really good example using 'modern methods'. Faced with that problem I would have done the resize_copy method a la Metcalf without a seconds further thought. I like the use of the elemental procedure, very neat. Thanks again for taking the time.