Fastest Way to Initialize Variables

Robert · ‎08-09-2019

Most Efficient Way To Initialize Variables.

Say one has the following variable:

doubleprecision, dimension(6,10000) :: dblResults = 0.0d0

Every analysis cycle one needs to re-zero the dblResults array. The first dimension of 6 is always used to its full extent. The second dimension is “number of members” and is set at 10000 as an upper limit that should never be reached. Assume in this run of the analysis the first 100 positions of the second dimension are actually used, i.e. intNumMem = 100

! which of the following options runs fastest to re-zero the dblResults array for the next analysis cycle and is the preferred way to write the code. (Or is there yet a different and even better way to re-zero the array?)

! option 1

dblResults = 0.0d0

! option 2

dblResults(1:6, 1:intNumMem) = 0.0d0

! option 3

do i = 1, 6

do j = 1, intNumMem

dblResults (i, j) = 0.0d0

end do

! option 4

forall (i=1:6, j=1,intNumMem)

dblResults(i, j) = 0.0d0

end forall

Thank you very much in advance for your comments.

Bob

Robert · ‎08-09-2019

! Option 5 (yet another idea to re-zero the dblResults array for the next use.)

Where (dblResults /= 0.0d0)

dblResults = 0.0d0

End Where

andrew_4619 · ‎08-09-2019

Well option 1 is by far the neatest code and gets my vote. The actual 'saving' by any other method is really going to be very small so why worry about it! When ever I have made timing tests of such things I have had to repeat a very large number of times so be able to see a meaningful time difference like the blink of an eye sort of time.

A better bet is if you can determine at run time how big intNumMem needs to be then make the array allocatable and of the correct size.

jimdempseyatthecove · ‎08-09-2019

andrew_4619>> The actual 'saving' by any other method is really going to be very small

Not when

Bob>> “number of members” and is set at 10000 as an upper limit that should never be reached. Assume in this run of the analysis the first 100 positions of the second dimension are actually used, i.e. intNumMem = 100

Option 2 should produce the best results. (should, though check with VTune)

If the time consumed is really an issue, then experiment with passing the contiguous slice of the 2D array to a subroutine taking the shape of a 1D array with size of the 6*intNumMem passed in (IOW as you may have done in Fortran 77). *** This would only be considered if (when) the compiler optimization did NOT use the AVXnnn instructions (or intel_fast_... intrinsic) to wipe the contiguous array (slice).

Jim Dempsey

andrew_4619 · ‎08-10-2019

jimdempseyatthecove (Blackbelt) wrote:
andrew_4619>> The actual 'saving' by any other method is really going to be very small
Not when
Bob>> “number of members” and is set at 10000 as an upper limit that should never be reached. Assume in this run of the analysis the first 100 positions of the second dimension are actually used, i.e. intNumMem = 100

My point was Jim that 1000*verylittletime is still verylittletime to a human. If speed is a real issue using an allocated array of the right size makes more sense to me.

FortranFan · ‎08-10-2019

@Robert,

See excellent responses by Andrew and Jim, both of which can guide you well in terms of 2 considerations: code readability-maintenance and performance. As advised, you may want to look into utilizing the ALLOCATABLE facility that can help you work with right-sized datasets. If that is not possible, you can consider your Option 2, or a variant

dblResults(:,1:intNumMem) = 0.0d0

which informs a reader of your code (who may be you yourself in a future incarnation!) clearly of the array section that is zeroed out!

Robert · ‎08-11-2019

Thank y’all for the replies. Fortran has several ways to accomplish equivalent tasks, often for legacy code compatibility. It is unclear sometimes how different code affects speed or what is preferred coding practice.

Steve_Lionel · ‎08-11-2019

My preference would be for Option 1. The compiler can implement this very efficiently. Option 2 would make sense only if performance analysis showed this initialization to be a bottleneck, which I very much doubt.

Option 3 should have the loops reversed, though the compiler will probably do that for you. Option 4 is probably next to worst (and FORALL is deprecated), option 5 would be worst.

JAlexiou · ‎08-12-2019

Steve Lionel (Ret.) (Blackbelt) wrote:
My preference would be for Option 1. The compiler can implement this very efficiently. Option 2 would make sense only if performance analysis showed this initialization to be a bottleneck, which I very much doubt.
Option 3 should have the loops reversed, though the compiler will probably do that for you. Option 4 is probably next to worst (and FORALL is deprecated), option 5 would be worst.

I agree. In my testing, the second way is the slowest (maybe). Although the compiler is smart enough to optimize all the situations equally (if possible).

Consider the following four subroutines compiled with ifort 19.0 linux with -O2 -mavx command line

pure subroutine fill1(a,x)
real(8), intent(out) :: a(:,:)
real(8), intent(in) :: x
integer :: n, m, i, j
    a = x
end subroutine
pure subroutine fill2(a,x)
real(8), intent(out) :: a(:,:)
real(8), intent(in) :: x
integer :: n, m, i, j
    n = size(a,1)
    m = size(a,2)
    a(1:n,1:m) = x
end subroutine
pure subroutine fill3(a,x)
real(8), intent(out) :: a(:,:)
real(8), intent(in) :: x
integer :: n, m, i, j
    n = size(a,1)
    m = size(a,2)
    do j=1, m
        do i=1, n
            a(i,j) = x
        end do
    end do
end subroutine
pure subroutine fill4(a,x)
real(8), intent(out) :: a(:,:)
real(8), intent(in) :: x
integer :: n, m, i, j
    n = size(a,1)
    m = size(a,2)
    forall(j=1:m, i=1:n) a(i,j)=x
end subroutine

produces the following timings (in function calls per second):

76923.08       58823.53       83333.34       76923.08

but the results depend on the order in which the above is called. I have included a burn-in calculation, as well as avoiding the elimination of unused code by doing something with `a` between calls.

Online code & compiler here https://godbolt.org/z/YtNVPn

FortranFan · ‎08-12-2019

JAlexiou wrote:
.. I agree. In my testing ..

This test is not at all useful: on most computing environments in this day and age the test, as coded, will simply yield the same values for 'tic' and 'toc' and effectively lead to an output of "Infinity" for the number of functions per second for all 4 cases. This reader should pay attention to Andrew's point in Quote #3, "When ever I have made timing tests of such things I have had to repeat a very large number of times so be able to see a meaningful time difference like the blink of an eye sort of time."

jimdempseyatthecove · ‎08-12-2019

For your test to be somewhat of value:

module foo
  doubleprecision, dimension(6,10000) :: dblResults = 0.0d0
  contains
  ! your fill subroutines here
  ...
 ! timing routine
subroutine timeIt(a)
  use foo
  use omp_lib
  integer :: intNumMem
  doubleprecison :: a(:,:)
  doubleprecision :: T0, T1
  T0 = omp_get_wtime()
  call fill1(a, 1234.5D0)
  T1 = omp_get_wtime()
  print *, size(a,1), size(a,2), T1-T0
! ditto for fill2, ..., fill4
  ...
end subroutine timeIt

end module foo

program
  use foo
  integer :: intNumMem
  intNumMem = 10000
  call timeIt(intNumMem) ! throw away this time
  do intNumMem = 10000,1,-1000
    call timeIt(dblResults(1:6,1:intNumMem))
    ! assure optimizer does not elede code
    if(sum(dblResults) == -1.0D0) print *, "Should not print this"
  end do
end program
....

The above is sketch (you debug)

Jim Dempsey

jimdempseyatthecove · ‎08-12-2019

Note, compare the fill1 with a(1:6,1:10000) to all the rest (to meet the timing question as relating to your post #1)

Jim Dempsey

Andriy · ‎08-15-2019

Just my two cents.

The best performance will be achieved for the following code:

real(8), allocatable :: dblResults(:,:)
integer :: intNumMem , j

allocate(dblResults(10000,6))

intNumMem  = 100

do j = 1, 6
    dblResults(1:intNumMem , j) = 0.d0
end do

The key here is to have the largest dimension first as this is how Fortran allocates the memory - column by column. You want to access your results in the same way as they are allocated. If you do everything correctly, intel compiler will replace your loops with a library function.

You possibly could further improve performance by vectorizing your code.

Second, somebody had subroutine parameters as

real(8), intent(out) :: a(:,:)

Really bad idea. For out parameters Fortran is allowed to reallocate the memory which becomes a performance issue for large arrays.

Use in or inout for arrays and it will never fail you.

Steve_Lionel · ‎08-15-2019

INTENT(OUT) only matters for allocatable array dummy arguments, where the array will be deallocated if it is allocated on entry. If your procedure just writes the array, and you don't want (re)allocation, don't use ALLOCATABLE when declaring the dummy argument.

jimdempseyatthecove · ‎08-15-2019

Andriy,

The first dimension is the closest proximity of cells, the last dimension is the furthest. IOW

dblResults(I,J) is adjacent to dblResults(I+1,J)

whereas

dblResults(I,J) is separated from dblResults(I,J+1) by the number of elements of dim(dblResults,1) * sizeof(dblResults(1,1))

Therefore, for a partially filled array, i.e. part of the 10000 dimension, when this dimension comes last, the used elements are contiguous.
Should the 10000 dimension come first, say 100 of 10000, the memory usage would be:

100 cells used, 9900 cells skipped, 100 cells used, 9900 cells skipped, 100 cells used, 9900 cells skipped, 100 cells used, 9900 cells skipped, 100 cells used, 9900 cells skipped, 100 cells used

Processing the array in this manner would require:

6x more peel and remainder code segments (the inner and outer loop cannot be fused)
Additional memory pages accessed: from 2 required to 6:12 required (assuming 4KB page size)

While the number of pages required might not incur significant overhead in a simple test program, it may be significant in the actual application. A TLB miss has significant overhead.

Jim Dempsey

gib · ‎08-15-2019

jimdempseyatthecove (Blackbelt) wrote:
dblResults(I,J) is adjacent to dblResults(I+1,J)
whereas
dblResults(I,J) is separated from dblResults(I,J+1) by the number of elements of dim(dblResults,1) * sizeof(dblResults(1,1))
Therefore, for a partially filled array, i.e. part of the 10000 dimension, when this dimension comes last, the used elements are contiguous.

Either my brain is atrophying, or this doesn't make sense. Surely if dblResults(I,J) is adjacent to dblResults(I+1,J), then

dblResults(1:100,J) refers to 100 contiguous locations. Am I out to lunch?

Gib

TimP · ‎08-16-2019

You haven't discussed whether you want an answer independent of choice of compiler or optimization flags. If you use ifort with default options, I would expect the better options to invoke a library memset function call, on the assumption that a majority of the array will be reset. The features of this include automatic alignment adjustment and use of non temporal SSE or AVX instructions. A possible scenario where this may not be optimum would be where you could keep those arrays in cache as you alternate between setting and reusing them. In that case, you might wish to prevent nontemporal stores by directive, although that may not work for the nested loops.

By the way, I have been blocked from reaching the login server from my laptop for weeks now. It's a pain to have access only on the phone. I don't know if that is a feature of the politics of our state (internet access is still political here).

jimdempseyatthecove · ‎08-17-2019

While dblResults(1:100,J) refers to 100 contiguous locations,
dblResults(1:100,J+1) leaps 10000-100 cells from dblResults(1:100,J), well into some other one/two page region of memory.

Thus as you iterate the used portion of the array you have periodic leaps.

Indexed the other way around (and coding with the left most index as the inner most loop) permits any variable portion of the array to be used in a contiguous manner. This will reduce the number of TLB's required to map the used portion of the array, and when accessing the used portion as a whole, will eliminate the inter group loop peel and remainder processing. TLB's are Translation Look-aside Buffers. Each CPU design has a limited number of these.

https://en.wikipedia.org/wiki/Thrashing_(computer_science)#TLB_thrashing:

TLB thrashing

Where the translation lookaside buffer (TLB) acting as a cache for the memory management unit (MMU) which translates virtual addresses to physical addresses is too small for the working set of pages. TLB thrashing can occur even if instruction cache or data cache thrashing are not occurring, because these are cached in different sizes. Instructions and data are cached in small blocks (cache lines), not entire pages, but address lookup is done at the page level. Thus even if the code and data working sets fit into cache, if the working sets are fragmented across many pages, the virtual address working set may not fit into TLB, causing TLB thrashing.

And as an example Xeon Gold 6130 (from http://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%206154.html)

64-byte Prefetching
Data TLB: 1-GB pages, 4-way set associative, 4 entries
Data TLB: 4-KB Pages, 4-way set associative, 64 entries
Instruction TLB: 4-KByte pages, 8-way set associative, 64 entries
L2 TLB: 1-MB, 4-way set associative, 64-byte line size
Shared 2nd-Level TLB: 4-KB / 2-MB pages, 6-way associative, 1536 entries. Plus, 1-GB pages, 4-way, 16 entries

Shared amongst 18 cores (~85)

Older Server CPU's had fewer (~64) and desktops even fewer.

Jim Dempsey