Most Efficient Way To Initialize Variables.
Say one has the following variable:
doubleprecision, dimension(6,10000) :: dblResults = 0.0d0
Every analysis cycle, one needs to re-zero the dblResults array. The first dimension of 6 is always used to its full extent. The second dimension is “number of members” and is set at 10000 as an upper limit that should never be reached. Assume that in this run of the analysis the first 100 positions of the second dimension are actually used, i.e. intNumMem = 100.
! Which of the following options runs fastest to re-zero the dblResults array for the next analysis cycle, and which is the preferred way to write the code? (Or is there yet a different and even better way to re-zero the array?)
! option 1
dblResults = 0.0d0
! option 2
dblResults(1:6, 1:intNumMem) = 0.0d0
! option 3
do i = 1, 6
    do j = 1, intNumMem
        dblResults(i, j) = 0.0d0
    end do
end do
! option 4
forall (i=1:6, j=1:intNumMem)
    dblResults(i, j) = 0.0d0
end forall
Thank you very much in advance for your comments.
Bob
! Option 5 (yet another idea to re-zero the dblResults array for the next use)
Where (dblResults /= 0.0d0)
    dblResults = 0.0d0
End Where
Well, option 1 is by far the neatest code and gets my vote. The actual 'saving' from any other method is really going to be very small, so why worry about it! Whenever I have made timing tests of such things, I have had to repeat them a very large number of times to be able to see a meaningful time difference, on the order of the blink of an eye.
A better bet: if you can determine at run time how big intNumMem needs to be, then make the array allocatable and of the correct size.
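A minimal sketch of that approach, assuming intNumMem can be determined before the array is first needed:
double precision, allocatable :: dblResults(:,:)
integer :: intNumMem

intNumMem = 100                       ! determined at run time
allocate(dblResults(6, intNumMem))    ! exactly the size needed
dblResults = 0.0d0                    ! option 1 now touches no unused storage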
andrew_4619>> The actual 'saving' by any other method is really going to be very small
Not when
Bob>> “number of members” and is set at 10000 as an upper limit that should never be reached. Assume in this run of the analysis the first 100 positions of the second dimension are actually used, i.e. intNumMem = 100
Option 2 should produce the best results ("should", that is; check with VTune).
If the time consumed is really an issue, then experiment with passing the contiguous slice of the 2D array to a subroutine that receives it as a 1D array of size 6*intNumMem (IOW, as you may have done in Fortran 77). *** This would only be considered if (when) the compiler optimization did NOT use the AVXnnn instructions (or an intel_fast_... intrinsic) to wipe the contiguous array (slice).
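A minimal sketch of that idea (wipe1d is a hypothetical name; this relies on old-style sequence association):
subroutine wipe1d(a, n)
    integer, intent(in) :: n
    double precision, intent(inout) :: a(n)
    a = 0.0d0
end subroutine wipe1d

! caller: with the first dimension used to its full extent, the
! first 6*intNumMem elements of dblResults form one contiguous block
call wipe1d(dblResults, 6*intNumMem)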
Jim Dempsey
jimdempseyatthecove (Blackbelt) wrote:
andrew_4619>> The actual 'saving' by any other method is really going to be very small
Not when
Bob>> “number of members” and is set at 10000 as an upper limit that should never be reached. Assume in this run of the analysis the first 100 positions of the second dimension are actually used, i.e. intNumMem = 100
My point, Jim, was that 1000*verylittletime is still verylittletime to a human. If speed is a real issue, using an allocated array of the right size makes more sense to me.
@Robert,
See the excellent responses by Andrew and Jim, both of which can guide you well on two considerations: code readability/maintenance and performance. As advised, you may want to look into the ALLOCATABLE facility, which can help you work with right-sized datasets. If that is not possible, you can consider your Option 2, or a variant
dblResults(:,1:intNumMem) = 0.0d0
which informs a reader of your code (who may be you yourself in a future incarnation!) clearly of the array section that is zeroed out!
Thank y’all for the replies. Fortran has several ways to accomplish equivalent tasks, often for legacy-code compatibility. It is sometimes unclear how different code affects speed, or what the preferred coding practice is.
My preference would be for Option 1. The compiler can implement this very efficiently. Option 2 would make sense only if performance analysis showed this initialization to be a bottleneck, which I very much doubt.
Option 3 should have the loops reversed, though the compiler will probably do that for you. Option 4 is probably next to worst (and FORALL is deprecated), option 5 would be worst.
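For reference, option 3 written in memory order, with the leftmost (fastest-varying) index in the inner loop:
do j = 1, intNumMem
    do i = 1, 6
        dblResults(i, j) = 0.0d0
    end do
end do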
Steve Lionel (Ret.) (Blackbelt) wrote:
My preference would be for Option 1. The compiler can implement this very efficiently. Option 2 would make sense only if performance analysis showed this initialization to be a bottleneck, which I very much doubt.
Option 3 should have the loops reversed, though the compiler will probably do that for you. Option 4 is probably next to worst (and FORALL is deprecated), option 5 would be worst.
I agree. In my testing, the second way is the slowest (maybe), although the compiler is smart enough to optimize all the variants equally where possible.
Consider the following four subroutines, compiled with ifort 19.0 on Linux with -O2 -mavx on the command line:
pure subroutine fill1(a,x)
    real(8), intent(out) :: a(:,:)
    real(8), intent(in) :: x
    integer :: n, m, i, j
    a = x
end subroutine

pure subroutine fill2(a,x)
    real(8), intent(out) :: a(:,:)
    real(8), intent(in) :: x
    integer :: n, m, i, j
    n = size(a,1)
    m = size(a,2)
    a(1:n,1:m) = x
end subroutine

pure subroutine fill3(a,x)
    real(8), intent(out) :: a(:,:)
    real(8), intent(in) :: x
    integer :: n, m, i, j
    n = size(a,1)
    m = size(a,2)
    do j = 1, m
        do i = 1, n
            a(i,j) = x
        end do
    end do
end subroutine

pure subroutine fill4(a,x)
    real(8), intent(out) :: a(:,:)
    real(8), intent(in) :: x
    integer :: n, m, i, j
    n = size(a,1)
    m = size(a,2)
    forall(j=1:m, i=1:n) a(i,j) = x
end subroutine
produces the following timings (in function calls per second):
76923.08 58823.53 83333.34 76923.08
but the results depend on the order in which the subroutines are called. I have included a burn-in calculation, as well as avoided dead-code elimination by doing something with `a` between calls.
Online code & compiler here https://godbolt.org/z/YtNVPn
JAlexiou wrote:
.. I agree. In my testing ..
This test is not at all useful: on most computing environments in this day and age the test, as coded, will simply yield the same values for 'tic' and 'toc' and effectively lead to an output of "Infinity" for the number of function calls per second in all 4 cases. The reader should pay attention to Andrew's point in Quote #3: "Whenever I have made timing tests of such things, I have had to repeat them a very large number of times to be able to see a meaningful time difference, on the order of the blink of an eye."
For your test to be of some value:
module foo
    doubleprecision, dimension(6,10000) :: dblResults = 0.0d0
contains
    ! your fill subroutines here
    ...
    ! timing routine
    subroutine timeIt(a)
        use omp_lib
        doubleprecision :: a(:,:)
        doubleprecision :: T0, T1
        T0 = omp_get_wtime()
        call fill1(a, 1234.5D0)
        T1 = omp_get_wtime()
        print *, size(a,1), size(a,2), T1-T0
        ! ditto for fill2, ..., fill4
        ...
    end subroutine timeIt
end module foo

program main
    use foo
    integer :: intNumMem
    intNumMem = 10000
    call timeIt(dblResults(1:6,1:intNumMem)) ! throw away this time
    do intNumMem = 10000, 1, -1000
        call timeIt(dblResults(1:6,1:intNumMem))
        ! assure the optimizer does not elide the code
        if (sum(dblResults) == -1.0D0) print *, "Should not print this"
    end do
end program main
The above is a sketch (you debug).
Jim Dempsey
Note: compare fill1 with a(1:6,1:10000) against all the rest (to match the timing question as posed in your post #1).
Jim Dempsey
Just my two cents.
The best performance will be achieved for the following code:
real(8), allocatable :: dblResults(:,:)
integer :: intNumMem, j

allocate(dblResults(10000,6))
intNumMem = 100
do j = 1, 6
    dblResults(1:intNumMem, j) = 0.d0
end do
The key here is to have the largest dimension first, as this is how Fortran lays out the memory: column by column. You want to access your results in the same order as they are stored. If you do everything correctly, the Intel compiler will replace your loops with a library function.
You could possibly further improve performance by vectorizing your code.
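As one hedged illustration of that, using the standard OpenMP SIMD directive (compile with -qopenmp-simd; zero_columns is a hypothetical helper name):
subroutine zero_columns(dblResults, intNumMem)
    real(8), intent(inout) :: dblResults(:,:)
    integer, intent(in) :: intNumMem
    integer :: i, j
    do j = 1, 6
        ! ask the compiler to vectorize the contiguous inner loop
        !$omp simd
        do i = 1, intNumMem
            dblResults(i, j) = 0.d0
        end do
    end do
end subroutine zero_columns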
Second, somebody had subroutine parameters as
real(8), intent(out) :: a(:,:)
Really bad idea: for INTENT(OUT) parameters, Fortran is allowed to reallocate the memory, which becomes a performance issue for large arrays.
Use in or inout for arrays and it will never fail you.
INTENT(OUT) only matters for allocatable array dummy arguments, where the array will be deallocated if it is allocated on entry. If your procedure just writes the array, and you don't want (re)allocation, don't use ALLOCATABLE when declaring the dummy argument.
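A minimal sketch of the two cases (fillA/fillB are hypothetical names; assume both sit in a module so the interfaces are explicit):
subroutine fillA(a)
    real(8), intent(out) :: a(:,:)   ! assumed-shape, not allocatable:
    a = 0.d0                         ! no deallocation or reallocation occurs
end subroutine fillA

subroutine fillB(a)
    real(8), allocatable, intent(out) :: a(:,:)  ! deallocated on entry
    allocate(a(6,100))               ! must be (re)allocated before use
    a = 0.d0
end subroutine fillB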
Andriy,
The first dimension has the closest proximity between cells; the last dimension has the furthest. IOW:
dblResults(I,J) is adjacent to dblResults(I+1,J)
whereas
dblResults(I,J) is separated from dblResults(I,J+1) by size(dblResults,1) * sizeof(dblResults(1,1)) bytes
Therefore, for a partially filled array, i.e. part of the 10000 dimension, when this dimension comes last, the used elements are contiguous.
Should the 10000 dimension come first, say 100 of 10000, the memory usage would be:
100 cells used, 9900 cells skipped, 100 cells used, 9900 cells skipped, 100 cells used, 9900 cells skipped, 100 cells used, 9900 cells skipped, 100 cells used, 9900 cells skipped, 100 cells used
Processing the array in this manner would require:
6x more peel and remainder code segments (the inner and outer loop cannot be fused)
Additional memory pages accessed: from the 2 pages required up to between 6 and 12 pages required (assuming a 4KB page size)
While the number of pages touched might not incur noticeable overhead in a simple test program, it may matter in the actual application: a TLB miss has significant overhead.
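One way to see this in code, using the Fortran 2008 IS_CONTIGUOUS intrinsic (a minimal sketch):
program layout_demo
    implicit none
    double precision :: a(6,10000), b(10000,6)
    print *, is_contiguous(a(1:6, 1:100))   ! T: used elements form one block
    print *, is_contiguous(b(1:100, 1:6))   ! F: six strips with gaps between
end program layout_demo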
Jim Dempsey
jimdempseyatthecove (Blackbelt) wrote:
dblResults(I,J) is adjacent to dblResults(I+1,J)
whereas
dblResults(I,J) is separated from dblResults(I,J+1) by size(dblResults,1) * sizeof(dblResults(1,1)) bytes
Therefore, for a partially filled array, i.e. part of the 10000 dimension, when this dimension comes last, the used elements are contiguous.
Either my brain is atrophying, or this doesn't make sense. Surely if dblResults(I,J) is adjacent to dblResults(I+1,J), then
dblResults(1:100,J) refers to 100 contiguous locations. Am I out to lunch?
Gib
You haven't discussed whether you want an answer independent of the choice of compiler or optimization flags. If you use ifort with default options, I would expect the better options to invoke a library memset call, on the assumption that a majority of the array will be reset. The features of this include automatic alignment adjustment and the use of nontemporal SSE or AVX instructions. A possible scenario where this may not be optimum is where you can keep those arrays in cache as you alternate between setting and reusing them. In that case, you might wish to prevent nontemporal stores by directive, although that may not work for the nested loops.
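For illustration, a hedged sketch of the directive approach (the VECTOR TEMPORAL directive is ifort-specific, and whether it helps depends on the reuse pattern):
! request cached (temporal) stores for the zeroing loop
!DIR$ VECTOR TEMPORAL
do j = 1, intNumMem
    dblResults(1:6, j) = 0.0d0
end do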
By the way, I have been blocked from reaching the login server from my laptop for weeks now. It's a pain to have access only on the phone. I don't know if that is a feature of the politics of our state (internet access is still political here).
While dblResults(1:100,J) refers to 100 contiguous locations,
dblResults(1:100,J+1) begins 10000-100 = 9900 cells beyond the end of dblResults(1:100,J), well into some other one- or two-page region of memory.
Thus as you iterate the used portion of the array you have periodic leaps.
Indexed the other way around (and coded with the leftmost index in the innermost loop), any variable portion of the array can be used in a contiguous manner. This reduces the number of TLB entries required to map the used portion of the array and, when accessing the used portion as a whole, eliminates the inter-group loop peel and remainder processing. TLBs are Translation Lookaside Buffers; each CPU design has a limited number of them.
https://en.wikipedia.org/wiki/Thrashing_(computer_science)#TLB_thrashing:
TLB thrashing
Where the translation lookaside buffer (TLB) acting as a cache for the memory management unit (MMU) which translates virtual addresses to physical addresses is too small for the working set of pages. TLB thrashing can occur even if instruction cache or data cache thrashing are not occurring, because these are cached in different sizes. Instructions and data are cached in small blocks (cache lines), not entire pages, but address lookup is done at the page level. Thus even if the code and data working sets fit into cache, if the working sets are fragmented across many pages, the virtual address working set may not fit into TLB, causing TLB thrashing.
And as an example, the Xeon Gold 6130 (from http://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%206154.html):
64-byte Prefetching
Data TLB: 1-GB pages, 4-way set associative, 4 entries
Data TLB: 4-KB Pages, 4-way set associative, 64 entries
Instruction TLB: 4-KByte pages, 8-way set associative, 64 entries
L2 TLB: 1-MB, 4-way set associative, 64-byte line size
Shared 2nd-Level TLB: 4-KB / 2-MB pages, 6-way associative, 1536 entries. Plus, 1-GB pages, 4-way, 16 entries
Shared amongst 18 cores (~85 entries per core)
Older server CPUs had fewer (~64), and desktops even fewer.
Jim Dempsey