Original Code
integer :: i,j,k,ni,nj,nk
real*8, allocatable, dimension(:,:,:) :: x
! ... define ni, nj, nk ...
allocate (x(1:ni,1:nj,1:nk))
Rectangular arrays have always performed quite acceptably in earlier applications of the code. The desire for more complicated applications has necessitated new data structures, and derived data types seem to me the most appropriate choice for functionality, code legibility, efficient memory management, and execution speed in these more complicated applications.
I have subsequently implemented code of the general format:
New Code
type v3r8
   real*8, allocatable, dimension(:,:,:) :: v
end type v3r8

type (v3r8), allocatable, dimension(:) :: x
integer :: i,j,k,l,nl
integer, allocatable, dimension(:) :: ni,nj,nk

! ... define nl ...
allocate (ni(1:nl), nj(1:nl), nk(1:nl))
! ... define ni, nj, nk for each l ...
allocate (x(1:nl))
do l = 1, nl
   allocate (x(l)%v(1:ni(l),1:nj(l),1:nk(l)))
end do
(With both codes, there are many subsequent arithmetic operations wherein the values of x must be dereferenced)
(Because the values within the ni [or nj or nk] array can vary greatly, adding a 4th dimension to the x-array to account for the l-indexing could be very inefficient with memory management, thus the derived data type approach here. I broached these issues and received helpful assistance in this forum previously.)
This strategy is proving to be very -- but not perfectly -- efficient in memory management; i.e., little -- but not zero -- "wasted"/"unnecessary"/"extra" memory allocated.
However, the new code is exhibiting a 30% time increase in execution when I run "identical", "simple" cases which could be run on both the original code and my modified code.
Admittedly, I am working with a very large, multi-routine F95/MPI code with many modifications from its initial state, so I cannot claim to have rigorously isolated the critical change; still, I theorize that my modifications to these "fundamental" data structures are responsible for the adverse effect on performance.
An aside I think may actually be relevant: using Intel's "sizeof" function, I observe that my "x" variable of type "v3r8" (above) carries an "overhead" of 24 + 12*3 = 60 bytes above and beyond the 8 bytes per element allocated within the v(:,:,:) component of x(l). (Experimenting with different sized arrays, I have [seemingly] determined that the overhead is 24 bytes + 12 bytes per dimension of each component, hence 24 + 12*3 = 60 here; Steve clarified this in a previous thread of mine.)
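For reference, the measurement described above can be sketched roughly as follows. This assumes Intel Fortran's non-standard SIZEOF extension, and the 24 + 12*rank figure is what this thread reports on that platform, not a documented guarantee:

```fortran
! Sketch, assuming Intel Fortran's SIZEOF extension (non-standard).
program measure_overhead
  implicit none
  type v3r8
     real*8, allocatable, dimension(:,:,:) :: v
  end type v3r8
  type(v3r8) :: a
  allocate (a%v(1:10,1:10,1:10))
  ! sizeof(a) reports only the storage of the derived type itself
  ! (the descriptor), not the 10*10*10*8 bytes of data it points to
  print *, 'descriptor bytes:', sizeof(a)
end program measure_overhead
```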
What is stored in these "overhead" memory bytes? Could they contribute to poor performance (time)?
I write here to propose/question the following:
I propose these "overhead" bytes are storing pointers/array maps for the components of the derived data type. I further propose that when variables of the type "v3r8" such as "x" (above) are referenced, that some time/efficiency (30% time?) is lost as the pointer/memory map is deciphered so that the desired value can be properly dereferenced.
(It is true that with rectangular arrays, there is no "overhead": only the exact amount of memory to account for the size/quantity of data addresses is allocated. Rectangular arrays are stored in contiguous memory, and the array-dimensions are deciphered with pointer-arithmetic at the machine-level. Correct?)
(It is further true that the rectangular arrays which may be components of a derived data type contain no "overhead" with respect to themselves... that they are also one contiguous swath of memory; rather, their overhead is with respect to the governing derived data type and my aforementioned proposed description/understanding of the memory mapping/dereferencing process [i.e, the entire data structure does not necessarily occupy contiguous memory and thus requires more "arduous" dereferencing]. Correct?)
Does my scenario make sense?
Any and all clarification and elucidation on this subject would be greatly appreciated.
Thank you, Greg
Some terminology first. What you refer to as "rectangular arrays" are known as "explicit shape arrays" in Fortran. It is true that there is no overhead associated with these, either by themselves or in a derived type.
When you have an array with bounds such as (:,:,:), this is an "assumed shape" (or, for an allocatable, "deferred shape") array whose bounds are not defined until run time. There is a descriptor data structure associated with these arrays that holds the bounds information; its size varies with the rank (number of dimensions) of the array. If I recall correctly, 44 bytes are used for a one-dimensional array (the descriptor holds the base address, element size, A0 offset, and, for each dimension, the lower bound, upper bound, and stride).
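A small illustration of the descriptor's role: an assumed-shape dummy receives its shape and strides at run time through the descriptor, which is why intrinsics like LBOUND, UBOUND, and SIZE work inside the callee (a minimal sketch):

```fortran
program descriptor_demo
  implicit none
  real*8, allocatable :: a(:,:,:)
  allocate (a(2:5, 1:3, 0:7))
  call show(a)
contains
  subroutine show(v)
    real*8 :: v(:,:,:)   ! assumed shape: a descriptor is passed, not just an address
    ! assumed-shape dummies are rebased to lower bound 1 by default,
    ! but the extents (4, 3, 8) and total size (96) travel in the descriptor
    print *, lbound(v)   ! prints 1 1 1
    print *, ubound(v)   ! prints 4 3 8
    print *, size(v)     ! prints 96
  end subroutine show
end program descriptor_demo
```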
Yes, there will be a performance penalty for an assumed-shape array in a data structure as there are multiple memory references. 30% seems extreme to me, but a lot depends on what you're doing with these arrays.
The first thing you should look at is how you are traversing these arrays - do you tend to go through them in memory order, or are you skipping around? Would rearranging the dimensions help improve locality of access? Also, sometimes rather than a derived type with several arrays, you might find it more efficient to have arrays of derived types - it depends on your application.
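On the memory-order point: Fortran arrays are column-major, so the leftmost index varies fastest in memory and should be the innermost loop. A minimal sketch of the contiguous traversal pattern:

```fortran
program loop_order
  implicit none
  integer, parameter :: ni = 64, nj = 64, nk = 64
  real*8 :: x(ni, nj, nk)
  integer :: i, j, k
  do k = 1, nk             ! rightmost (slowest-varying) index outermost
     do j = 1, nj
        do i = 1, ni       ! leftmost index innermost: contiguous, cache-friendly
           x(i, j, k) = dble(i + j + k)
        end do
     end do
  end do
  print *, sum(x)
end program loop_order
```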
Steve, thank you very much for the quick reply and the advisement. I have since tested my hypothesis, and it is proving true: yes, I was suffering a 30% speed penalty "globally" due to this issue, with a few routines taking 200% of optimal time and one routine taking 400% of optimal time to run.
I performed this test on one machine: an SGI Altix / Intel Itanium 2. Do you think "slowdowns" of these magnitudes are architecture- or installation-specific? I will port my code to other platforms, now confident I have greater execution efficiency. A very puzzling, frustrating problem, but I am confident I have diagnosed and rectified it.
Thanks.
I don't think it's architecture- or installation-specific, but without seeing the actual application it's hard to even guess. I do know that until recent updates the compiler did not do overlap detection for pointer/allocatable arrays in derived types, and this could cause extra copies to be made.
Greg,
In your original code, x may have been the only "rectangular" array used by your application. x may have been declared in a COMMON or in a MODULE and used throughout your application. That is, x may not have been passed into upper-level subroutines (x, or portions of x, may have been passed into lower-level generic subroutines such as matrix manipulation routines).
I assume your new code changes result from the desire to process on more than one x. Therefore you created an array of arrays.
The performance degradation you are experiencing is likely due to the manner in which you are referencing this array of arrays.
A performance-poor method of dereferencing is:
do iX = 1, nX
   do ii = 1, ni
      do ij = 1, nj
         do ik = 1, nk
            ! series of expressions containing
            ! x(ix)%v(ii,ij,ik)
            x(ix)%v(ii,ij,ik) = ...
            yyy = zzz * x(ix)%v(ii,ij,ik)
            ...
         end do
      end do
   end do
end do
Or you may be inclined to pass the x index iX through the levels of subroutine calls.
Whereas a performance-improved method is:
do iX = 1, nX
   call DoX(x(ix)%v)
end do
...
subroutine DoX(x)
   real*8 :: x(:,:,:)   ! assumed shape: needs an explicit interface (e.g. place in a module)
   do ii = 1, ni
      do ij = 1, nj
         do ik = 1, nk
            ! series of expressions containing
            ! x(ii,ij,ik)
            x(ii,ij,ik) = ...
            yyy = zzz * x(ii,ij,ik)
            ...
         end do
      end do
   end do
end subroutine DoX
This method passes a reference to the desired x in the array of Xs.
The changes to your old code to use the performance-improved method are trivial:
pass x by reference to all subroutines and functions that reference x.
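The scheme above can be made into a self-contained, compilable sketch. The worker receives x(l)%v directly, so the inner code indexes a plain array with no repeated derived-type dereference; the names (v3r8, DoX) follow the posts above, and the sizes are placeholders:

```fortran
program pass_component
  implicit none
  type v3r8
     real*8, allocatable, dimension(:,:,:) :: v
  end type v3r8
  type(v3r8), allocatable :: x(:)
  integer :: l, nl
  nl = 3
  allocate (x(nl))
  do l = 1, nl
     allocate (x(l)%v(l*2, l*2, l*2))   ! ragged sizes per l
     call DoX(x(l)%v)                   ! pass the component by reference
  end do
contains
  subroutine DoX(v)
    real*8 :: v(:,:,:)                  ! plain assumed-shape array inside
    v = 1.0d0
    print *, 'filled', size(v), 'elements'
  end subroutine DoX
end program pass_component
```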
Jim Dempsey
Thank you, Jim. What you have listed with sample code is *exactly* what I have been implementing in my code to realize the "optimal" times I referenced in my second post. Yes, initially, I was dereferencing variables of derived type stored in modules; now I am merely passing references to the components of each x(ix)% to my subroutines. As you state, the changes to the code were rather trivial, but the performance improvement was dramatic.
Greg,
Now that you have observed the performance benefit of passing a reference to a section of a set of data (one of the x in an array of Xs), you might check your three dimensional array code to see if it tends to reference one of the dimensions much more than the others. If so, then by properly organizing the array you can pass references to these smaller sections of the array.
The trick is to keep the data in the smaller reference in adjacent memory, thus avoiding the creation of, and copy to/from, a temporary array. In defining the "rectangular" array, make the leftmost index the one used most often when manipulating a single index. Make the next (middle) index the one used most often when two indexes alone reference the array and the other index is the first index.
Thus DoX1(x(:, ij, ik)) passes a reference to a rank-1 array for use in rank-1 optimal code:
subroutine DoX1(x)
   real(8) :: x(:)
and DoX2(x(:, :, ik)) passes a reference to a rank-2 array for use in rank-2 optimal code:
subroutine DoX2(x)
   real(8) :: x(:,:)
Check to see whether the compiler is generating unnecessary temporaries; if it is, you can often eliminate them by using a pointer of the desired rank, or by declaring the type's components with the POINTER attribute in place of ALLOCATABLE.
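A hedged sketch of the pointer idea: associating a rank-2 pointer with a contiguous section hands the callee a descriptor that aliases the original data, so no temporary copy is needed. DoX2 here is a hypothetical rank-2 worker, as in the post above:

```fortran
program pointer_section
  implicit none
  real*8, allocatable, target :: x(:,:,:)   ! TARGET required for pointer association
  real*8, pointer :: slab(:,:)
  allocate (x(4, 4, 4))
  x = 1.0d0
  slab => x(:, :, 2)    ! no copy: the pointer's descriptor aliases the section
  call DoX2(slab)
contains
  subroutine DoX2(v)
    real*8 :: v(:,:)
    print *, size(v, 1), size(v, 2)   ! prints 4 4
  end subroutine DoX2
end program pointer_section
```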
If your program gets used a lot, and if the run times are significant, then it may be well worth your time to address performance issues.
A couple of years ago I had to address a similar situation in a complex application (700+ source files), where flexibility in specifying the quantity of, and dimensions of, the objects being manipulated had to be improved. The program had old F77 COMMONs in a rigid and somewhat stubborn fixed format. The unsubstantiated fear was that introducing allocatable arrays and pointers would make the resultant code, though perhaps more flexible, slower. With careful introduction of the newer F90 features, the code actually ran faster. And because references were passed into the upper-level functions, it was relatively easy to introduce OpenMP to parallelize the code. The net result was an unbounded form of the former application that ran 40x faster on a 4-core system.
For my research, even with the faster code, my simulation runs can take 100's of hours. Using the unmodified code would have been unworkable.
Jim Dempsey
