Create an INTERFACE or a FUNCTION with OPTIONAL arguments

NotThatItMatters · ‎07-22-2015

This is a question of programming style and effectiveness (speed). I have the option of creating two possible array accessors:

MODULE Stuff
INTERFACE Get_Val
    MODULE PROCEDURE Get_Val_One, Get_Val_Three
END INTERFACE Get_Val
INTEGER, PRIVATE :: II, JJ, KK, IJK
INTEGER, PARAMETER :: F = SELECTED_REAL_KIND(6, 30)
REAL (KIND = F), DIMENSION(:), ALLOCATABLE, PRIVATE :: ValArray

CONTAINS
REAL (KIND = F) FUNCTION Get_Val_One(IDX)
    INTEGER, INTENT(IN) :: IDX

    Get_Val_One = ValArray(IDX)
END FUNCTION Get_Val_One
REAL (KIND = F) FUNCTION Get_Val_Three(I, J, K)
    INTEGER, INTENT(IN) :: I, J, K

    INTEGER :: IDX
    IDX = ((K - 1) * JJ + J - 1) * II + I
    Get_Val_Three = ValArray(IDX)
END FUNCTION Get_Val_Three
!  Other functions setting values of the ValArray and one- and three-dimensional array bounds
END MODULE Stuff

The other array accessor would be the following:

MODULE OtherStuff
INTEGER, PRIVATE :: II, JJ, KK, IJK
INTEGER, PARAMETER :: F = SELECTED_REAL_KIND(6, 30)
REAL (KIND = F), DIMENSION(:), ALLOCATABLE, PRIVATE :: ValArray

CONTAINS
REAL (KIND = F) FUNCTION Get_Val(I, J, K)
    INTEGER, INTENT(IN) :: I
    INTEGER, INTENT(IN), OPTIONAL :: J, K

    INTEGER :: IDX

    IF (PRESENT(J) .AND. PRESENT(K)) THEN
        IDX = ((K - 1) * JJ + J - 1) * II + I
    ELSE
        IDX = I
    END IF
    Get_Val = ValArray(IDX)
END FUNCTION Get_Val
!  Other routines setting ValArray values and one- and three-dimensional array bounds
END MODULE OtherStuff

Is there any difference speed-wise and memory-footprint wise between the two options. Which option might be preferable?

jimdempseyatthecove · ‎07-22-2015

In the first case, the compiler resolves the test for number of indexes. In the second case, when the code is not inlined, the number of argument tests incurs a test and branch (some overhead). When inlined, you would expect the compiler to elide the dead code and be as efficient as the first method (but there is no guarantee that this would be the case). I would use the first method.

Jim Dempsey

mecej4 · ‎07-22-2015

If you use the first version, how you do the selection between calling with one argument and calling with three arguments is what will impact efficiency most.

Presumably, there are some nested i-j-k loops from within which the functions will be called. The test for selection should be in the outermost loop possible. If, in fact, the test can be made completely outside the loops, that would be even better.

The second version is easier to understand, but it requires the test to be made every time. It will probably not be efficient. However, efficiency would be an issue only if the functions are expected to be called millions of times or more.

NotThatItMatters · ‎07-22-2015

The methods also beg the question as to whether to define the array as one- or three-dimensional. The one-dimensional form seemed natural. The code I have is using the INTERFACE form, with the multiplicative index replaced by an actual INTEGER array value from another module. Again I am uncertain if this is helping, but the numbers are on the order of 10^6 calls for a simulation.

jimdempseyatthecove · ‎07-22-2015

Mecej4,

The forst method is by way of generic interface. Both the following are valid with the first method:

X = Get_Val(I)

Y = Get_Val(I,J,K)

The compiler choses the appropriate function (Get_Val_One or Get_Val_Three) based on arguments.

Jim Dempsey

mecej4 · ‎07-22-2015

Jim, I understood that. What I meant was: the efficiency depends on whether the calls are coded as

1. Efficient version:

IF (condition) THEN

do i=1,N

X(i) = Get_Val(I)

end do

ELSE

do i=1,N1; do j=1,N2; do k=1,N3

Y(i,j,k) = Get_Val(i,j,k)

end do; end do; end do

OR

2. Less efficient:

do i=1,N1; do j=1,N2; do k=1,N3

IF(cond)THEN

X(i)=Get_Val(i)

ELSE

Y(i,j,k) = Get_Val(i,j,k)

ENDIF

end do; end do; end do

With trivial code such as this, the optimizer may be able to make the "less efficient" version equally efficient as the other one, but with more complex code with hundred of lines can it do the same?

jimdempseyatthecove · ‎07-23-2015

Any sane programmer would not program al la 2. You could potentially have N2*N3 number of

X(i) = Get_Val(i)

for a given i (or multiple i's for that matter NotNotThatItMatters)

While the compiler may be able to rearrange the code, especially with loop invariant code, why rely on the compiler to fix your mistakes in programming?

Back to the original post.

I have a program that I am consulting on for optimization that has a collection of 4D arrays, at least 7 such arrays, all allocated to the same dimensions. A significant portion of time is spent in the subscript overhead of extracting from each array descriptor the same values to compute the same relative offset from the allocation. By use of a similar technique, the 1D equivalent index can be calculated once, then use (re used) 6 more times.

Steve (if you are following),

In thinking about this project that I am working on, it seems to me that it is not unusual for applications to have multiple multi-dimension arrays of the same rank and extents and for the program to contain a series of expressions containing the same indices. In situations like this, it may be beneficial to have a compiler directive to aid in optimization:

!*** HYPOTHETICAL ***
!DIR$ ASSUME_SAME_RANK_AND_SIZE :: FEE, FI, FO, FUM
X = FEE(I,J,K,L) + FI (I,J,K,L)+ FO(I,J,K,L) + FUM(I,J,K,L)

Then the above could compute the relative index once, and re-use it 3 additional times. In the above example, the array subscripting overhead for each array might be ~4x that of the overhead to use the value (assuming in cache). This feature, for the above statement, might yield a 2x improvement in performance (for that statement). Weights 5+5+5+5 verses 5+2+2+2

Jim Dempsey

Steven_L_Intel1 · ‎07-23-2015

Interesting idea, Jim, but I'm uncertain how that would work with vectorization and parallelization.

jimdempseyatthecove · ‎07-23-2015

I do not see any issue with vectorization nor parallelization. In both cases (vector lanes and/or array slices) use the same indicies. With a little re-work of the above statement we have

DO I=1,N
X(I) = FEE(I,J,K,L) + FI (I,J,K,L)+ FO(I,J,K,L) + FUM(I,J,K,L)
ENDDO

Loosely vectorized to

DO I=1,N,8
X(I:I+7) = FEE(I:I+7,J,K,L) + FI (I:I+7,J,K,L)+ FO(I:I+7,J,K,L) + FUM(I:I+7,J,K,L)
ENDDO

And in parallel form

!$OMP PARALLEL DO
DO I=1,N
   ! Thread 0 has I as 1:N/nThreads
   ! Thread 1 has I as 1+N/nThreads:2*N/nThreads
   ! ...
   X(I) = FEE(I,J,K,L) + FI (I,J,K,L)+ FO(I,J,K,L) + FUM(I,J,K,L)
ENDDO

All array references using the same relative offsets. The code would only need to reach into one of the array descriptors to calculate the relative offset, and do this once, then for each reference, add this relative offset to the base address (also held in the array descriptor). This assumes the availability of the directive discussed above. As you do with ASSUME_ALIGNED, it is the programmer's responsibility to not mislead the compiler by giving a false directive. The only thing keeping the compiler from doing this now, is it (usually) does not know the UBOUNDs and LBOUNDs are the same. When these arrays are declared statically (fixed dimensions), I venture to assume the compiler does collapse the relative subscript calculation to occur but once (for the above scenario). i.e., the only thing missing is a means for the compiler to know these arrays are dimensioned the same... ergo the proffered !DIR$ --- eh.

Jim Dempsey

NotThatItMatters · ‎07-24-2015

I must point out Jim's comment is the case in my code as well. There are times I need to access multiple arrays at the same relative index simultaneously. The old school method was to turn the 3-D arrays into 1-D arrays and step linearly through them, thereby maintaining the relative index. Without the compiler directive, that would appear to be the only way in.