Solved: suggestions for better code

Ralph_Nelson · ‎07-26-2017

Recoding an VERY old piece of software (originally vector for a CRAY XMP) to run faster on my PC. The following code is typical of some limiting sequences:

               DO j = 1, nelem
                  W(j) = zero
                  DO i = 1, nfmaxp2
                     IF ( CN(i,j)>0 ) W(j) = W(j) + A(i,j)*u(CN(i,j))
                  ENDDO
               ENDDO

Problem is the IF statement and I've tried server variations to speed things up but no real progress. Thought the brain truss here might have some suggestions.

Note: The original code on a particular test case ran 35+hrs on the CRAY. Today on my Haswell CPU PC running 8 threads and using the latest Intel Fortran compiler it takes 6+ minutes but there is still room for improvement.

jimdempseyatthecove · ‎07-27-2017

Two additional variants:

program CNindicies
    implicit none
    integer, parameter :: nelem = 10000
    integer, parameter :: nfmaxp2 = 100
    integer, parameter :: WhateverUis = nfmaxp2 * nelem
    real, parameter :: zero = 0.0
    integer :: i,j
    integer :: CN(nfmaxp2,nelem)
    real :: A(nfmaxp2,nelem), W(nelem), U(0:WhateverUis)
    DO j = 1, nelem
        BLOCK
            REAL :: WJ
            INTEGER :: CNindicies(nfmaxp2)
            CNindicies = CN(1:nfmaxp2,j)
!DIR$ if(.true.)
            WJ = zero
            DO i = 1, nfmaxp2
                WJ = WJ + A(i,j)*u(CNindicies(i))
            ENDDO
            W(j) = WJ
!DIR$ ELSE
            ! also try
            W(j) = SUM(A(1:nfmaxp2,j) * U(CNindicies))
!DIR$ ENDIF
        END BLOCK
    ENDDO
    
end program CNindicies

Insert into your program (fixup to fit your needs)

Run one test with if(.true.) and a second test with if(.false.)

Jim Dempsey

View solution in original post

jimdempseyatthecove · ‎07-26-2017

Can you assure that CN(I,J) is either > 0 .OR. == 0?

If so, consider dimensioning U(0:formerUpperLimitOfU), and setting U(0)=0

DO j = 1, nelem
  W(j) = zero
  DO i = 1, nfmaxp2
    W(j) = W(j) + A(i,j)*u(CN(i,j))
  ENDDO
ENDDO

The above eliminates the IF test and accumulates 0.0 for the missing entries.
The above code will vectorize without branches.

Note, the above will be efficient when a "goodly" portion of CN(I,j) > 0. What "goodly" is, is TBD.

Jim Dempsey

David_DiLaura1 · ‎07-26-2017

A variant on Jim's idea is to take advantage of Fortran's willingness to treat the result of a logical operation as an integer:

do j=1,nelem
   w(j)=zero
   do i=1,nfmaxp2
      w(j)=w(j)-a(i,j)*(CN(i,j)>0)
   end do
end do

Notice the negative sign. If the logical result is true the result is interpreted as -1. if the result is false, it is interpreted as zero. In either case, the If statement (and associated instruction queue flushing) is avoided. Depending on the size of the arrays, it may also be more efficient to establish a vector subset of array CN just before entering the i-loop. Using that vector in the i-loop affords simpler addressing and save some time. Though that advantage may be small compared to the increase in efficient if line 4 is vectorized by the compiler.

David

jimdempseyatthecove · ‎07-26-2017

David,

you are missing the U(CN(I,J)) factor.

This potentially can be a scaling factor (and now absence thereof).

Jim Dempsey

IanH · ‎07-26-2017

David DiLaura wrote:

A variant on Jim's idea is to take advantage of Fortran's willingness to treat the result of a logical operation as an integer:
do j=1,nelem
   w(j)=zero
   do i=1,nfmaxp2
      w(j)=w(j)-a(i,j)*(CN(i,j)>0)
   end do
end do
Notice the negative sign. If the logical result is true the result is interpreted as -1. if the result is false, it is interpreted as zero.

The use of a logical in an expression like this is an extension to the language, one that I expect would result in a diagnostic if standards checking was enabled. The value of the resulting expression interpreted as an integer is processor dependent and will possibly change with things like compile options. I wouldn't recommend this approach.

You can use the merge intrinsic to map a logical value across to an integer value in a portable, conforming manner.

Roman1 · ‎07-26-2017

In your code, you are accessing the array W(j) in the innermost loop, which might be inefficient. What would happen if you do the following:

where ( CN <= 0 )
   A = 0
   CN = 1
end where

DO j = 1, nelem
   wj = zero
   DO i = 1, nfmaxp2
      wj = wj + A(i,j)*u(CN(i,j))
   ENDDO
   W(j) = wj
ENDDO

Steve_Lionel · ‎07-26-2017

I absolutely agree with Ian - do NOT mix integers with logicals! See here for more.

I was wondering if FORALL would help at all here.

jimdempseyatthecove · ‎07-26-2017

Roman,

The original code did not modify array A nor array C. My suggestion was made under the premise that the array CN can be constructed to contain 0 (as opposed to <= 0) for the missing/skipped values. While array U is indirectly addressed, it is a gather operation. The newer CPUs support gather instructions. This can provide for some optimization when you have runs of numbers within CN(I,J) that reside within the same cache line. Use of temp (wj) should not be necessary with optimizing compiler (loop invariant code moved outside of loop).

Jim Dempsey

Ralph_Nelson · ‎07-27-2017

Thank you gentlemen, appreciate all your feedback and suggestions!

I did implement something along the line of what Jim suggested and it did improve the situation.

You might get chuckle out of this. I looked at Jim's suggestion to dimension U(0:formerUpperLimitOfU) and said to myself, is formerUpperLimitOf a new fortran construct? Oh well, it was early in the morning and that's the excuse I'm sticking to!

jimdempseyatthecove · ‎07-27-2017

Two additional variants:

program CNindicies
    implicit none
    integer, parameter :: nelem = 10000
    integer, parameter :: nfmaxp2 = 100
    integer, parameter :: WhateverUis = nfmaxp2 * nelem
    real, parameter :: zero = 0.0
    integer :: i,j
    integer :: CN(nfmaxp2,nelem)
    real :: A(nfmaxp2,nelem), W(nelem), U(0:WhateverUis)
    DO j = 1, nelem
        BLOCK
            REAL :: WJ
            INTEGER :: CNindicies(nfmaxp2)
            CNindicies = CN(1:nfmaxp2,j)
!DIR$ if(.true.)
            WJ = zero
            DO i = 1, nfmaxp2
                WJ = WJ + A(i,j)*u(CNindicies(i))
            ENDDO
            W(j) = WJ
!DIR$ ELSE
            ! also try
            W(j) = SUM(A(1:nfmaxp2,j) * U(CNindicies))
!DIR$ ENDIF
        END BLOCK
    ENDDO
    
end program CNindicies

Insert into your program (fixup to fit your needs)

Run one test with if(.true.) and a second test with if(.false.)

Jim Dempsey

John_Campbell · ‎07-28-2017

When processing elements 1:nelem, it is often that arrays of the form CN(i,j) are extremely sparse. There has been no indication of the % non-zero in CN(i,j). If this is very low, an alternative may be to create an array UCN(i1:i2) where UCN(i) = u(CN(i,j)), while noting the range i1:i2 where UCN(I) is non-zero.

Also, it may be worth considering:
* for a given j (or group of j); what is the range of I where CN(i,j) > 0, or
* for a given I (or group of i); what is the range of J where CN(i,j) > 0

Any systematic pattern in CN, of zeros or repeated value could provide significant gains.

Ralph_Nelson · ‎07-30-2017

Jim, The 2 suggestions you made are equal in performance. Both do help, ~25% decrease in run time.