Community
cancel
Showing results for
Did you mean:
Highlighted
Novice
37 Views

## suggestions for better code

Recoding an VERY old piece of software (originally vector for a CRAY XMP) to run faster on my PC.  The following code is typical of some limiting sequences:

DO j = 1, nelem
W(j) = zero
DO i = 1, nfmaxp2
IF ( CN(i,j)>0 ) W(j) = W(j) + A(i,j)*u(CN(i,j))
ENDDO
ENDDO

Problem is the IF statement and I've tried server variations to speed things up but no real progress.  Thought the brain truss here might have some suggestions.

Note: The original code on a particular test case ran 35+hrs on the CRAY.  Today on my Haswell CPU PC running 8 threads and using the latest Intel Fortran compiler it takes 6+ minutes but there is still room for improvement.

Tags (1)

Accepted Solutions
Highlighted
Black Belt
37 Views

```program CNindicies
implicit none
integer, parameter :: nelem = 10000
integer, parameter :: nfmaxp2 = 100
integer, parameter :: WhateverUis = nfmaxp2 * nelem
real, parameter :: zero = 0.0
integer :: i,j
integer :: CN(nfmaxp2,nelem)
real :: A(nfmaxp2,nelem), W(nelem), U(0:WhateverUis)
DO j = 1, nelem
BLOCK
REAL :: WJ
INTEGER :: CNindicies(nfmaxp2)
CNindicies = CN(1:nfmaxp2,j)
!DIR\$ if(.true.)
WJ = zero
DO i = 1, nfmaxp2
WJ = WJ + A(i,j)*u(CNindicies(i))
ENDDO
W(j) = WJ
!DIR\$ ELSE
! also try
W(j) = SUM(A(1:nfmaxp2,j) * U(CNindicies))
!DIR\$ ENDIF
END BLOCK
ENDDO

end program CNindicies
```

Run one test with if(.true.) and a second test with if(.false.)

Jim Dempsey

11 Replies
Highlighted
Black Belt
37 Views

Can you assure that CN(I,J) is either > 0 .OR. == 0?

If so, consider dimensioning U(0:formerUpperLimitOfU), and setting U(0)=0

```DO j = 1, nelem
W(j) = zero
DO i = 1, nfmaxp2
W(j) = W(j) + A(i,j)*u(CN(i,j))
ENDDO
ENDDO```

The above eliminates the IF test and accumulates 0.0 for the missing entries.
The above code will vectorize without branches.

Note, the above will be efficient when a "goodly" portion of CN(I,j) > 0. What "goodly" is, is TBD.

Jim Dempsey

Highlighted
Beginner
37 Views

A variant on Jim's idea is to take advantage of Fortran's willingness to treat the result of a logical operation as an integer:

```do j=1,nelem
w(j)=zero
do i=1,nfmaxp2
w(j)=w(j)-a(i,j)*(CN(i,j)>0)
end do
end do```

Notice the negative sign. If the logical result is true the result is interpreted as -1. if the result is false, it is interpreted as zero. In either case, the If statement (and associated instruction queue flushing) is avoided. Depending on the size of the arrays, it may also be more efficient to establish a vector subset of array CN just before entering the i-loop. Using that vector in the i-loop affords simpler addressing and save some time. Though that advantage may be small compared to the increase in efficient if line 4 is vectorized by the compiler.

David

Highlighted
Black Belt
37 Views

David,

you are missing the U(CN(I,J)) factor.

This potentially can be a scaling factor (and now absence thereof).

Jim Dempsey

Highlighted
Black Belt
37 Views

David DiLaura wrote:

A variant on Jim's idea is to take advantage of Fortran's willingness to treat the result of a logical operation as an integer:

```do j=1,nelem
w(j)=zero
do i=1,nfmaxp2
w(j)=w(j)-a(i,j)*(CN(i,j)>0)
end do
end do```

Notice the negative sign. If the logical result is true the result is interpreted as -1. if the result is false, it is interpreted as zero.

The use of a logical in an expression like this is an extension to the language, one that I expect would result in a diagnostic if standards checking was enabled.  The value of the resulting expression interpreted as an integer is processor dependent and will possibly change with things like compile options.  I wouldn't recommend this approach.

You can use the merge intrinsic to map a logical value across to an integer value in a portable, conforming manner.

Highlighted
New Contributor I
37 Views

In your code, you are accessing the array W(j) in the innermost loop, which might be inefficient.  What would happen if you do the following:

```where ( CN <= 0 )
A = 0
CN = 1
end where

DO j = 1, nelem
wj = zero
DO i = 1, nfmaxp2
wj = wj + A(i,j)*u(CN(i,j))
ENDDO
W(j) = wj
ENDDO

```
Highlighted
Black Belt Retired Employee
37 Views

I absolutely agree with Ian - do NOT mix integers with logicals! See here for more.

I was wondering if FORALL would help at all here.

Highlighted
Black Belt
37 Views

Roman,

The original code did not modify array A nor array C. My suggestion was made under the premise that the array CN can be constructed to contain 0 (as opposed to <= 0) for the missing/skipped values. While array U is indirectly addressed, it is a gather operation. The newer CPUs support gather instructions. This can provide for some optimization when you have runs of numbers within CN(I,J) that reside within the same cache line. Use of temp (wj) should not be necessary with optimizing compiler (loop invariant code moved outside of loop).

Jim Dempsey

Highlighted
Novice
37 Views

Thank you gentlemen, appreciate all your feedback and suggestions!

I did implement something along the line of what Jim suggested and it did improve the situation.

You might get chuckle out of this. I looked at Jim's suggestion to dimension U(0:formerUpperLimitOfU) and said to myself, is formerUpperLimitOf a new fortran construct?  Oh well, it was early in the morning and that's the excuse I'm sticking to!

Highlighted
Black Belt
38 Views

```program CNindicies
implicit none
integer, parameter :: nelem = 10000
integer, parameter :: nfmaxp2 = 100
integer, parameter :: WhateverUis = nfmaxp2 * nelem
real, parameter :: zero = 0.0
integer :: i,j
integer :: CN(nfmaxp2,nelem)
real :: A(nfmaxp2,nelem), W(nelem), U(0:WhateverUis)
DO j = 1, nelem
BLOCK
REAL :: WJ
INTEGER :: CNindicies(nfmaxp2)
CNindicies = CN(1:nfmaxp2,j)
!DIR\$ if(.true.)
WJ = zero
DO i = 1, nfmaxp2
WJ = WJ + A(i,j)*u(CNindicies(i))
ENDDO
W(j) = WJ
!DIR\$ ELSE
! also try
W(j) = SUM(A(1:nfmaxp2,j) * U(CNindicies))
!DIR\$ ENDIF
END BLOCK
ENDDO

end program CNindicies
```

Run one test with if(.true.) and a second test with if(.false.)

Jim Dempsey

Highlighted
New Contributor II
37 Views

When processing elements 1:nelem, it is often that arrays of the form CN(i,j) are extremely sparse. There has been no indication of the % non-zero in CN(i,j). If this is very low, an alternative may be to create an array UCN(i1:i2) where UCN(i) = u(CN(i,j)), while noting the range i1:i2 where UCN(I) is non-zero.

Also, it may be worth considering:
* for a given j (or group of j); what is the range of I where CN(i,j) > 0, or
* for a given I (or group of i); what is the range of J where CN(i,j) > 0

Any systematic pattern in CN, of zeros or repeated value could provide significant gains.

Highlighted
Novice
37 Views

Jim,  The 2 suggestions you made are equal in performance.  Both do help, ~25% decrease in run time.