- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Recoding an VERY old piece of software (originally vector for a CRAY XMP) to run faster on my PC. The following code is typical of some limiting sequences:

DO j = 1, nelem

W(j) = zero

DO i = 1, nfmaxp2

IF ( CN(i,j)>0 ) W(j) = W(j) + A(i,j)*u(CN(i,j))

ENDDO

ENDDO

Problem is the IF statement and I've tried server variations to speed things up but no real progress. Thought the brain truss here might have some suggestions.

Note: The original code on a particular test case ran 35+hrs on the CRAY. Today on my Haswell CPU PC running 8 threads and using the latest Intel Fortran compiler it takes 6+ minutes but there is still room for improvement.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Two additional variants:

program CNindicies implicit none integer, parameter :: nelem = 10000 integer, parameter :: nfmaxp2 = 100 integer, parameter :: WhateverUis = nfmaxp2 * nelem real, parameter :: zero = 0.0 integer :: i,j integer :: CN(nfmaxp2,nelem) real :: A(nfmaxp2,nelem), W(nelem), U(0:WhateverUis) DO j = 1, nelem BLOCK REAL :: WJ INTEGER :: CNindicies(nfmaxp2) CNindicies = CN(1:nfmaxp2,j) !DIR$ if(.true.) WJ = zero DO i = 1, nfmaxp2 WJ = WJ + A(i,j)*u(CNindicies(i)) ENDDO W(j) = WJ !DIR$ ELSE ! also try W(j) = SUM(A(1:nfmaxp2,j) * U(CNindicies)) !DIR$ ENDIF END BLOCK ENDDO end program CNindicies

Insert into your program (fixup to fit your needs)

Run one test with if(.true.) and a second test with if(.false.)

Jim Dempsey

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Can you assure that CN(I,J) is either > 0 .OR. == 0?

If so, consider dimensioning U(0:formerUpperLimitOfU), and setting U(0)=0

DO j = 1, nelem W(j) = zero DO i = 1, nfmaxp2 W(j) = W(j) + A(i,j)*u(CN(i,j)) ENDDO ENDDO

The above eliminates the IF test and accumulates 0.0 for the missing entries.

The above code will vectorize without branches.

Note, the above will be efficient when a "goodly" portion of CN(I,j) > 0. What "goodly" is, is TBD.

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

A variant on Jim's idea is to take advantage of Fortran's willingness to treat the result of a logical operation as an integer:

do j=1,nelem w(j)=zero do i=1,nfmaxp2 w(j)=w(j)-a(i,j)*(CN(i,j)>0) end do end do

Notice the negative sign. If the logical result is true the result is interpreted as -1. if the result is false, it is interpreted as zero. In either case, the If statement (and associated instruction queue flushing) is avoided. Depending on the size of the arrays, it may also be more efficient to establish a vector subset of array CN just before entering the i-loop. Using that vector in the i-loop affords simpler addressing and save some time. Though that advantage may be small compared to the increase in efficient if line 4 is vectorized by the compiler.

David

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

David,

you are missing the **U(**CN(I,J)) factor.

This potentially can be a scaling factor (and now absence thereof).

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

David DiLaura wrote:

A variant on Jim's idea is to take advantage of Fortran's willingness to treat the result of a logical operation as an integer:

do j=1,nelem w(j)=zero do i=1,nfmaxp2 w(j)=w(j)-a(i,j)*(CN(i,j)>0) end do end doNotice the negative sign. If the logical result is true the result is interpreted as -1. if the result is false, it is interpreted as zero.

The use of a logical in an expression like this is an extension to the language, one that I expect would result in a diagnostic if standards checking was enabled. The value of the resulting expression interpreted as an integer is processor dependent and will possibly change with things like compile options. I wouldn't recommend this approach.

You can use the merge intrinsic to map a logical value across to an integer value in a portable, conforming manner.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

In your code, you are accessing the array W(j) in the innermost loop, which might be inefficient. What would happen if you do the following:

where ( CN <= 0 ) A = 0 CN = 1 end where DO j = 1, nelem wj = zero DO i = 1, nfmaxp2 wj = wj + A(i,j)*u(CN(i,j)) ENDDO W(j) = wj ENDDO

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

I absolutely agree with Ian - do NOT mix integers with logicals! See here for more.

I was wondering if FORALL would help at all here.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Roman,

The original code did not modify array A nor array C. My suggestion was made under the premise that the array CN can be constructed to contain 0 (as opposed to <= 0) for the missing/skipped values. While array U is indirectly addressed, it is a gather operation. The newer CPUs support gather instructions. This can provide for some optimization when you have runs of numbers within CN(I,J) that reside within the same cache line. Use of temp (wj) should not be necessary with optimizing compiler (loop invariant code moved outside of loop).

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Thank you gentlemen, appreciate all your feedback and suggestions!

I did implement something along the line of what Jim suggested and it did improve the situation.

You might get chuckle out of this. I looked at Jim's suggestion to dimension U(0:formerUpperLimitOfU) and said to myself, is formerUpperLimitOf a new fortran construct? Oh well, it was early in the morning and that's the excuse I'm sticking to!

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Two additional variants:

program CNindicies implicit none integer, parameter :: nelem = 10000 integer, parameter :: nfmaxp2 = 100 integer, parameter :: WhateverUis = nfmaxp2 * nelem real, parameter :: zero = 0.0 integer :: i,j integer :: CN(nfmaxp2,nelem) real :: A(nfmaxp2,nelem), W(nelem), U(0:WhateverUis) DO j = 1, nelem BLOCK REAL :: WJ INTEGER :: CNindicies(nfmaxp2) CNindicies = CN(1:nfmaxp2,j) !DIR$ if(.true.) WJ = zero DO i = 1, nfmaxp2 WJ = WJ + A(i,j)*u(CNindicies(i)) ENDDO W(j) = WJ !DIR$ ELSE ! also try W(j) = SUM(A(1:nfmaxp2,j) * U(CNindicies)) !DIR$ ENDIF END BLOCK ENDDO end program CNindicies

Insert into your program (fixup to fit your needs)

Run one test with if(.true.) and a second test with if(.false.)

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

When processing elements 1:nelem, it is often that arrays of the form CN(i,j) are extremely sparse. There has been no indication of the % non-zero in CN(i,j). If this is very low, an alternative may be to create an array UCN(i1:i2) where UCN(i) = u(CN(i,j)), while noting the range i1:i2 where UCN(I) is non-zero.

Also, it may be worth considering:

* for a given j (or group of j); what is the range of I where CN(i,j) > 0, or

* for a given I (or group of i); what is the range of J where CN(i,j) > 0

Any systematic pattern in CN, of zeros or repeated value could provide significant gains.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Jim, The 2 suggestions you made are equal in performance. Both do help, ~25% decrease in run time.

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page