- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Dear All,
How can we rewriting the follow code to improve speed ?
I tried all the compiler option and found what was the optimum but still, I would like to have it quicker.
[fortran]
TCARRE=T*T
TCUBE=T*TCARRE
ENTH=C(1)*TCUBE+C(2)*TCARRE+C(3)*T+C(4)
[/fortran]
Thanks for your help,
Vincent
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Read, among others, about the Horner/ Qin Jiushao method:
ENTH = ((C(1)*T + C(2))*T + C(3))*T + C(4)
P.S.: Corrected in response to TimP below: he pointed out that I had the indices of C in ascending order, whereas the O.P. has them in descending order.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Or modified Horner,
ENTH=(C(1)*T+C(2))*TCARRE+(C(3)*T+C(4))
Try with /assume:protect_parens
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Vincent,
Are you saying that what you wrote is better than as mecej4 implied ?
ENTH = ((C(1)*T + C(2))*T + C(3))*T + C(4)
This version has fewer multiplies, but I am wondering if your version has better register retention ?
John
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Sergey,
I have always used the approach of mecej4, on the basis of minimal operations.
I was only asking Vincent if his question was based on experience that showed his approach worked better.
Tim's organisation of the equation is interesting, if the compiler can utilise this.
It is difficult to see the practical effect, as if it is being performed many times, then other vectorisation arrangements outside this code snippet could be more relevant.
John
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
At assembly level the code (taylor sine approx. by Horner scheme) could look like this:
movups xmm1,veclib1//load of coefficients
mulps xmm0,xmm1
addps xmm7,xmm0 //Xmm7 accumulates the result
movups xmm1,veclib2
mulps xmm0,xmm1
addps xmm7,xmm0
movups xmm1,veclib3
mulps xmm0,xmm1
addps xmm7,xmm0
movups xmm1,veclib4
mulps xmm0,xmm1
addps xmm7,xmm0
movups xmm1,veclib5
For few terms I do not think if one less or one more multiplication does really matter.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thank you all, I replaced my code by the code suggested by mecej4 and got 10% speed increase.
Another (but similar) question about solving a 2sd degree equation, is there a way to rewrite this code so that it works faster ?
<fortran>
ABC_Error = 0
IF (A.EQ.0.) THEN
IF (B.EQ.0.) THEN
IF (C.EQ.0.) THEN
ABC_Error = -2
ELSE
ABC_Error = -3
END IF
ELSE
X1 = -C/B
X2 = X1
ABC_Error = 1
END IF
ELSE
D = B*B-4.*A*C
IF (D.LT.0.) THEN
ABC_Error = -1
ELSE IF (D.EQ.0.)
THEN
X1 = -B/(2.*A)
X2 = X1
ELSE
XA = (-B-SIGN(SQRT(D),B))/(2.*A)
XB = C/(A*XA)
X1 = MAX(XA,XB)
X2 = MIN(XA,XB)
END IF
END IF
</fortran>
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
One cycle of latency per term does not make really any difference.Such a code can be also executed in parallel by two different Ports if there is no interdependencies involved.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Latencies obviously depend on factors which haven't been defined here. It may be easiest to guess latencies if you compile with /QxHost for Haswell. In that case, the sequence mecej4 suggested (making allowance for apparent reversal of the array) would execute 3 fma operations in sequence. The best which could be done, using 2 more registers, would be to reduce this to 2 fma operations in sequence, along with another 2 fp operations which could be done in parallel.
The form posted first would probably take the latency of at least 3 sequential additions (12 cycles?) more than the one posed by mecej4, while also maximizing number of registers tied up.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The code sequence above was not optimized for newest architecture and was compiled on Centrino processor.
The real concern is the load of coefficients in the form of SoA from memory.Btw on Merom processor addps will execute in 3 cycles(by Agner).
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
>>>s there a way to rewrite this code so that it works faster>>>
At least you can try to get rid of division operation and replace it with multiplication by inverted value(of course where it is applicable).
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I think a major impediment to us providing useful answers is you have expressed your requirements in too narrow of scope.
TCARRE=T*T
TCUBE=T*TCARRE
ENTH=C(1)*TCUBE+C(2)*TCARRE+C(3)*T+C(4)
or
ENTH = ((C(4)*T + C(3))*T + C(2))*T + C(1)
As recommended.
I suspect that your application has (the equivilent of) arrays of these 4-element arrays. The organization of your data will greatly affect the advice you receive. Example:
real(8) :: yourModel(:,:) ! allocated to yourModel(4,nThingies)
And you encapsulate your expression into:
real(8) function foo(C,T)
real(8) :: C(4), T
foo = ((C(4)*T + C(3))*T + C(2))*T + C(1)
end function foo
And you call the funciton
DO I=1, nThingies
ENTH(I) = foo(yourModel(:,I), T)
END DO
Then note that T is invariatant for the duration of the loop
In this case, then consider:
real(8) function foo(C,T)
real(8) :: C(4), T(4)
real(8) :: temp(4)
temp = C * T
foo = temp(1) + temp(2) + temp(3) + temp(4)
end function foo
real(8) :: Tvec(4)
...
Tvec(1) = 1.0_8
Tvec(2) = Tvec(1)*T
Tvec(3) = Tvec(2)*T
Tvec(4) = Tvec(3)*T
DO I=1,nThingies
ENTH(I) = foo(yourModel(:,I), Tvec)
END DO
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Jim,
Very interesting and valid points.
On your second option for consideration by Vincent, can you please comment on the relative merits and demerits of using DOT_PRODUCT intrinsic in the FOO function, say as follows?
[fortran]
PURE FUNCTION foo(C,T) RESULT(ENTH)
REAL(8), INTENT(IN) :: C(:)
REAL(8), INTENT(IN) :: T(:)
!.. Function result
REAL(KIND(C)) :: ENTH
ENTH = DOT_PRODUCT(C, T)
END FUNCTION foo
[/fortran]
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
FortranFan has it right, my code was presented to illustrate how you should re-think the problem.
Jim Dempsey
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page