- Als neu kennzeichnen
- Lesezeichen
- Abonnieren
- Stummschalten
- RSS-Feed abonnieren
- Kennzeichnen
- Anstößigen Inhalt melden
Dear All,
How can we rewriting the follow code to improve speed ?
I tried all the compiler option and found what was the optimum but still, I would like to have it quicker.
[fortran]
TCARRE=T*T
TCUBE=T*TCARRE
ENTH=C(1)*TCUBE+C(2)*TCARRE+C(3)*T+C(4)
[/fortran]
Thanks for your help,
Vincent
Link kopiert
- Als neu kennzeichnen
- Lesezeichen
- Abonnieren
- Stummschalten
- RSS-Feed abonnieren
- Kennzeichnen
- Anstößigen Inhalt melden
Read, among others, about the Horner/ Qin Jiushao method:
ENTH = ((C(1)*T + C(2))*T + C(3))*T + C(4)
P.S.: Corrected in response to TimP below: he pointed out that I had the indices of C in ascending order, whereas the O.P. has them in descending order.
- Als neu kennzeichnen
- Lesezeichen
- Abonnieren
- Stummschalten
- RSS-Feed abonnieren
- Kennzeichnen
- Anstößigen Inhalt melden
Or modified Horner,
ENTH=(C(1)*T+C(2))*TCARRE+(C(3)*T+C(4))
Try with /assume:protect_parens
- Als neu kennzeichnen
- Lesezeichen
- Abonnieren
- Stummschalten
- RSS-Feed abonnieren
- Kennzeichnen
- Anstößigen Inhalt melden
Vincent,
Are you saying that what you wrote is better than as mecej4 implied ?
ENTH = ((C(1)*T + C(2))*T + C(3))*T + C(4)
This version has fewer multiplies, but I am wondering if your version has better register retention ?
John
- Als neu kennzeichnen
- Lesezeichen
- Abonnieren
- Stummschalten
- RSS-Feed abonnieren
- Kennzeichnen
- Anstößigen Inhalt melden
- Als neu kennzeichnen
- Lesezeichen
- Abonnieren
- Stummschalten
- RSS-Feed abonnieren
- Kennzeichnen
- Anstößigen Inhalt melden
Sergey,
I have always used the approach of mecej4, on the basis of minimal operations.
I was only asking Vincent if his question was based on experience that showed his approach worked better.
Tim's organisation of the equation is interesting, if the compiler can utilise this.
It is difficult to see the practical effect, as if it is being performed many times, then other vectorisation arrangements outside this code snippet could be more relevant.
John
- Als neu kennzeichnen
- Lesezeichen
- Abonnieren
- Stummschalten
- RSS-Feed abonnieren
- Kennzeichnen
- Anstößigen Inhalt melden
At assembly level the code (taylor sine approx. by Horner scheme) could look like this:
movups xmm1,veclib1//load of coefficients
mulps xmm0,xmm1
addps xmm7,xmm0 //Xmm7 accumulates the result
movups xmm1,veclib2
mulps xmm0,xmm1
addps xmm7,xmm0
movups xmm1,veclib3
mulps xmm0,xmm1
addps xmm7,xmm0
movups xmm1,veclib4
mulps xmm0,xmm1
addps xmm7,xmm0
movups xmm1,veclib5
For few terms I do not think if one less or one more multiplication does really matter.
- Als neu kennzeichnen
- Lesezeichen
- Abonnieren
- Stummschalten
- RSS-Feed abonnieren
- Kennzeichnen
- Anstößigen Inhalt melden
Thank you all, I replaced my code by the code suggested by mecej4 and got 10% speed increase.
Another (but similar) question about solving a 2sd degree equation, is there a way to rewrite this code so that it works faster ?
<fortran>
ABC_Error = 0
IF (A.EQ.0.) THEN
IF (B.EQ.0.) THEN
IF (C.EQ.0.) THEN
ABC_Error = -2
ELSE
ABC_Error = -3
END IF
ELSE
X1 = -C/B
X2 = X1
ABC_Error = 1
END IF
ELSE
D = B*B-4.*A*C
IF (D.LT.0.) THEN
ABC_Error = -1
ELSE IF (D.EQ.0.)
THEN
X1 = -B/(2.*A)
X2 = X1
ELSE
XA = (-B-SIGN(SQRT(D),B))/(2.*A)
XB = C/(A*XA)
X1 = MAX(XA,XB)
X2 = MIN(XA,XB)
END IF
END IF
</fortran>
- Als neu kennzeichnen
- Lesezeichen
- Abonnieren
- Stummschalten
- RSS-Feed abonnieren
- Kennzeichnen
- Anstößigen Inhalt melden
- Als neu kennzeichnen
- Lesezeichen
- Abonnieren
- Stummschalten
- RSS-Feed abonnieren
- Kennzeichnen
- Anstößigen Inhalt melden
One cycle of latency per term does not make really any difference.Such a code can be also executed in parallel by two different Ports if there is no interdependencies involved.
- Als neu kennzeichnen
- Lesezeichen
- Abonnieren
- Stummschalten
- RSS-Feed abonnieren
- Kennzeichnen
- Anstößigen Inhalt melden
Latencies obviously depend on factors which haven't been defined here. It may be easiest to guess latencies if you compile with /QxHost for Haswell. In that case, the sequence mecej4 suggested (making allowance for apparent reversal of the array) would execute 3 fma operations in sequence. The best which could be done, using 2 more registers, would be to reduce this to 2 fma operations in sequence, along with another 2 fp operations which could be done in parallel.
The form posted first would probably take the latency of at least 3 sequential additions (12 cycles?) more than the one posed by mecej4, while also maximizing number of registers tied up.
- Als neu kennzeichnen
- Lesezeichen
- Abonnieren
- Stummschalten
- RSS-Feed abonnieren
- Kennzeichnen
- Anstößigen Inhalt melden
The code sequence above was not optimized for newest architecture and was compiled on Centrino processor.
The real concern is the load of coefficients in the form of SoA from memory.Btw on Merom processor addps will execute in 3 cycles(by Agner).
- Als neu kennzeichnen
- Lesezeichen
- Abonnieren
- Stummschalten
- RSS-Feed abonnieren
- Kennzeichnen
- Anstößigen Inhalt melden
>>>s there a way to rewrite this code so that it works faster>>>
At least you can try to get rid of division operation and replace it with multiplication by inverted value(of course where it is applicable).
- Als neu kennzeichnen
- Lesezeichen
- Abonnieren
- Stummschalten
- RSS-Feed abonnieren
- Kennzeichnen
- Anstößigen Inhalt melden
I think a major impediment to us providing useful answers is you have expressed your requirements in too narrow of scope.
TCARRE=T*T
TCUBE=T*TCARRE
ENTH=C(1)*TCUBE+C(2)*TCARRE+C(3)*T+C(4)
or
ENTH = ((C(4)*T + C(3))*T + C(2))*T + C(1)
As recommended.
I suspect that your application has (the equivilent of) arrays of these 4-element arrays. The organization of your data will greatly affect the advice you receive. Example:
real(8) :: yourModel(:,:) ! allocated to yourModel(4,nThingies)
And you encapsulate your expression into:
real(8) function foo(C,T)
real(8) :: C(4), T
foo = ((C(4)*T + C(3))*T + C(2))*T + C(1)
end function foo
And you call the funciton
DO I=1, nThingies
ENTH(I) = foo(yourModel(:,I), T)
END DO
Then note that T is invariatant for the duration of the loop
In this case, then consider:
real(8) function foo(C,T)
real(8) :: C(4), T(4)
real(8) :: temp(4)
temp = C * T
foo = temp(1) + temp(2) + temp(3) + temp(4)
end function foo
real(8) :: Tvec(4)
...
Tvec(1) = 1.0_8
Tvec(2) = Tvec(1)*T
Tvec(3) = Tvec(2)*T
Tvec(4) = Tvec(3)*T
DO I=1,nThingies
ENTH(I) = foo(yourModel(:,I), Tvec)
END DO
Jim Dempsey
- Als neu kennzeichnen
- Lesezeichen
- Abonnieren
- Stummschalten
- RSS-Feed abonnieren
- Kennzeichnen
- Anstößigen Inhalt melden
Jim,
Very interesting and valid points.
On your second option for consideration by Vincent, can you please comment on the relative merits and demerits of using DOT_PRODUCT intrinsic in the FOO function, say as follows?
[fortran]
PURE FUNCTION foo(C,T) RESULT(ENTH)
REAL(8), INTENT(IN) :: C(:)
REAL(8), INTENT(IN) :: T(:)
!.. Function result
REAL(KIND(C)) :: ENTH
ENTH = DOT_PRODUCT(C, T)
END FUNCTION foo
[/fortran]
- Als neu kennzeichnen
- Lesezeichen
- Abonnieren
- Stummschalten
- RSS-Feed abonnieren
- Kennzeichnen
- Anstößigen Inhalt melden
FortranFan has it right, my code was presented to illustrate how you should re-think the problem.
Jim Dempsey

- RSS-Feed abonnieren
- Thema als neu kennzeichnen
- Thema als gelesen kennzeichnen
- Diesen Thema für aktuellen Benutzer floaten
- Lesezeichen
- Abonnieren
- Drucker-Anzeigeseite