One cycle of latency per term

Rudy__Delcroit · ‎06-20-2013

Dear All,

How can we rewriting the follow code to improve speed ?

I tried all the compiler option and found what was the optimum but still, I would like to have it quicker.

[fortran]

TCARRE=T*T

TCUBE=T*TCARRE

ENTH=C(1)*TCUBE+C(2)*TCARRE+C(3)*T+C(4)

[/fortran]

Thanks for your help,

Vincent

mecej4 · ‎06-20-2013

Read, among others, about the Horner/ Qin Jiushao method:

ENTH = ((C(1)*T + C(2))*T + C(3))*T + C(4)

P.S.: Corrected in response to TimP below: he pointed out that I had the indices of C in ascending order, whereas the O.P. has them in descending order.

TimP · ‎06-20-2013

Or modified Horner,

ENTH=(C(1)*T+C(2))*TCARRE+(C(3)*T+C(4))

Try with /assume:protect_parens

John_Campbell · ‎06-20-2013

Vincent,

Are you saying that what you wrote is better than as mecej4 implied ?

ENTH = ((C(1)*T + C(2))*T + C(3))*T + C(4)

This version has fewer multiplies, but I am wondering if your version has better register retention ?

John

SergeyKostrov · ‎06-20-2013

>>...This version has fewer multiplies, but I am wondering if your version has better register retention?.. John, Even if there are less multiplications in that equation you don't take into account that at a lowest level there are Latencies, Throughputs ( some calculations could be executed in parallel ) for instructions and Pipelines in a CPU. So, for a really small equation with a very limited number of degrees I wouldn't worry. Is that really critical for a 3rd degree polynomial? If you think that you're right go practical and prove that there is even a small difference. I hope that you've heard that expression from Mr. Spock in Star Trek movie: Any difference that makes no difference, is no difference...

John_Campbell · ‎06-21-2013

Sergey,

I have always used the approach of mecej4, on the basis of minimal operations.
I was only asking Vincent if his question was based on experience that showed his approach worked better.
Tim's organisation of the equation is interesting, if the compiler can utilise this.

It is difficult to see the practical effect, as if it is being performed many times, then other vectorisation arrangements outside this code snippet could be more relevant.

John

Bernard · ‎06-21-2013

At assembly level the code (taylor sine approx. by Horner scheme) could look like this:

                    movups xmm1,veclib1//load of coefficients
                    mulps xmm0,xmm1
                    addps xmm7,xmm0 //Xmm7 accumulates the result
                    movups xmm1,veclib2
                    mulps xmm0,xmm1
                    addps xmm7,xmm0
                    movups xmm1,veclib3
                    mulps xmm0,xmm1
                    addps xmm7,xmm0
                    movups xmm1,veclib4
                    mulps xmm0,xmm1
                    addps xmm7,xmm0
                    movups xmm1,veclib5

For few terms I do not think if one less or one more multiplication does really matter.

Rudy__Delcroit · ‎06-21-2013

Thank you all, I replaced my code by the code suggested by mecej4 and got 10% speed increase.

Another (but similar) question about solving a 2sd degree equation, is there a way to rewrite this code so that it works faster ?

ABC_Error = 0

IF (A.EQ.0.) THEN

IF (B.EQ.0.) THEN

IF (C.EQ.0.) THEN

ABC_Error = -2

ELSE

ABC_Error = -3

END IF

ELSE

X1 = -C/B

X2 = X1

ABC_Error = 1

END IF

ELSE

D = B*B-4.*A*C

IF (D.LT.0.) THEN

ABC_Error = -1

ELSE IF (D.EQ.0.)

THEN

X1 = -B/(2.*A)

X2 = X1

ELSE

XA = (-B-SIGN(SQRT(D),B))/(2.*A)

XB = C/(A*XA)

X1 = MAX(XA,XB)

X2 = MIN(XA,XB)

END IF

</fortran>

SergeyKostrov · ‎06-21-2013

>>...It is difficult to see the practical effect, as if it is being performed many times... I disagree with that because I tested many polynomials of higher degree during last two years.

Bernard · ‎06-21-2013

One cycle of latency per term does not make really any difference.Such a code can be also executed in parallel by two different Ports if there is no interdependencies involved.

TimP · ‎06-21-2013

Latencies obviously depend on factors which haven't been defined here. It may be easiest to guess latencies if you compile with /QxHost for Haswell. In that case, the sequence mecej4 suggested (making allowance for apparent reversal of the array) would execute 3 fma operations in sequence. The best which could be done, using 2 more registers, would be to reduce this to 2 fma operations in sequence, along with another 2 fp operations which could be done in parallel.

The form posted first would probably take the latency of at least 3 sequential additions (12 cycles?) more than the one posed by mecej4, while also maximizing number of registers tied up.

Bernard · ‎06-21-2013

The code sequence above was not optimized for newest architecture and was compiled on Centrino processor.

The real concern is the load of coefficients in the form of SoA from memory.Btw on Merom processor addps will execute in 3 cycles(by Agner).

Bernard · ‎06-21-2013

>>>s there a way to rewrite this code so that it works faster>>>

At least you can try to get rid of division operation and replace it with multiplication by inverted value(of course where it is applicable).

jimdempseyatthecove · ‎06-21-2013

I think a major impediment to us providing useful answers is you have expressed your requirements in too narrow of scope.

TCARRE=T*T
TCUBE=T*TCARRE
ENTH=C(1)*TCUBE+C(2)*TCARRE+C(3)*T+C(4)

or

ENTH = ((C(4)*T + C(3))*T + C(2))*T + C(1)

As recommended.

I suspect that your application has (the equivilent of) arrays of these 4-element arrays. The organization of your data will greatly affect the advice you receive. Example:

real(8) :: yourModel(:,:) ! allocated to yourModel(4,nThingies)

And you encapsulate your expression into:

real(8) function foo(C,T)
real(8) :: C(4), T
foo = ((C(4)*T + C(3))*T + C(2))*T + C(1)
end function foo

And you call the funciton
DO I=1, nThingies
ENTH(I) = foo(yourModel(:,I), T)
END DO

Then note that T is invariatant for the duration of the loop

In this case, then consider:

real(8) function foo(C,T)
real(8) :: C(4), T(4)
real(8) :: temp(4)
temp = C * T
foo = temp(1) + temp(2) + temp(3) + temp(4)
end function foo

real(8) :: Tvec(4)
...
Tvec(1) = 1.0_8
Tvec(2) = Tvec(1)*T
Tvec(3) = Tvec(2)*T
Tvec(4) = Tvec(3)*T
DO I=1,nThingies
ENTH(I) = foo(yourModel(:,I), Tvec)
END DO

Jim Dempsey

FortranFan · ‎06-26-2013

Jim,

Very interesting and valid points.

On your second option for consideration by Vincent, can you please comment on the relative merits and demerits of using DOT_PRODUCT intrinsic in the FOO function, say as follows?

[fortran]

PURE FUNCTION foo(C,T) RESULT(ENTH)

REAL(8), INTENT(IN) :: C(:)

REAL(8), INTENT(IN) :: T(:)

!.. Function result

REAL(KIND(C)) :: ENTH

ENTH = DOT_PRODUCT(C, T)

END FUNCTION foo

[/fortran]

jimdempseyatthecove · ‎06-26-2013

FortranFan has it right, my code was presented to illustrate how you should re-think the problem.

Jim Dempsey

3th degree equation optimisation