Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

3th degree equation optimisation

Rudy__Delcroit
Beginner
584 Views

Dear All,

How can we rewriting the follow code to improve speed ?

I tried all the compiler option and found what was the optimum but still, I would like to have it quicker.

[fortran]

TCARRE=T*T      

TCUBE=T*TCARRE

ENTH=C(1)*TCUBE+C(2)*TCARRE+C(3)*T+C(4)

[/fortran]

 

Thanks for your help,

Vincent

0 Kudos
15 Replies
mecej4
Honored Contributor III
584 Views

Read, among others, about the Horner/ Qin Jiushao method:

     ENTH = ((C(1)*T + C(2))*T + C(3))*T + C(4)

P.S.: Corrected in response to TimP below: he pointed out that I had the indices of C in ascending order, whereas the O.P. has them in descending order.

0 Kudos
TimP
Honored Contributor III
584 Views

Or modified Horner,

ENTH=(C(1)*T+C(2))*TCARRE+(C(3)*T+C(4))

Try with /assume:protect_parens

0 Kudos
John_Campbell
New Contributor II
584 Views

Vincent,

Are you saying that what you wrote is better than as mecej4 implied ?

ENTH = ((C(1)*T + C(2))*T + C(3))*T + C(4)

This version has fewer multiplies, but I am wondering if your version has better register retention ?

John

0 Kudos
SergeyKostrov
Valued Contributor II
584 Views
>>...This version has fewer multiplies, but I am wondering if your version has better register retention?.. John, Even if there are less multiplications in that equation you don't take into account that at a lowest level there are Latencies, Throughputs ( some calculations could be executed in parallel ) for instructions and Pipelines in a CPU. So, for a really small equation with a very limited number of degrees I wouldn't worry. Is that really critical for a 3rd degree polynomial? If you think that you're right go practical and prove that there is even a small difference. I hope that you've heard that expression from Mr. Spock in Star Trek movie: Any difference that makes no difference, is no difference...
0 Kudos
John_Campbell
New Contributor II
584 Views

Sergey,

I have always used the approach of mecej4, on the basis of minimal operations.
I was only asking Vincent if his question was based on experience that showed his approach worked better.
Tim's organisation of the equation is interesting, if the compiler can utilise this.

It is difficult to see the practical effect, as if it is being performed many times, then other vectorisation arrangements outside this code snippet could be more relevant.

John

0 Kudos
Bernard
Valued Contributor I
584 Views

 At assembly level the code (taylor sine approx. by Horner scheme) could look like this:

                    movups xmm1,veclib1//load of coefficients
                    mulps xmm0,xmm1
                    addps xmm7,xmm0 //Xmm7 accumulates the result
                    movups xmm1,veclib2
                    mulps xmm0,xmm1
                    addps xmm7,xmm0
                    movups xmm1,veclib3
                    mulps  xmm0,xmm1
                    addps xmm7,xmm0
                    movups xmm1,veclib4
                    mulps  xmm0,xmm1
                    addps xmm7,xmm0
                    movups xmm1,veclib5

For few terms I do not think if one less or one more multiplication does really matter.

0 Kudos
Rudy__Delcroit
Beginner
584 Views

Thank you all, I replaced my code by the code suggested by mecej4 and got 10% speed increase.

 

Another (but similar) question about solving a 2sd degree equation, is there a way to rewrite this code so that it works faster ?

 

<fortran>

     ABC_Error = 0

       IF (A.EQ.0.)  THEN

          IF (B.EQ.0.)  THEN

             IF (C.EQ.0.)  THEN

                ABC_Error = -2

             ELSE

                ABC_Error = -3

             END IF

          ELSE

             X1 = -C/B

             X2 = X1

             ABC_Error = 1

          END IF

       ELSE

          D = B*B-4.*A*C

          IF (D.LT.0.)  THEN

               ABC_Error = -1

          ELSE IF (D.EQ.0.)

  THEN

             X1 = -B/(2.*A)

             X2 = X1

          ELSE

             XA = (-B-SIGN(SQRT(D),B))/(2.*A)

             XB = C/(A*XA)

             X1 = MAX(XA,XB)

             X2 = MIN(XA,XB)

          END IF

       END IF

 </fortran>

0 Kudos
SergeyKostrov
Valued Contributor II
584 Views
>>...It is difficult to see the practical effect, as if it is being performed many times... I disagree with that because I tested many polynomials of higher degree during last two years.
0 Kudos
Bernard
Valued Contributor I
584 Views

One cycle of latency per term does not make really any difference.Such a code can be also executed in parallel by two different Ports if there is no interdependencies involved.

0 Kudos
TimP
Honored Contributor III
584 Views

Latencies obviously depend on factors which haven't been defined here.  It may be easiest to guess latencies if you compile with /QxHost for Haswell.  In that case, the sequence mecej4 suggested (making allowance for apparent reversal of the array) would execute 3 fma operations in sequence.  The best which could be done, using 2 more registers, would be to reduce this to 2 fma operations in sequence, along with another 2 fp operations which could be done in parallel.

The form posted first would probably take the latency of at least 3 sequential additions (12 cycles?) more than the one posed by mecej4, while also maximizing number of registers tied up.

0 Kudos
Bernard
Valued Contributor I
584 Views

The code sequence above was not optimized for newest architecture and was compiled on Centrino processor.

The real concern is the load of coefficients in the form of SoA from memory.Btw on Merom processor addps will execute in 3 cycles(by Agner).

0 Kudos
Bernard
Valued Contributor I
584 Views

>>>s there a way to rewrite this code so that it works faster>>>

At least you can try to get rid of division operation and replace it with multiplication by inverted value(of course where it is applicable).

0 Kudos
jimdempseyatthecove
Honored Contributor III
584 Views

I think a major impediment to us providing useful answers is you have expressed your requirements in too narrow of scope.

TCARRE=T*T       
TCUBE=T*TCARRE 
ENTH=C(1)*TCUBE+C(2)*TCARRE+C(3)*T+C(4)

or

ENTH = ((C(4)*T + C(3))*T + C(2))*T + C(1)

As recommended.

I suspect that your application has (the equivilent of) arrays of these 4-element arrays. The organization of your data will greatly affect the advice you receive. Example:

real(8) :: yourModel(:,:) ! allocated to yourModel(4,nThingies)

And you encapsulate your expression into:

real(8) function foo(C,T)
real(8) :: C(4), T
foo = ((C(4)*T + C(3))*T + C(2))*T + C(1)
end function foo

And you call the funciton
DO I=1, nThingies
ENTH(I) = foo(yourModel(:,I), T)
END DO

Then note that T is invariatant for the duration of the loop

In this case, then consider:

real(8) function foo(C,T)
real(8) :: C(4), T(4)
real(8) :: temp(4)
temp = C * T
foo = temp(1) + temp(2) + temp(3) + temp(4)
end function foo

real(8) :: Tvec(4)
...
Tvec(1) = 1.0_8
Tvec(2) = Tvec(1)*T
Tvec(3) = Tvec(2)*T
Tvec(4) = Tvec(3)*T
DO I=1,nThingies
ENTH(I) = foo(yourModel(:,I), Tvec)
END DO

Jim Dempsey

0 Kudos
FortranFan
Honored Contributor II
584 Views

Jim,

Very interesting and valid points.

On your second option for consideration by Vincent, can you please comment on the relative merits and demerits of using DOT_PRODUCT intrinsic in the FOO function, say as follows?

[fortran]

PURE FUNCTION foo(C,T) RESULT(ENTH)

 

   REAL(8), INTENT(IN) :: C(:)

   REAL(8), INTENT(IN) :: T(:)

 

   !.. Function result

   REAL(KIND(C)) :: ENTH

 

   ENTH = DOT_PRODUCT(C, T)

 

END FUNCTION foo

 

[/fortran]

0 Kudos
jimdempseyatthecove
Honored Contributor III
584 Views

FortranFan has it right, my code was presented to illustrate how you should re-think the problem.

Jim Dempsey

0 Kudos
Reply