Solved: Optimization of sine function's taylor expansion

Bernard · ‎05-24-2012

Hello!
While coding in assembly various series expansions of many functions i tried to optimize my code mainly by using rcpps instruction instead of divaps.When performing tests oncode which calculates sine function by taylor series i did a few measurements as explained here :"http://software.intel.com/en-us/forums/showthread.php?t=52482" and the result was between 10-12 cycles per first term of sine expansion(i used 14 terms).
I would like to ask you how can i rewrite this code in order to gain speed of execution improvment.
[bash]movups xmm0,argument movups xmm1,argument mulps xmm1,xmm1 mulps xmm1,xmm0 mov ebx,OFFSET coef movups xmm2,[ebx] rcpps xmm3,xmm2 ;rcpps used instead of divaps mulps xmm1,xmm3 subps xmm0,xmm1[/bash]

bronxzv · ‎06-08-2012

calls 1e6 times fastsin() the result in millisecond is 63

so it's 63 ns per iteration or ~ 120 clocks on your CPU, it does't match your previous reports IIRC

if you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer than mine

View solution in original post

sirrida · ‎05-24-2012

You should can gain something when you precalculate the reversed coefficients and when you reorder the codes.

movups xmm1,argument
mov ebx,OFFSET rev_coef
movups xmm0,argument
mulps xmm1,xmm1
movups xmm2,[ebx]
mulps xmm1,xmm0
mulps xmm1,xmm2
subps xmm0,xmm1

At least the coefficients should be aligned. If this is the case you could also try this:

movups xmm1,argument
movups xmm0,argument
mov ebx,OFFSET rev_coef
mulps xmm1,xmm1
mulps xmm1,xmm0
mulps xmm1,[ebx]
subps xmm0,xmm1

If also the offset to rev_coef is constant you can also remove the load of ebx:

movups xmm1,argument
movups xmm0,argument
mulps xmm1,xmm1
mulps xmm1,xmm0
mulps xmm1,[OFFSET rev_coef]
subps xmm0,xmm1

Bernard · ‎05-24-2012

Thanks for your fast answer!
Yep i did not think about the pre-calculations of coefficient's inverse ,it seem like good speed improvemnt.
I initially tried to write the codethat looks like your last example but in the runtime i got access violation error inspite of coefficient beign aligned.
By trying to rewrite this code like you showedi suppose that i could reach 100 cycles for less than 14 terms it could be comparable to hardware accelerated fcos and fsin instruction but with single precision accuracy.
Do you know what an implementation are fcos and fsin based on? I mean an approximation algorithm .

TimP · ‎05-24-2012

As the other responder implied, it doesn't make sense to leave division in an implementation of Taylor (or, better, Chebyshev economized or similar) power series. The rcp division approximation might be useable for the smallest term of a series. An iterative approximation to approach full accuracy takes as long (longer, on the most up to date CPUs) as a full implemenation of division, and is referred to by the Intel compilers as a "throughput" optimization, as the advantage would show when multiple independent calculations could be interleaved by instruction level parallelism. The same compile options which introduce this "throughput" optimization usually also perform automatic inversion of constants so as to replace division by multiplication.

TimP · ‎05-24-2012

The fsincos in the x87 firmware probably uses a rational approximation over a reduced range, taking advantage of range reduction using the fact that the 64-bit precision (10-byte long double) rounded value for Pi happens to be accurate to 66 bits. There's no guarantee that all x87 firmware implementations are the same internally, even though compatibility demands they give the same result and all have the feature of returning the argument itself when its absolute value exceeds 2^64. Math libraries based on SSE2 don't use the non-vectorizable x87 built-ins.
When you're developing your own approximation, a Chebyshev economized series expansion of sin(x)/x over a suitable interval, such as |x| < Pi/4, may be a good point of comparison.

Bernard · ‎05-24-2012

I have found this post "http://software.intel.com/en-us/forums/showthread.php?t=74354" one of the Intel engineers stated that compiler does not use x87 instructions and you stated that Math libraries do not use x87 instructions too.
I would ask you what an approximation can be used for high precision andvectorizable code targeted for function approximation.
Sorry but i do not know chebyshev series expansion.

TimP · ‎05-24-2012

You could study the Numerical Recipes on-line chapters about Chebyshev economization. A literal translation of their code into bc, or running it in 32-bit IA32 mode, will give you enough accuracy to come up with coefficients for a double series. You'll probably also want to consider variations on Horner's method for polynomial evaluation. These are about the simplest ways to come up with competitive math functions based on series approximations.
You might note that transforming a polynomial from separate terms to Horner's form is encouraged by the Fortran standard, although in practice no compilers can do the entire job automatically.
For vectorization, you can take the general approach of svml, where you calculate a number of results in parallel corresponding to the register width, or the vml method, applying your method to a vector.
You also have to consider accuracy vs. speed and vectorization issues in range reduction (discussed in the reference you mentioned).
I would go as far as possible with C or Fortran source code, then start applying intrinsics. This idea is strangely controversial, but I don't see how you can prototype or document without a working high level source code version. You're clearly up against a situation where starting to code in intrinsics without studying algorithms becomes pointless.

Bernard · ‎05-24-2012

I have "NR in C" book and i took a look at book's section about the polynomial expansion it is not so hard to implement it in source code be it java or c.
Sadly i do not know fortran albeit i think i could understand the code while reading it.
My intention is to code in pure assembly i do not like the idea of using intrinsics.I could study the formulae and it's implementation in high-level language source code and try to write the code in assembly or maybe use inline assembly in order to minimize the overhead of coding WindowsI/O in assembly.
The best idea is try to study various approximation methodsandto compare them for speed and accuracy.

Bernard · ‎05-24-2012

sirrida
I wrote improved code exactly as you in your last code snippet.I measured it and i got on average 4-5 cycles.

bronxzv · ‎05-24-2012

notsin(x) but somewhat related, here is a 2 instructionsapproximation to log(x) in AVX2, it's verycoarse butpretty valuable for my use cases (3D graphics):
[bash]vcvtdq2ps ymm1, ymm0 vfmsub213ps ymm1, ymm3, ymm2[/bash]
ymm0: argument
ymm1: result
ymm2: constant (8x broadcast 8.2629582e-8f)
ymm3: constant (8x broadcast 87.989971f)

it'sbased on Paul Mineiro's "fasterlog"example here:http://www.machinedlearnings.com/2011/06/fast-approximate-logarithm-exponential.html

Bernard · ‎05-24-2012

For my approximations i am using famous book "Handbook of mathematical functions".You can find there a plenty formulas for many elementary and special functions greatly recommended.Sadly my cpu does not support AVX2 instructions.

bronxzv · ‎05-24-2012

The Milton Abramowitz classic (Dover) ? I have it too, it's big and heavy!

A more useful book (IMO) for practical coding is Elementary Functions: Algorithms and Implementation by Jean-Michel Muller (2nd edition, Birkhuser)

Bernard · ‎05-24-2012

Yes Milton Abramowitz great book.I have already implemented in java 52elementary and special functions my code is based on formulas from this classic book. Now i am writing all those functions inpure assembly as a simple programms.I would also recommend you another classic book for error analysis of numerical methods
"Real Computing Made Real" great little book written by famous scientist Forman Acton.
Another book also "must have" is "Numerical Recipes in C".
Elementary Functions is my next buy.
Btw what is your IDE or assemblerfor your assembly projects i mean visual studio 2010 or maybe masm32 or nasm?
I use VS 2010 and masm32.

bronxzv · ‎05-25-2012

Actually I never write directly in assembly, all the ASM snippets that I post here are compiler outputs (sometimes with some massaging)

I use VS 2010 as IDE along with the Intel XE 2011 C++ compiler for code generation

For most performance critical kernels I don't rely of the vectorizer but use a high level framework (built around the intrinsics) based on C++ classeswith inlined functions and operators. The code is very readable thanks to operator overloading, for example writingsimply

vImg = T*vSrc;

allows to apply a 3D transform on packed vectors, it will amount for 9 packed multiplications + 9 packedadditions (i.e. the equivalent of 144 scalar fp instructions with AVX)

the actual generated code depends on the code path, for example register allocation will be different in 64-bit mode since there is more logical registers,the AVX2 path will use the FMA instructions, etc.

Bernard · ‎05-25-2012

Regarding high-level OO languages i mostly use java (obligatory language in my CS classes :( ) to lesser extent i use c++.
I see that you are interested in 3d graphicsprogramming.I would like to recommend you another classical book "Physically based rendering" i have 1 edition.This is the most helpful book to learn 3d graphics programming, but the book is heavy on advanced math.I try to develop following this book my own renderer
i write it in java i already have written vector ,point ,rotation and transformation classes. Now i'am struggling with integrator class which describes BSSRDF function my main problem is to calculate the integral because i do not want to simply copy book's code.
Do you know CUDA programming?

bronxzv · ‎05-25-2012

I'm a 3D garphics old timer so I havetons of books including"Physically based rendering" I'm not sureit's the best for a newbie, though

I have no practical experience with CUDA (just watched a few snippets in the GPU gems/ Shader X / GPU pro series) which is a dead man walking anyway. I'm 110% convinced that the futurefor 3D graphicslie in as high as possible languagesto cope with the ever increasing complexityof the 3D engines. We are on the verge of beingmore constrained by the programmer's productivity than by the raw FLOPS. Even C++ with operators overloading isn't enough for my taste, I will prefer a more readable notation(think Mathematica or Word Equation editor) at the momentwe have to write a complex algorithm twice, one time for the source code and another one to document the thing

Bernard · ‎05-25-2012

I have a great time with "Physically based rendering".I like to learn theory first.I made my first steps in 3D graphics with the help of books such as "Opengl Bible" it is great book for newbie but it lacks advanced math needed to understand what rendering is.
I do not agree with you on thesubject of programming languages for 3D graphics.I think that adding another layer of abstraction to hide the complexity of 3D algorithms is not good.Programmer must have the knowledge of inne rworking of an algorithm and technology which algorithm tries to describein order to understand it and to write better more optimized code.
For example at my Uni peopledo not take assembly language class because it is not obligatory.Their understanding of of the CPU architecture and OS inner working lacks because their areaccustomedto high-level languages like java or c#.

bronxzv · ‎05-25-2012

I'm not talking about hiding the algorithms, on the contraryI wantexpressingthem very clearly (in a compact form) and only once, not one timeat design and to document them for the rest of the team and one time to explain to the compiler what I want.

Bernard · ‎05-25-2012

Finally i understood you.You want to change the notation only?

bronxzv · ‎05-25-2012

Mostly the notation, yes, that's what academia compiler people keep describing as "cosmetic" or "syntactic sugar". There is certainly some good reasons all books use a mathematical notation for square roots, exponents, 1 to N sums, divide. etc., it's far more readable for us humans,even after years of coding experience. Any complex project is for the most ofits lifecycle in maintenance phase, very often you have only the source code at hand, and a refactoring effort (for a complex area) may well start with a very boring step where you attempt to retrieve the algorithm from the code and express it in an *equivalent*more readable form

Bernard · ‎05-25-2012

You are right generic programming procedural or object-oriented languages do not employ pure mathematical notation.For example solving linear equation in c or java it is not easy thing becauseyou have to deal with various indices, nested loops,pointers and etc...On theother hand array programming languages like matlab mostly hide itfrom theprogrammer.I think thjat good old fortran is better suited for mathematical programming.