Hello!
While coding in assembly various series expansions of many functions i tried to optimize my code mainly by using rcpps instruction instead of divaps.When performing tests oncode which calculates sine function by taylor series i did a few measurements as explained here :"http://software.intel.com/en-us/forums/showthread.php?t=52482" and the result was between 10-12 cycles per first term of sine expansion(i used 14 terms).
I would like to ask you how can i rewrite this code in order to gain speed of execution improvment.
[bash]movups xmm0,argument movups xmm1,argument mulps xmm1,xmm1 mulps xmm1,xmm0 mov ebx,OFFSET coef movups xmm2,[ebx] rcpps xmm3,xmm2 ;rcpps used instead of divaps mulps xmm1,xmm3 subps xmm0,xmm1[/bash]
so it's 63 ns per iteration or ~ 120 clocks on your CPU, it does't match your previous reports IIRCcalls 1e6 times fastsin() the result in millisecond is 63
if you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer than mine
链接已复制
movups xmm1,argument
mov ebx,OFFSET rev_coef
movups xmm0,argument
mulps xmm1,xmm1
movups xmm2,[ebx]
mulps xmm1,xmm0
mulps xmm1,xmm2
subps xmm0,xmm1
At least the coefficients should be aligned. If this is the case you could also try this:
movups xmm1,argument
movups xmm0,argument
mov ebx,OFFSET rev_coef
mulps xmm1,xmm1
mulps xmm1,xmm0
mulps xmm1,[ebx]
subps xmm0,xmm1
If also the offset to rev_coef is constant you can also remove the load of ebx:
movups xmm1,argument
movups xmm0,argument
mulps xmm1,xmm1
mulps xmm1,xmm0
mulps xmm1,[OFFSET rev_coef]
subps xmm0,xmm1
Yep i did not think about the pre-calculations of coefficient's inverse ,it seem like good speed improvemnt.
I initially tried to write the codethat looks like your last example but in the runtime i got access violation error inspite of coefficient beign aligned.
By trying to rewrite this code like you showedi suppose that i could reach 100 cycles for less than 14 terms it could be comparable to hardware accelerated fcos and fsin instruction but with single precision accuracy.
Do you know what an implementation are fcos and fsin based on? I mean an approximation algorithm .
When you're developing your own approximation, a Chebyshev economized series expansion of sin(x)/x over a suitable interval, such as |x| < Pi/4, may be a good point of comparison.
I have found this post "http://software.intel.com/en-us/forums/showthread.php?t=74354" one of the Intel engineers stated that compiler does not use x87 instructions and you stated that Math libraries do not use x87 instructions too.
I would ask you what an approximation can be used for high precision andvectorizable code targeted for function approximation.
Sorry but i do not know chebyshev series expansion.
You might note that transforming a polynomial from separate terms to Horner's form is encouraged by the Fortran standard, although in practice no compilers can do the entire job automatically.
For vectorization, you can take the general approach of svml, where you calculate a number of results in parallel corresponding to the register width, or the vml method, applying your method to a vector.
You also have to consider accuracy vs. speed and vectorization issues in range reduction (discussed in the reference you mentioned).
I would go as far as possible with C or Fortran source code, then start applying intrinsics. This idea is strangely controversial, but I don't see how you can prototype or document without a working high level source code version. You're clearly up against a situation where starting to code in intrinsics without studying algorithms becomes pointless.
Sadly i do not know fortran albeit i think i could understand the code while reading it.
My intention is to code in pure assembly i do not like the idea of using intrinsics.I could study the formulae and it's implementation in high-level language source code and try to write the code in assembly or maybe use inline assembly in order to minimize the overhead of coding WindowsI/O in assembly.
The best idea is try to study various approximation methodsandto compare them for speed and accuracy.
[bash]vcvtdq2ps ymm1, ymm0 vfmsub213ps ymm1, ymm3, ymm2[/bash]
ymm0: argument
ymm1: result
ymm2: constant (8x broadcast 8.2629582e-8f)
ymm3: constant (8x broadcast 87.989971f)
it'sbased on Paul Mineiro's "fasterlog"example here:http://www.machinedlearnings.com/2011/06/fast-approximate-logarithm-exponential.html
A more useful book (IMO) for practical coding is Elementary Functions: Algorithms and Implementation by Jean-Michel Muller (2nd edition, Birkhuser)
"Real Computing Made Real" great little book written by famous scientist Forman Acton.
Another book also "must have" is "Numerical Recipes in C".
Elementary Functions is my next buy.
Btw what is your IDE or assemblerfor your assembly projects i mean visual studio 2010 or maybe masm32 or nasm?
I use VS 2010 and masm32.
I use VS 2010 as IDE along with the Intel XE 2011 C++ compiler for code generation
For most performance critical kernels I don't rely of the vectorizer but use a high level framework (built around the intrinsics) based on C++ classeswith inlined functions and operators. The code is very readable thanks to operator overloading, for example writingsimply
vImg = T*vSrc;
allows to apply a 3D transform on packed vectors, it will amount for 9 packed multiplications + 9 packedadditions (i.e. the equivalent of 144 scalar fp instructions with AVX)
the actual generated code depends on the code path, for example register allocation will be different in 64-bit mode since there is more logical registers,the AVX2 path will use the FMA instructions, etc.
I see that you are interested in 3d graphicsprogramming.I would like to recommend you another classical book "Physically based rendering" i have 1 edition.This is the most helpful book to learn 3d graphics programming, but the book is heavy on advanced math.I try to develop following this book my own renderer
i write it in java i already have written vector ,point ,rotation and transformation classes. Now i'am struggling with integrator class which describes BSSRDF function my main problem is to calculate the integral because i do not want to simply copy book's code.
Do you know CUDA programming?
I have no practical experience with CUDA (just watched a few snippets in the GPU gems/ Shader X / GPU pro series) which is a dead man walking anyway. I'm 110% convinced that the futurefor 3D graphicslie in as high as possible languagesto cope with the ever increasing complexityof the 3D engines. We are on the verge of beingmore constrained by the programmer's productivity than by the raw FLOPS. Even C++ with operators overloading isn't enough for my taste, I will prefer a more readable notation(think Mathematica or Word Equation editor) at the momentwe have to write a complex algorithm twice, one time for the source code and another one to document the thing
I do not agree with you on thesubject of programming languages for 3D graphics.I think that adding another layer of abstraction to hide the complexity of 3D algorithms is not good.Programmer must have the knowledge of inne rworking of an algorithm and technology which algorithm tries to describein order to understand it and to write better more optimized code.
For example at my Uni peopledo not take assembly language class because it is not obligatory.Their understanding of of the CPU architecture and OS inner working lacks because their areaccustomedto high-level languages like java or c#.
