- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello!
While coding in assembly various series expansions of many functions i tried to optimize my code mainly by using rcpps instruction instead of divaps.When performing tests oncode which calculates sine function by taylor series i did a few measurements as explained here :"http://software.intel.com/en-us/forums/showthread.php?t=52482" and the result was between 10-12 cycles per first term of sine expansion(i used 14 terms).
I would like to ask you how can i rewrite this code in order to gain speed of execution improvment.
[bash]movups xmm0,argument movups xmm1,argument mulps xmm1,xmm1 mulps xmm1,xmm0 mov ebx,OFFSET coef movups xmm2,[ebx] rcpps xmm3,xmm2 ;rcpps used instead of divaps mulps xmm1,xmm3 subps xmm0,xmm1[/bash]
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
so it's 63 ns per iteration or ~ 120 clocks on your CPU, it does't match your previous reports IIRCcalls 1e6 times fastsin() the result in millisecond is 63
if you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer than mine
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
movups xmm1,argument
mov ebx,OFFSET rev_coef
movups xmm0,argument
mulps xmm1,xmm1
movups xmm2,[ebx]
mulps xmm1,xmm0
mulps xmm1,xmm2
subps xmm0,xmm1
At least the coefficients should be aligned. If this is the case you could also try this:
movups xmm1,argument
movups xmm0,argument
mov ebx,OFFSET rev_coef
mulps xmm1,xmm1
mulps xmm1,xmm0
mulps xmm1,[ebx]
subps xmm0,xmm1
If also the offset to rev_coef is constant you can also remove the load of ebx:
movups xmm1,argument
movups xmm0,argument
mulps xmm1,xmm1
mulps xmm1,xmm0
mulps xmm1,[OFFSET rev_coef]
subps xmm0,xmm1
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Yep i did not think about the pre-calculations of coefficient's inverse ,it seem like good speed improvemnt.
I initially tried to write the codethat looks like your last example but in the runtime i got access violation error inspite of coefficient beign aligned.
By trying to rewrite this code like you showedi suppose that i could reach 100 cycles for less than 14 terms it could be comparable to hardware accelerated fcos and fsin instruction but with single precision accuracy.
Do you know what an implementation are fcos and fsin based on? I mean an approximation algorithm .
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
When you're developing your own approximation, a Chebyshev economized series expansion of sin(x)/x over a suitable interval, such as |x| < Pi/4, may be a good point of comparison.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I have found this post "http://software.intel.com/en-us/forums/showthread.php?t=74354" one of the Intel engineers stated that compiler does not use x87 instructions and you stated that Math libraries do not use x87 instructions too.
I would ask you what an approximation can be used for high precision andvectorizable code targeted for function approximation.
Sorry but i do not know chebyshev series expansion.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
You might note that transforming a polynomial from separate terms to Horner's form is encouraged by the Fortran standard, although in practice no compilers can do the entire job automatically.
For vectorization, you can take the general approach of svml, where you calculate a number of results in parallel corresponding to the register width, or the vml method, applying your method to a vector.
You also have to consider accuracy vs. speed and vectorization issues in range reduction (discussed in the reference you mentioned).
I would go as far as possible with C or Fortran source code, then start applying intrinsics. This idea is strangely controversial, but I don't see how you can prototype or document without a working high level source code version. You're clearly up against a situation where starting to code in intrinsics without studying algorithms becomes pointless.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Sadly i do not know fortran albeit i think i could understand the code while reading it.
My intention is to code in pure assembly i do not like the idea of using intrinsics.I could study the formulae and it's implementation in high-level language source code and try to write the code in assembly or maybe use inline assembly in order to minimize the overhead of coding WindowsI/O in assembly.
The best idea is try to study various approximation methodsandto compare them for speed and accuracy.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
sirrida
I wrote improved code exactly as you in your last code snippet.I measured it and i got on average 4-5 cycles.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
[bash]vcvtdq2ps ymm1, ymm0 vfmsub213ps ymm1, ymm3, ymm2[/bash]
ymm0: argument
ymm1: result
ymm2: constant (8x broadcast 8.2629582e-8f)
ymm3: constant (8x broadcast 87.989971f)
it'sbased on Paul Mineiro's "fasterlog"example here:http://www.machinedlearnings.com/2011/06/fast-approximate-logarithm-exponential.html
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
A more useful book (IMO) for practical coding is Elementary Functions: Algorithms and Implementation by Jean-Michel Muller (2nd edition, Birkhuser)
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
"Real Computing Made Real" great little book written by famous scientist Forman Acton.
Another book also "must have" is "Numerical Recipes in C".
Elementary Functions is my next buy.
Btw what is your IDE or assemblerfor your assembly projects i mean visual studio 2010 or maybe masm32 or nasm?
I use VS 2010 and masm32.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I use VS 2010 as IDE along with the Intel XE 2011 C++ compiler for code generation
For most performance critical kernels I don't rely of the vectorizer but use a high level framework (built around the intrinsics) based on C++ classeswith inlined functions and operators. The code is very readable thanks to operator overloading, for example writingsimply
vImg = T*vSrc;
allows to apply a 3D transform on packed vectors, it will amount for 9 packed multiplications + 9 packedadditions (i.e. the equivalent of 144 scalar fp instructions with AVX)
the actual generated code depends on the code path, for example register allocation will be different in 64-bit mode since there is more logical registers,the AVX2 path will use the FMA instructions, etc.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I see that you are interested in 3d graphicsprogramming.I would like to recommend you another classical book "Physically based rendering" i have 1 edition.This is the most helpful book to learn 3d graphics programming, but the book is heavy on advanced math.I try to develop following this book my own renderer
i write it in java i already have written vector ,point ,rotation and transformation classes. Now i'am struggling with integrator class which describes BSSRDF function my main problem is to calculate the integral because i do not want to simply copy book's code.
Do you know CUDA programming?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I have no practical experience with CUDA (just watched a few snippets in the GPU gems/ Shader X / GPU pro series) which is a dead man walking anyway. I'm 110% convinced that the futurefor 3D graphicslie in as high as possible languagesto cope with the ever increasing complexityof the 3D engines. We are on the verge of beingmore constrained by the programmer's productivity than by the raw FLOPS. Even C++ with operators overloading isn't enough for my taste, I will prefer a more readable notation(think Mathematica or Word Equation editor) at the momentwe have to write a complex algorithm twice, one time for the source code and another one to document the thing
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I do not agree with you on thesubject of programming languages for 3D graphics.I think that adding another layer of abstraction to hide the complexity of 3D algorithms is not good.Programmer must have the knowledge of inne rworking of an algorithm and technology which algorithm tries to describein order to understand it and to write better more optimized code.
For example at my Uni peopledo not take assembly language class because it is not obligatory.Their understanding of of the CPU architecture and OS inner working lacks because their areaccustomedto high-level languages like java or c#.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page