Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Intel Community
- Software
- Software Development Technologies
- Intel® ISA Extensions
- Optimization of sine function's taylor expansion

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

Bernard

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-24-2012
05:29 AM

1,112 Views

Optimization of sine function's taylor expansion

While coding in assembly various series expansions of many functions i tried to optimize my code mainly by using rcpps instruction instead of divaps.When performing tests oncode which calculates sine function by taylor series i did a few measurements as explained here :"http://software.intel.com/en-us/forums/showthread.php?t=52482" and the result was between 10-12 cycles per first term of sine expansion(i used 14 terms).

I would like to ask you how can i rewrite this code in order to gain speed of execution improvment.

[bash]movups xmm0,argument movups xmm1,argument mulps xmm1,xmm1 mulps xmm1,xmm0 mov ebx,OFFSET coef movups xmm2,[ebx] rcpps xmm3,xmm2 ;rcpps used instead of divaps mulps xmm1,xmm3 subps xmm0,xmm1[/bash]

1 Solution

bronxzv

New Contributor II

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

06-08-2012
02:21 AM

921 Views

so it's 63 ns per iteration or ~ 120 clocks on your CPU, it does't match your previous reports IIRCcalls 1e6 times fastsin() the result in millisecond is 63

if you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer than mine

Link Copied

342 Replies

sirrida

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-24-2012
05:47 AM

410 Views

movups xmm1,argument

mov ebx,OFFSET rev_coef

movups xmm0,argument

mulps xmm1,xmm1

movups xmm2,[ebx]

mulps xmm1,xmm0

mulps xmm1,xmm2

subps xmm0,xmm1

At least the coefficients should be aligned. If this is the case you could also try this:

movups xmm1,argument

movups xmm0,argument

mov ebx,OFFSET rev_coef

mulps xmm1,xmm1

mulps xmm1,xmm0

mulps xmm1,[ebx]

subps xmm0,xmm1

If also the offset to rev_coef is constant you can also remove the load of ebx:

movups xmm1,argument

movups xmm0,argument

mulps xmm1,xmm1

mulps xmm1,xmm0

mulps xmm1,[OFFSET rev_coef]

subps xmm0,xmm1

Bernard

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-24-2012
06:05 AM

410 Views

Yep i did not think about the pre-calculations of coefficient's inverse ,it seem like good speed improvemnt.

I initially tried to write the codethat looks like your last example but in the runtime i got access violation error inspite of coefficient beign aligned.

By trying to rewrite this code like you showedi suppose that i could reach 100 cycles for less than 14 terms it could be comparable to hardware accelerated fcos and fsin instruction but with single precision accuracy.

Do you know what an implementation are fcos and fsin based on? I mean an approximation algorithm .

TimP

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-24-2012
06:06 AM

410 Views

TimP

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-24-2012
06:23 AM

410 Views

When you're developing your own approximation, a Chebyshev economized series expansion of sin(x)/x over a suitable interval, such as |x| < Pi/4, may be a good point of comparison.

Bernard

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-24-2012
07:08 AM

410 Views

I would ask you what an approximation can be used for high precision andvectorizable code targeted for function approximation.

Sorry but i do not know chebyshev series expansion.

TimP

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-24-2012
07:48 AM

410 Views

You might note that transforming a polynomial from separate terms to Horner's form is encouraged by the Fortran standard, although in practice no compilers can do the entire job automatically.

For vectorization, you can take the general approach of svml, where you calculate a number of results in parallel corresponding to the register width, or the vml method, applying your method to a vector.

You also have to consider accuracy vs. speed and vectorization issues in range reduction (discussed in the reference you mentioned).

I would go as far as possible with C or Fortran source code, then start applying intrinsics. This idea is strangely controversial, but I don't see how you can prototype or document without a working high level source code version. You're clearly up against a situation where starting to code in intrinsics without studying algorithms becomes pointless.

Bernard

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-24-2012
08:30 AM

410 Views

Sadly i do not know fortran albeit i think i could understand the code while reading it.

My intention is to code in pure assembly i do not like the idea of using intrinsics.I could study the formulae and it's implementation in high-level language source code and try to write the code in assembly or maybe use inline assembly in order to minimize the overhead of coding WindowsI/O in assembly.

The best idea is try to study various approximation methodsandto compare them for speed and accuracy.

Bernard

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-24-2012
10:07 AM

410 Views

I wrote improved code exactly as you in your last code snippet.I measured it and i got on average 4-5 cycles.

bronxzv

New Contributor II

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-24-2012
11:17 AM

410 Views

[bash]vcvtdq2ps ymm1, ymm0 vfmsub213ps ymm1, ymm3, ymm2[/bash]

ymm0: argument

ymm1: result

ymm2: constant (8x broadcast 8.2629582e-8f)

ymm3: constant (8x broadcast 87.989971f)

it'sbased on Paul Mineiro's "fasterlog"example here:http://www.machinedlearnings.com/2011/06/fast-approximate-logarithm-exponential.html

Bernard

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-24-2012
12:21 PM

410 Views

bronxzv

New Contributor II

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-24-2012
12:31 PM

410 Views

A more useful book (IMO) for practical coding is Elementary Functions: Algorithms and Implementation by Jean-Michel Muller (2nd edition, Birkhuser)

Bernard

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-24-2012
10:12 PM

410 Views

"Real Computing Made Real" great little book written by famous scientist Forman Acton.

Another book also "must have" is "Numerical Recipes in C".

Elementary Functions is my next buy.

Btw what is your IDE or assemblerfor your assembly projects i mean visual studio 2010 or maybe masm32 or nasm?

I use VS 2010 and masm32.

bronxzv

New Contributor II

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-25-2012
12:32 AM

410 Views

I use VS 2010 as IDE along with the Intel XE 2011 C++ compiler for code generation

For most performance critical kernels I don't rely of the vectorizer but use a high level framework (built around the intrinsics) based on C++ classeswith inlined functions and operators. The code is very readable thanks to operator overloading, for example writingsimply

vImg = T*vSrc;

allows to apply a 3D transform on packed vectors, it will amount for 9 packed multiplications + 9 packedadditions (i.e. the equivalent of 144 scalar fp instructions with AVX)

the actual generated code depends on the code path, for example register allocation will be different in 64-bit mode since there is more logical registers,the AVX2 path will use the FMA instructions, etc.

Bernard

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-25-2012
02:11 AM

410 Views

I see that you are interested in 3d graphicsprogramming.I would like to recommend you another classical book "Physically based rendering" i have 1 edition.This is the most helpful book to learn 3d graphics programming, but the book is heavy on advanced math.I try to develop following this book my own renderer

i write it in java i already have written vector ,point ,rotation and transformation classes. Now i'am struggling with integrator class which describes BSSRDF function my main problem is to calculate the integral because i do not want to simply copy book's code.

Do you know CUDA programming?

bronxzv

New Contributor II

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-25-2012
02:31 AM

410 Views

I have no practical experience with CUDA (just watched a few snippets in the GPU gems/ Shader X / GPU pro series) which is a dead man walking anyway. I'm 110% convinced that the futurefor 3D graphicslie in as high as possible languagesto cope with the ever increasing complexityof the 3D engines. We are on the verge of beingmore constrained by the programmer's productivity than by the raw FLOPS. Even C++ with operators overloading isn't enough for my taste, I will prefer a more readable notation(think Mathematica or Word Equation editor) at the momentwe have to write a complex algorithm twice, one time for the source code and another one to document the thing

Bernard

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-25-2012
02:55 AM

410 Views

I do not agree with you on thesubject of programming languages for 3D graphics.I think that adding another layer of abstraction to hide the complexity of 3D algorithms is not good.Programmer must have the knowledge of inne rworking of an algorithm and technology which algorithm tries to describein order to understand it and to write better more optimized code.

For example at my Uni peopledo not take assembly language class because it is not obligatory.Their understanding of of the CPU architecture and OS inner working lacks because their areaccustomedto high-level languages like java or c#.

bronxzv

New Contributor II

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-25-2012
03:00 AM

410 Views

Bernard

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-25-2012
03:19 AM

410 Views

Finally i understood you.You want to change the notation only?

bronxzv

New Contributor II

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-25-2012
03:30 AM

410 Views

Bernard

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-25-2012
03:53 AM

185 Views

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

For more complete information about compiler optimizations, see our Optimization Notice.