<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Optimization of sine function's Taylor expansion in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825390#M1364</link>
    <description>&lt;STRONG&gt;&amp;gt;&amp;gt;anyway, *before starting any performance tuning* the first thing to do is to *compile in release mode*&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;Of course I did not compile it in release mode, hence this pesky debug-mode overhead was induced by the compiler.&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;&amp;gt;&amp;gt;I see that the computation of 'sqr' is missing in your ASM dump and the useless store to 'rad' isn't shown too&lt;BR /&gt;&lt;BR /&gt;&lt;/STRONG&gt;The useless rad assignment was removed, thanks for spotting it. For millions of iterations such a useless store can be costly.&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;&amp;gt;&amp;gt;all in all your compiler looks pretty weak for optimization and this is the most likely explanation for the Java JIT-compiled code faster than native code that you experience here&lt;BR /&gt;&lt;BR /&gt;&lt;/STRONG&gt;Now I post code which was compiled in &lt;SPAN style="text-decoration: underline;"&gt;release mode.&lt;BR /&gt;&lt;BR /&gt;The results of 10 million iterations for the release code:&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;STRONG&gt;running time of the fastsin() release code is: 31 milliseconds&lt;BR /&gt;fastsin() is: 0.909297421962549370000000&lt;BR /&gt;&lt;BR /&gt;&lt;/STRONG&gt;Full code, which also includes the main() loop;&lt;BR /&gt;main()'s for-loop with the fastsin() call fully inlined inside the loop&lt;BR /&gt;&lt;BR /&gt;[bash]; 23   :  int main(void){

  00000	55		 push	 ebp
  00001	8b ec		 mov	 ebp, esp
  00003	83 e4 c0	 and	 esp, -64		; ffffffc0H
  00006	83 ec 30	 sub	 esp, 48			; 00000030H

; 24   : 	double e1 = 0;
; 25   : 	 double sine;
; 26   : 	 sine = 0;
; 27   : 	double gam;
; 28   : 	gam = 0;
; 29   : 	double fastgam;
; 30   : 	fastgam = 0;
; 31   : 	 double arg1;
; 32   : 	 arg1 = 1.0f;

  00009	f2 0f 10 05 00
	00 00 00	 movsd	 xmm0, QWORD PTR _one
  00011	53		 push	 ebx
  00012	55		 push	 ebp
  00013	56		 push	 esi
  00014	57		 push	 edi

; 33   : 	unsigned int start2,end2;
; 34   : 	 start2 = GetTickCount();

  00015	8b 3d 00 00 00
	00		 mov	 edi, DWORD PTR __imp__GetTickCount@0
  0001b	f2 0f 11 44 24
	30		 movsd	 QWORD PTR _arg1$[esp+64], xmm0
  00021	ff d7		 call	 edi
  00023	f2 0f 10 15 00
	00 00 00	 movsd	 xmm2, QWORD PTR __real@3e7ad7f2a0000000
  0002b	f2 0f 10 25 00
	00 00 00	 movsd	 xmm4, QWORD PTR __real@3b4761b41316381a
  00033	f2 0f 10 2d 00
	00 00 00	 movsd	 xmm5, QWORD PTR __real@3bd71b8ef6dcf572
  0003b	f2 0f 10 35 00
	00 00 00	 movsd	 xmm6, QWORD PTR __real@3c62f49b46814157
  00043	f2 0f 10 5c 24
	30		 movsd	 xmm3, QWORD PTR _arg1$[esp+64]
  00049	8b f0		 mov	 esi, eax
  0004b	b8 40 42 0f 00	 mov	 eax, 1000000		; 000f4240H
$LL9@main:

; 35   : 	 for(int i2 = 0;i2&amp;lt;10000000;i2++){

  00050	48		 dec	 eax

; 36   : 		 arg1 += 0.0000001f;

  00051	f2 0f 58 da	 addsd	 xmm3, xmm2
  00055	f2 0f 58 da	 addsd	 xmm3, xmm2
  00059	f2 0f 58 da	 addsd	 xmm3, xmm2
  0005d	f2 0f 58 da	 addsd	 xmm3, xmm2
  00061	f2 0f 58 da	 addsd	 xmm3, xmm2
  00065	f2 0f 58 da	 addsd	 xmm3, xmm2
  00069	f2 0f 58 da	 addsd	 xmm3, xmm2
  0006d	f2 0f 58 da	 addsd	 xmm3, xmm2
  00071	f2 0f 58 da	 addsd	 xmm3, xmm2
  00075	f2 0f 58 da	 addsd	 xmm3, xmm2

; 37   : 		 sine = fastsin(arg1);

  00079	66 0f 28 cb	 movapd	 xmm1, xmm3
  0007d	f2 0f 59 cb	 mulsd	 xmm1, xmm3
  00081	66 0f 28 f9	 movapd	 xmm7, xmm1
  00085	f2 0f 59 fc	 mulsd	 xmm7, xmm4
  00089	66 0f 28 c5	 movapd	 xmm0, xmm5
  0008d	f2 0f 5c c7	 subsd	 xmm0, xmm7
  00091	f2 0f 59 c1	 mulsd	 xmm0, xmm1
  00095	f2 0f 5c c6	 subsd	 xmm0, xmm6
  00099	f2 0f 59 c1	 mulsd	 xmm0, xmm1
  0009d	f2 0f 58 05 00
	00 00 00	 addsd	 xmm0, QWORD PTR __real@3ce952c77030ad4a
  000a5	f2 0f 59 c1	 mulsd	 xmm0, xmm1
  000a9	f2 0f 5c 05 00
	00 00 00	 subsd	 xmm0, QWORD PTR __real@3d6ae7f3e733b81f
  000b1	f2 0f 59 c1	 mulsd	 xmm0, xmm1
  000b5	f2 0f 58 05 00
	00 00 00	 addsd	 xmm0, QWORD PTR __real@3de6124613a86d09
  000bd	f2 0f 59 c1	 mulsd	 xmm0, xmm1
  000c1	f2 0f 5c 05 00
	00 00 00	 subsd	 xmm0, QWORD PTR __real@3e5ae64567f544e4
  000c9	f2 0f 59 c1	 mulsd	 xmm0, xmm1
  000cd	f2 0f 58 05 00
	00 00 00	 addsd	 xmm0, QWORD PTR __real@3ec71de3a556c734
  000d5	f2 0f 59 c1	 mulsd	 xmm0, xmm1
  000d9	f2 0f 5c 05 00
	00 00 00	 subsd	 xmm0, QWORD PTR __real@3f2a01a01a01a01a
  000e1	f2 0f 59 c1	 mulsd	 xmm0, xmm1
  000e5	f2 0f 58 05 00
	00 00 00	 addsd	 xmm0, QWORD PTR __real@3f81111111111111
  000ed	f2 0f 59 c1	 mulsd	 xmm0, xmm1
  000f1	f2 0f 5c 05 00
	00 00 00	 subsd	 xmm0, QWORD PTR __real@3fc5555555555555
  000f9	f2 0f 59 cb	 mulsd	 xmm1, xmm3
  000fd	f2 0f 59 c1	 mulsd	 xmm0, xmm1
  00101	f2 0f 58 c3	 addsd	 xmm0, xmm3
  00105	f2 0f 11 44 24
	30		 movsd	 QWORD PTR _sine$[esp+64], xmm0
  0010b	0f 85 3f ff ff
	ff		 jne	 $LL9@main[/bash]</description>
    <pubDate>Fri, 15 Jun 2012 07:10:49 GMT</pubDate>
    <dc:creator>Bernard</dc:creator>
    <dc:date>2012-06-15T07:10:49Z</dc:date>
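The inlined loop body in the listing above is a Taylor polynomial for sine evaluated in Horner form over the square of the argument; the constants are reciprocal factorials (for instance __real@3fc5555555555555 is 1/3! = 0.1666...). As a rough illustration of the series (not the exact compiled code), the same computation can be sketched in plain Python:

```python
import math

def fastsin(x):
    # Taylor series sin(x) = x - x**3/3! + x**5/5! - ..., accumulated
    # term by term; the compiled loop above evaluates the same
    # polynomial in Horner form over s = x*x.
    s = x * x
    term = x
    total = x
    for k in range(1, 15):  # 14 terms past the leading x
        term *= -s / ((2.0 * k) * (2.0 * k + 1.0))
        total += term
    return total
```

For arguments of modest size this agrees with math.sin to near machine precision; the timing differences discussed in the thread come from code generation, not from the formula.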
    <item>
      <title>Optimization of sine function's Taylor expansion</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825269#M1243</link>
      <description>&lt;P&gt;Hello!&lt;BR /&gt;While coding various series expansions of many functions in assembly, I tried to optimize my code mainly by using the rcpps instruction instead of divps. When performing tests on code which calculates the sine function by its Taylor series I made a few measurements, as explained here: "&lt;A href="http://software.intel.com/en-us/forums/showthread.php?t=52482"&gt;http://software.intel.com/en-us/forums/showthread.php?t=52482&lt;/A&gt;", and the result was between 10 and 12 cycles per first term of the sine expansion (I used 14 terms).&lt;BR /&gt;I would like to ask how I can rewrite this code in order to improve its speed of execution.&lt;BR /&gt;[bash]movups xmm0,argument
movups xmm1,argument
mulps xmm1,xmm1
mulps xmm1,xmm0
mov ebx,OFFSET coef
movups xmm2,[ebx]
rcpps xmm3,xmm2 ;rcpps used instead of divps
mulps xmm1,xmm3
subps xmm0,xmm1[/bash]&lt;/P&gt;</description>
      <pubDate>Thu, 24 May 2012 12:29:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825269#M1243</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-05-24T12:29:55Z</dc:date>
    </item>
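The snippet above divides the cubic term by a coefficient via an rcpps estimate; the scalar equivalent of the transformation being discussed (multiplying by a precomputed reciprocal instead of dividing) can be sketched as follows, with illustrative function names:

```python
def step_with_div(x, coef=6.0):
    # one Taylor step, x - x**3/3!, written with a division
    return x - (x * x * x) / coef

def step_with_reciprocal(x, rcp=1.0 / 6.0):
    # the same step with a precomputed reciprocal coefficient,
    # which is what rcpps (or pre-inverted constants) buys you
    return x - (x * x * x) * rcp
```

Note that rcpps itself returns only about 12 bits of precision, so a table of pre-inverted constants is both faster and more accurate than computing the reciprocal estimate inside the loop.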
    <item>
      <title>Optimization of sine function's Taylor expansion</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825270#M1244</link>
      <description>You should be able to gain something if you precalculate the reciprocal coefficients and reorder the code.&lt;BR /&gt;&lt;BR /&gt;movups xmm1,argument&lt;BR /&gt;mov ebx,OFFSET rev_coef&lt;BR /&gt;movups xmm0,argument&lt;BR /&gt;mulps xmm1,xmm1&lt;BR /&gt;movups xmm2,[ebx]&lt;BR /&gt;mulps xmm1,xmm0&lt;BR /&gt;mulps xmm1,xmm2&lt;BR /&gt;subps xmm0,xmm1&lt;BR /&gt;&lt;BR /&gt;At least the coefficients should be aligned. If this is the case you could also try this:&lt;BR /&gt;&lt;BR /&gt;movups xmm1,argument&lt;BR /&gt;
movups xmm0,argument&lt;BR /&gt;

mov ebx,OFFSET rev_coef&lt;BR /&gt;mulps xmm1,xmm1&lt;BR /&gt;
mulps xmm1,xmm0&lt;BR /&gt;

mulps xmm1,[ebx]&lt;BR /&gt;

subps xmm0,xmm1&lt;BR /&gt;&lt;BR /&gt;If the offset to rev_coef is also constant, you can remove the load of ebx:&lt;BR /&gt;&lt;BR /&gt;movups xmm1,argument&lt;BR /&gt;movups xmm0,argument&lt;BR /&gt;

mulps xmm1,xmm1&lt;BR /&gt;
mulps xmm1,xmm0&lt;BR /&gt;

mulps xmm1,[OFFSET rev_coef]&lt;BR /&gt;

subps xmm0,xmm1&lt;BR /&gt;</description>
      <pubDate>Thu, 24 May 2012 12:47:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825270#M1244</guid>
      <dc:creator>sirrida</dc:creator>
      <dc:date>2012-05-24T12:47:04Z</dc:date>
    </item>
    <item>
      <title>Optimization of sine function's Taylor expansion</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825271#M1245</link>
      <description>Thanks for your fast answer!&lt;BR /&gt;Yes, I did not think about pre-calculating the coefficients' inverses; it seems like a good speed improvement.&lt;BR /&gt;I initially tried to write code that looks like your last example, but at runtime I got an access violation error in spite of the coefficients being aligned.&lt;BR /&gt;By rewriting this code as you showed, I suppose I could reach 100 cycles for fewer than 14 terms; that could be comparable to the hardware-accelerated fcos and fsin instructions, but with single precision accuracy.&lt;BR /&gt;Do you know what implementation fcos and fsin are based on? I mean the approximation algorithm.</description>
      <pubDate>Thu, 24 May 2012 13:05:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825271#M1245</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-05-24T13:05:20Z</dc:date>
    </item>
    <item>
      <title>Optimization of sine function's Taylor expansion</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825272#M1246</link>
      <description>As the other responder implied, it doesn't make sense to leave division in an implementation of a Taylor (or, better, Chebyshev economized or similar) power series. The rcp division approximation might be usable for the smallest term of a series. An iterative approximation to approach full accuracy takes as long (longer, on the most up-to-date CPUs) as a full implementation of division, and is referred to by the Intel compilers as a "throughput" optimization, as the advantage would show when multiple independent calculations can be interleaved by instruction-level parallelism. The same compile options which introduce this "throughput" optimization usually also perform automatic inversion of constants so as to replace division by multiplication.</description>
      <pubDate>Thu, 24 May 2012 13:06:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825272#M1246</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2012-05-24T13:06:37Z</dc:date>
    </item>
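The iterative refinement mentioned above is usually a Newton-Raphson step applied to the reciprocal estimate; each step roughly squares the relative error, which is why polishing a coarse rcpps result to full accuracy costs a chain of dependent operations comparable to a real divide. A scalar sketch:

```python
def refine(x, y):
    # one Newton-Raphson step for y approximating 1/x:
    # y_next = y * (2 - x*y); the relative error is squared each step
    return y * (2.0 - x * y)

# start from a deliberately coarse estimate of 1/3
y = 0.3
for _ in range(3):
    y = refine(3.0, y)
```

Three dependent multiply/subtract pairs take the estimate from about one correct decimal digit to roughly eight, mirroring how a 12-bit rcpps estimate reaches near-single precision after one step.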
    <item>
      <title>Optimization of sine function's Taylor expansion</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825273#M1247</link>
      <description>The fsincos in the x87 firmware probably uses a rational approximation over a reduced range, taking advantage of range reduction using the fact that the 64-bit precision (10-byte long double) rounded value for Pi happens to be accurate to 66 bits. There's no guarantee that all x87 firmware implementations are the same internally, even though compatibility demands they give the same result and all have the feature of returning the argument itself when its absolute value exceeds 2^64. Math libraries based on SSE2 don't use the non-vectorizable x87 built-ins.&lt;BR /&gt;When you're developing your own approximation, a Chebyshev economized series expansion of sin(x)/x over a suitable interval, such as |x| &amp;lt; Pi/4, may be a good point of comparison.</description>
      <pubDate>Thu, 24 May 2012 13:23:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825273#M1247</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2012-05-24T13:23:38Z</dc:date>
    </item>
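A hedged sketch of the scheme described above: reduce the argument by multiples of pi/2 into a small interval around zero, evaluate a short polynomial there, and let the quadrant choose between the sine and cosine polynomials and the sign. The naive reduction below loses accuracy for very large arguments, which is exactly why real libraries lean on the extended-precision value of Pi mentioned above:

```python
import math

def poly_sin(r):
    # short Taylor polynomial, accurate when abs(r) stays near pi/4
    r2 = r * r
    p = -1.0 / 5040 + r2 * (1.0 / 362880)
    p = 1.0 / 120 + r2 * p
    p = -1.0 / 6 + r2 * p
    return r * (1.0 + r2 * p)

def poly_cos(r):
    r2 = r * r
    p = -1.0 / 720 + r2 * (1.0 / 40320)
    p = 1.0 / 24 + r2 * p
    p = -0.5 + r2 * p
    return 1.0 + r2 * p

def sin_reduced(x):
    # reduce x = k*(pi/2) + r with abs(r) at most about pi/4,
    # then pick sin/cos and the sign from the quadrant k mod 4
    k = round(x * 2.0 / math.pi)
    r = x - k * (math.pi / 2.0)
    q = k % 4
    if q == 0:
        return poly_sin(r)
    if q == 1:
        return poly_cos(r)
    if q == 2:
        return -poly_sin(r)
    return -poly_cos(r)
```

Because the reduced argument never exceeds roughly 0.79, a handful of terms already gives errors in the 1e-8 range, which is the point of comparing against a fitted polynomial on that interval.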
    <item>
      <title>Optimization of sine function's Taylor expansion</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825274#M1248</link>
      <description>&lt;P&gt;I have found this post: "&lt;A href="http://software.intel.com/en-us/forums/showthread.php?t=74354"&gt;http://software.intel.com/en-us/forums/showthread.php?t=74354&lt;/A&gt;". One of the Intel engineers stated that the compiler does not use x87 instructions, and you stated that math libraries do not use x87 instructions either.&lt;BR /&gt;I would like to ask what approximation can be used for high-precision and vectorizable code targeted at function approximation.&lt;BR /&gt;Sorry, but I do not know the Chebyshev series expansion.&lt;/P&gt;</description>
      <pubDate>Thu, 24 May 2012 14:08:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825274#M1248</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-05-24T14:08:05Z</dc:date>
    </item>
    <item>
      <title>Optimization of sine function's Taylor expansion</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825275#M1249</link>
      <description>You could study the Numerical Recipes on-line chapters about Chebyshev economization. A literal translation of their code into bc, or running it in 32-bit IA32 mode, will give you enough accuracy to come up with coefficients for a double series. You'll probably also want to consider variations on Horner's method for polynomial evaluation. These are about the simplest ways to come up with competitive math functions based on series approximations.&lt;BR /&gt;You might note that transforming a polynomial from separate terms to Horner's form is encouraged by the Fortran standard, although in practice no compilers can do the entire job automatically.&lt;BR /&gt;For vectorization, you can take the general approach of svml, where you calculate a number of results in parallel corresponding to the register width, or the vml method, applying your method to a vector.&lt;BR /&gt;You also have to consider accuracy vs. speed and vectorization issues in range reduction (discussed in the reference you mentioned).&lt;BR /&gt;I would go as far as possible with C or Fortran source code, then start applying intrinsics. This idea is strangely controversial, but I don't see how you can prototype or document without a working high level source code version. You're clearly up against a situation where starting to code in intrinsics without studying algorithms becomes pointless.</description>
      <pubDate>Thu, 24 May 2012 14:48:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825275#M1249</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2012-05-24T14:48:25Z</dc:date>
    </item>
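As a concrete starting point for the Chebyshev route suggested above, the classic cosine-node formula (the approach of Numerical Recipes' chebft routine, with the c0 term entering at half weight) plus Clenshaw evaluation fits in a few lines; this is a prototype sketch, not production range handling:

```python
import math

def chebyshev_coeffs(f, n):
    # Chebyshev coefficients c_0..c_(n-1) of f on [-1, 1], sampled at
    # the cosine nodes x_k = cos(pi*(k + 0.5)/n)
    nodes = [math.cos(math.pi * (k + 0.5) / n) for k in range(n)]
    fv = [f(x) for x in nodes]
    return [2.0 / n * sum(fv[k] * math.cos(math.pi * j * (k + 0.5) / n)
                          for k in range(n))
            for j in range(n)]

def chebyshev_eval(coeffs, x):
    # Clenshaw recurrence; c_0 enters with weight one half
    b1 = b2 = 0.0
    for c in reversed(coeffs[1:]):
        b1, b2 = 2.0 * x * b1 - b2 + c, b1
    return x * b1 - b2 + 0.5 * coeffs[0]
```

Economization is then just truncation: for sin on [-1, 1] the coefficients fall off so fast that dropping everything past degree 9 or so still leaves errors far below single precision, and the truncation error is bounded by the sum of the dropped coefficients.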
    <item>
      <title>Optimization of sine function's Taylor expansion</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825276#M1250</link>
      <description>I have the "NR in C" book and I took a look at the book's section about polynomial expansion; it is not so hard to implement in source code, be it Java or C.&lt;BR /&gt;Sadly I do not know Fortran, albeit I think I could understand the code while reading it.&lt;BR /&gt;My intention is to code in pure assembly; I do not like the idea of using intrinsics. I could study the formulae and their implementation in high-level language source code and try to write the code in assembly, or maybe use inline assembly in order to minimize the overhead of coding Windows I/O in assembly.&lt;BR /&gt;The best idea is to study various approximation methods and to compare them for speed and accuracy.</description>
      <pubDate>Thu, 24 May 2012 15:30:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825276#M1250</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-05-24T15:30:09Z</dc:date>
    </item>
    <item>
      <title>Optimization of sine function's Taylor expansion</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825277#M1251</link>
      <description>&lt;P&gt;sirrida,&lt;BR /&gt;I wrote improved code exactly as in your last code snippet. I measured it and got on average 4-5 cycles.&lt;/P&gt;</description>
      <pubDate>Thu, 24 May 2012 17:07:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825277#M1251</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-05-24T17:07:40Z</dc:date>
    </item>
    <item>
      <title>Optimization of sine function's Taylor expansion</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825278#M1252</link>
      <description>Not sin(x) but somewhat related: here is a 2-instruction approximation to log(x) in AVX2. It's very coarse but pretty valuable for my use cases (3D graphics):&lt;BR /&gt;[bash]vcvtdq2ps ymm1, ymm0
vfmsub213ps ymm1, ymm3, ymm2[/bash]&lt;BR /&gt;ymm0: argument&lt;BR /&gt;ymm1: result&lt;BR /&gt;ymm2: constant (8x broadcast 8.2629582e-8f)&lt;BR /&gt;ymm3: constant (8x broadcast 87.989971f)&lt;BR /&gt;&lt;BR /&gt;It's based on Paul Mineiro's "fasterlog" example here: &lt;A href="http://www.machinedlearnings.com/2011/06/fast-approximate-logarithm-exponential.html"&gt;http://www.machinedlearnings.com/2011/06/fast-approximate-logarithm-exponential.html&lt;/A&gt;</description>
      <pubDate>Thu, 24 May 2012 18:17:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825278#M1252</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2012-05-24T18:17:49Z</dc:date>
    </item>
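The trick behind those two instructions is that reinterpreting an IEEE-754 float32's bit pattern as an integer yields, after scaling, a piecewise-linear approximation of log2: vcvtdq2ps converts the raw bits to float, and the FMA applies the scale and offset from the post. A scalar Python sketch using the same two constants:

```python
import math
import struct

def fasterlog(x):
    # view the float32 bits of x as an unsigned integer; bits / 2**23
    # is roughly 127 + log2(x), so one scale (ln(2)/2**23, about
    # 8.2629582e-8) and one offset give a coarse natural log
    bits = struct.unpack('I', struct.pack('f', x))[0]
    return bits * 8.2629582e-8 - 87.989971
```

The absolute error is around 0.04 near x = 1, consistent with the "very coarse" caveat above.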
    <item>
      <title>Optimization of sine function's Taylor expansion</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825279#M1253</link>
      <description>For my approximations I am using the famous book "Handbook of Mathematical Functions"; you can find there plenty of formulas for many elementary and special functions. Greatly recommended. Sadly, my CPU does not support AVX2 instructions.</description>
      <pubDate>Thu, 24 May 2012 19:21:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825279#M1253</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-05-24T19:21:40Z</dc:date>
    </item>
    <item>
      <title>Optimization of sine function's Taylor expansion</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825280#M1254</link>
      <description>The Milton Abramowitz classic (Dover)? I have it too; it's big and heavy!&lt;BR /&gt;&lt;BR /&gt;A more useful book (IMO) for practical coding is Elementary Functions: Algorithms and Implementation by Jean-Michel Muller (2nd edition, Birkhäuser).</description>
      <pubDate>Thu, 24 May 2012 19:31:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825280#M1254</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2012-05-24T19:31:20Z</dc:date>
    </item>
    <item>
      <title>Optimization of sine function's Taylor expansion</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825281#M1255</link>
      <description>Yes, the Milton Abramowitz book is great. I have already implemented 52 elementary and special functions in Java; my code is based on formulas from this classic book. Now I am writing all those functions in pure assembly as simple programs. I would also recommend another classic book on the error analysis of numerical methods:&lt;BR /&gt;"Real Computing Made Real", a great little book written by the famous scientist Forman Acton.&lt;BR /&gt;Another "must have" book is "Numerical Recipes in C".&lt;BR /&gt;Elementary Functions is my next buy.&lt;BR /&gt;Btw, what is your IDE or assembler for your assembly projects; I mean Visual Studio 2010, or maybe masm32 or nasm?&lt;BR /&gt;I use VS 2010 and masm32.</description>
      <pubDate>Fri, 25 May 2012 05:12:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825281#M1255</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-05-25T05:12:52Z</dc:date>
    </item>
    <item>
      <title>Optimization of sine function's Taylor expansion</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825282#M1256</link>
      <description>Actually, I never write directly in assembly; all the ASM snippets that I post here are compiler outputs (sometimes with some massaging).&lt;BR /&gt;&lt;BR /&gt;I use VS 2010 as IDE along with the Intel XE 2011 C++ compiler for code generation.&lt;BR /&gt;&lt;BR /&gt;For the most performance-critical kernels I don't rely on the vectorizer but use a high-level framework (built around the intrinsics) based on C++ classes with inlined functions and operators. The code is very readable thanks to operator overloading; for example, simply writing&lt;BR /&gt;&lt;BR /&gt;vImg = T*vSrc;&lt;BR /&gt;&lt;BR /&gt;applies a 3D transform to packed vectors. It amounts to 9 packed multiplications + 9 packed additions (i.e. the equivalent of 144 scalar fp instructions with AVX).&lt;BR /&gt;&lt;BR /&gt;The actual generated code depends on the code path; for example, register allocation will be different in 64-bit mode since there are more logical registers, the AVX2 path will use the FMA instructions, etc.</description>
      <pubDate>Fri, 25 May 2012 07:32:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825282#M1256</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2012-05-25T07:32:25Z</dc:date>
    </item>
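The "vImg = T*vSrc" idiom above relies only on operator overloading; a hypothetical scalar Python analogue (class and member names invented here for illustration) shows where the 9 multiplications and 9 additions per vector come from when the transform includes a translation:

```python
class Vec3:
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

class Transform3D:
    # 3x3 rotation/scale rows plus a translation column
    def __init__(self, rows, t):
        self.rows, self.t = rows, t

    def __mul__(self, v):
        # 9 multiplications + 9 additions, the per-vector cost cited
        # above; a SIMD framework does the same math on packed lanes
        r, t = self.rows, self.t
        return Vec3(
            r[0][0] * v.x + r[0][1] * v.y + r[0][2] * v.z + t[0],
            r[1][0] * v.x + r[1][1] * v.y + r[1][2] * v.z + t[1],
            r[2][0] * v.x + r[2][1] * v.y + r[2][2] * v.z + t[2],
        )
```

In the C++ framework described above the same `T*vSrc` expression dispatches to inlined intrinsics, so the readable notation costs nothing at runtime.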
    <item>
      <title>Optimization of sine function's Taylor expansion</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825283#M1257</link>
      <description>Regarding high-level OO languages, I mostly use Java (the obligatory language in my CS classes :( ) and, to a lesser extent, C++.&lt;BR /&gt;I see that you are interested in 3D graphics programming. I would like to recommend another classic book, "Physically Based Rendering"; I have the 1st edition. This is the most helpful book for learning 3D graphics programming, but it is heavy on advanced math. Following this book, I am trying to develop my own renderer;&lt;BR /&gt;I write it in Java and have already written vector, point, rotation and transformation classes. Now I am struggling with the integrator class which describes the BSSRDF function; my main problem is calculating the integral, because I do not want to simply copy the book's code.&lt;BR /&gt;Do you know CUDA programming?</description>
      <pubDate>Fri, 25 May 2012 09:11:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825283#M1257</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-05-25T09:11:39Z</dc:date>
    </item>
    <item>
      <title>Optimization of sine function's Taylor expansion</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825284#M1258</link>
      <description>I'm a 3D graphics old-timer, so I have tons of books, including "Physically Based Rendering". I'm not sure it's the best for a newbie, though.&lt;BR /&gt;&lt;BR /&gt;I have no practical experience with CUDA (I just watched a few snippets in the GPU Gems / Shader X / GPU Pro series), which is a dead man walking anyway. I'm 110% convinced that the future of 3D graphics lies in languages as high-level as possible, to cope with the ever-increasing complexity of 3D engines. We are on the verge of being more constrained by the programmer's productivity than by the raw FLOPS. Even C++ with operator overloading isn't enough for my taste; I would prefer a more readable notation (think Mathematica or the Word equation editor). At the moment we have to write a complex algorithm twice: one time for the source code, and another time to document the thing.</description>
      <pubDate>Fri, 25 May 2012 09:31:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825284#M1258</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2012-05-25T09:31:07Z</dc:date>
    </item>
    <item>
      <title>Optimization of sine function's Taylor expansion</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825285#M1259</link>
      <description>I am having a great time with "Physically Based Rendering"; I like to learn theory first. I made my first steps in 3D graphics with the help of books such as the "OpenGL Bible"; it is a great book for a newbie, but it lacks the advanced math needed to understand what rendering is.&lt;BR /&gt;I do not agree with you on the subject of programming languages for 3D graphics. I think that adding another layer of abstraction to hide the complexity of 3D algorithms is not good. A programmer must have knowledge of the inner working of an algorithm, and of the technology the algorithm tries to describe, in order to understand it and to write better, more optimized code.&lt;BR /&gt;For example, at my uni people do not take the assembly language class because it is not obligatory. Their understanding of the CPU architecture and the OS's inner working suffers because they are accustomed to high-level languages like Java or C#.</description>
      <pubDate>Fri, 25 May 2012 09:55:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825285#M1259</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-05-25T09:55:17Z</dc:date>
    </item>
    <item>
      <title>Optimization of sine function's Taylor expansion</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825286#M1260</link>
      <description>I'm not talking about hiding the algorithms; on the contrary, I want to express them very clearly (in a compact form) and only once, not one time at design, to document them for the rest of the team, and one more time to explain to the compiler what I want.</description>
      <pubDate>Fri, 25 May 2012 10:00:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825286#M1260</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2012-05-25T10:00:18Z</dc:date>
    </item>
    <item>
      <title>Optimization of sine function's Taylor expansion</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825287#M1261</link>
      <description>Finally, I understand you. You want to change only the notation?</description>
      <pubDate>Fri, 25 May 2012 10:19:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825287#M1261</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-05-25T10:19:53Z</dc:date>
    </item>
    <item>
      <title>Optimization of sine function's Taylor expansion</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825288#M1262</link>
      <description>Mostly the notation, yes; that's what academia compiler people keep describing as "cosmetic" or "syntactic sugar". There are certainly good reasons why all books use mathematical notation for square roots, exponents, 1-to-N sums, division, etc.: it is far more readable for us humans, even after years of coding experience. Any complex project is, for most of its lifecycle, in the maintenance phase; very often you have only the source code at hand, and a refactoring effort (for a complex area) may well start with a very boring step where you attempt to retrieve the algorithm from the code and express it in an *equivalent*, more readable form.</description>
      <pubDate>Fri, 25 May 2012 10:30:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Optimization-of-sine-function-s-taylor-expansion/m-p/825288#M1262</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2012-05-25T10:30:34Z</dc:date>
    </item>
  </channel>
</rss>

