Solved: Optimization of sine function's Taylor expansion - Page 6

Bernard · ‎05-24-2012

Hello!
While coding in assembly various series expansions of many functions i tried to optimize my code mainly by using rcpps instruction instead of divaps.When performing tests oncode which calculates sine function by taylor series i did a few measurements as explained here :"http://software.intel.com/en-us/forums/showthread.php?t=52482" and the result was between 10-12 cycles per first term of sine expansion(i used 14 terms).
I would like to ask you how can i rewrite this code in order to gain speed of execution improvment.
[bash]movups xmm0,argument movups xmm1,argument mulps xmm1,xmm1 mulps xmm1,xmm0 mov ebx,OFFSET coef movups xmm2,[ebx] rcpps xmm3,xmm2 ;rcpps used instead of divaps mulps xmm1,xmm3 subps xmm0,xmm1[/bash]

bronxzv · ‎06-08-2012

calls 1e6 times fastsin() the result in millisecond is 63

so it's 63 ns per iteration or ~ 120 clocks on your CPU, it does't match your previous reports IIRC

if you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer than mine

View solution in original post

Bernard · ‎06-12-2012

Hi
I increased iteration count to 30 000000 iterations and ran it a few times.I got very confused results.It seems to me that java fastsin() ran faster than native C-code fastsin().Can not use timingmethods based on
QueryPerformanceCounter when I sample javacode.
Results for native code in msec

start value of fastsin(): 16389855
end value of fastsin() : 16392273
delta of fastsin() is : 2418
sine is: 0.434965534111230230000000

Results for javain msec

running time of fastsin() is :647 millisec

fastsin()

0.434965534111230230000000

As You can see java fastsin() is almost four times faster than native code.

bronxzv · ‎06-12-2012

it's difficult for me to trust that the timingsare roughly the same at 1M iterations andvery different at 30M

I'll be interested to see the timings for both implementations with varying iteration counts, simply add an outer loop with 1M increment

Bernard · ‎06-12-2012

I have forgotten to tell You that java fastsin() is tested from within Eclipse IDE and in the past I got different results with the same code tested from console which ran under conhost.exe and from Eclipse IDE.
I rewrote fastsin() as a standalone class and tested it from the console only.Two setting were used
1. -server switch optimized for the speed
2. -client switch optimized for client application
Native fastsin() ran from VS 2010
results for 1e6 iterations -server switch

C:\Program Files\Java\jdk1.7.0\bin>java -server SineFunc
start value : 1339584027637
end value : 1339584027644
running time of fastsin() is :7 milisec
sine 0.434965534111230230000000

results for 1e6 iterations -client switch

C:\Program Files\Java\jdk1.7.0\bin>java -client SineFunc
start value : 1339584169499
end value : 1339584169520
running time of fastsin() is :21 milisec
sine 0.434965534111230230000000

native code results for 1e6 iterations

start value of fastsin(): 24697391
end value of fastsin() : 24697469
delta of fastsin() is : 78
sine is: 0.434965534111230230000000

results for 2 million iterations

java -server switch

C:\Program Files\Java\jdk1.7.0\bin>java -server SineFunc
start value : 1339584894864
end value : 1339584894871
running time of fastsin() is :7 milisec
sine 0.434965534111230230000000

java -client switch

C:\Program Files\Java\jdk1.7.0\bin>java -client SineFunc
start value : 1339584860265
end value : 1339584860304
running time of fastsin() is :39 milisec

native code 2 million iterations

start value of fastsin(): 25231664
end value of fastsin() : 25231804
delta of fastsin() is : 140 milisec
sine is: 0.434965534111230230000000

results for5 million iterations

java -server switch

C:\Program Files\Java\jdk1.7.0\bin>java -server SineFunc
start value : 1339585376561
end value : 1339585376568
running time of fastsin() is :7 milisec
sine 0.434965534111230230000000

java -client switch 5 million iterations

C:\Program Files\Java\jdk1.7.0\bin>java -client SineFunc
start value : 1339585333038
end value : 1339585333135
running time of fastsin() is :97 milisec
sine 0.434965534111230230000000

native code 5 million iterations

start value of fastsin(): 25629607
end value of fastsin() : 25629965
delta of fastsin() is : 358 milisec
sine is: 0.434965534111230230000000

results for 20 million iterations

java -server switch

C:\Program Files\Java\jdk1.7.0\bin>java -server SineFunc
start value : 1339585728253
end value : 1339585728260
running time of fastsin() is :7 milisec
sine 0.434965534111230230000000

java -client switch

C:\Program Files\Java\jdk1.7.0\bin>java -client SineFunc
start value : 1339585809120
end value : 1339585809500
running time of fastsin() is :380 milisec
sine 0.434965534111230230000000

native code

start value of fastsin(): 26029671
end value of fastsin() : 26031153
delta of fastsin() is : 1482 milisec

sine is: 0.434965534111230230000000

bronxzv · ‎06-12-2012

"java -server" scores are obviously bogus, probably due to the way you call your function with a constant (3rd time thatI point it out) which can be optimized in all sort of ways

you were saying that your java/C++ timings are the same at 1M iterations isn't it ? (both 63 ms) it's far from being the case after all

btw you were reporting C++ 1M at 63 ms, and now 78 ms (?),are you sure that you respect basic advices such as disabling speedstep and turbo + ensuring thread afinity to a single core + not running any other software (including antivirus, windows indexation, etc.)

Bernard · ‎06-12-2012

Sorry I did not understand what a constatnt have you been reffering to.Now I have corrected my error and fastsin() calls argument derived from the loop counter.

>>you were saying that your java/C++ timings are the same at 1M iterations isn't it ? (both 63 ms) it's far from being the case after all

Do not relate to this my measurement were wrong (did it with a constant).

Now I have performed another quick set of tests here are the results.I set affinity to single processor ,priority was set to 24(real-time).Everything non-relevant was closed.

java -server 20 million iteration (fastsin() called with reciprocal of loop counter)

C:\Program Files\Java\jdk1.7.0\bin>java -server SineFunc
start value : 1339595370770
end value : 1339595371222
running time of fastsin() is :452 milisec
sine 0.000000000000000000000000

java -client 20 million iterations (fastsin() called with reciprocal of loop counter)

C:\Program Files\Java\jdk1.7.0\bin>java -client SineFunc
start value : 1339595386620
end value : 1339595387533
running time of fastsin() is :913 milisec
sine 0.000000000000000000000000

native code 20 million loop iterations (fastsin() called with reciprocal of loop counter)

start value of fastsin(): 35965687
end value of fastsin() : 35967372
delta of fastsin() is : 1685
sine is: 0.000000000000000000000000

results for 1 million iterations (fastsin() called with reciprocal of loop counter)

java -server

C:\Program Files\Java\jdk1.7.0\bin>java -server SineFunc
start value : 1339596068015
end value : 1339596068045
running time of fastsin() is :30 milisec
sine 0.000000000000000000000000

java -client

C:\Program Files\Java\jdk1.7.0\bin>java -client SineFunc
start value : 1339596081083
end value : 1339596081130
running time of fastsin() is :47 milisec
sine 0.000000000000000000000000

native code

start value of fastsin(): 36452722
end value of fastsin() : 36452800
delta of fastsin() is : 78
sine is: 0.000000000000000000000000

Still as you can see native code is slower maybe probably because of overhead of security cookie beign checked on the stack.

bronxzv · ‎06-12-2012

Now I have performed another quick set of tests here are the results.I set affinity to single processor ,priority was set to 24(real-time).Everything non-relevant was closed.

neat, did you also disable enhanced speedstep in the BIOS (+ turbo if you have it) ? it typically introduces wild variations for short runs

fastsin() called with reciprocal of loop counter

bad idea! division may well be as slow as your full polynomial evaluation, I'll advise to work with precomputed values in an array fitting in the L1D$ + inline the C++ call, in other words do exactly like in my example

Bernard · ‎06-12-2012

I disabled turbo in BIOS.
Your method of testing seems to include overheadof two for-loops i.e 256 million calculations of loop counter.
main calls1e6 timesFastSinTest() and inside this function there is256 calculations of loop counter.Can this add significiant overhead and saturate the running time measurement?
I used compound addition statement as an argument to fastsin() , fastsin() is inlined.

result for native code 1 million iterations.

start value of fastsin(): 39492698
end value of fastsin() : 39492760
delta of fastsin() is : 62
sine is: 0.841470444509448080000000

java -server

C:\Program Files\Java\jdk1.7.0\bin>java -server SineFunc
start value : 1339596068015
end value : 1339596068045
running time of fastsin() is :30 milisec

java -client

C:\Program Files\Java\jdk1.7.0\bin>java -client SineFunc
start value : 1339596081083
end value : 1339596081130
running time of fastsin() is :47 milisec

bronxzv · ‎06-12-2012

used compound addition statement as an argument to fastsin()>

I'm not sure what you mean here, how about posting the code?

Can this add significiant overhead and saturate the running time measurement?

zero overhead I'll say on a modernIntel processor (4-wide issue), GPR increment and effective addresswill be computedin parallel with FPU computations and the branches (with 99+% prediction hit)will be macro fused, the critical path is clearly the polynomial evaluation with its long dependency chain

I'm not sure if you have still your branches (domain check) at the begining of your fastsin(), if you remove them you should start to see timings in clock cycles nearer than mine, besides my CPU with a bit better IPC, probably more consistent scores between JIT and native too

Bernard · ‎06-12-2012

I'm not sure if you have still your branches (domain check) at the begining of your fastsin(), if you remove them you should start to see timings in clock cycles nearer than mine, besides my CPU with a bit better IPC

Removed branches at the beginning of fastsin()and got the same result 63 ms per 1e6 iterations.What was your result?

For example the running time for gamma() stirling approximation implemented as in post #44
for 1e6 iteration with compound addition is 856 milisecond.Can you check this on your machine?

bronxzv · ‎06-12-2012

my latest full example is here: http://software.intel.com/en-us/forums/showpost.php?p=187021

scores were reported here: http://software.intel.com/en-us/forums/showpost.php?p=186968

I suggest that you test it *exactly as is* on your configuration to see how your timingscompare to mine

I'm not in for testing yet another example

Bernard · ‎06-12-2012

Your CPU is more powerful than mine I have core i3 2.226Ghz.I have never been able to reach less than 60ns.
What are your compiler settings?

bronxzv · ‎06-12-2012

Did you really test my example *as is* ? 60 ms/1e6 (i.e. >120 clocks) looks way too high

Which Core i3 ?

We don't use the same compiler AFAIK somy settings aren't very useful for you I'll say, look at the ASM instead and post yours

Bernard · ‎06-12-2012

Yes few days ago I tested your code , but i deleted it.
Core i3 Nehalem architecture.
Tommorow I will post asm code. Now I gotta go to work :).
Thanks for your help.

Bernard · ‎06-13-2012

I'm not sure what you mean here, how about posting the code?

double arg;
arg = 1.0d;
inside loop arg is incremented by floating-point addition of 0.000001
arg += 0.000001;
so inside the for-loop ther is an overhead of addsd instruction andmovsd instr.

asm code for polynomial block approx. of fastsin() function

[bash] 000f0 f2 0f 10 45 88 movsd xmm0, QWORD PTR _rad$[ebp] 000f5 f2 0f 59 45 80 mulsd xmm0, QWORD PTR _sqr$[ebp] 000fa f2 0f 10 4d 80 movsd xmm1, QWORD PTR _sqr$[ebp] 000ff f2 0f 59 4d 90 mulsd xmm1, QWORD PTR _coef11$[ebp] 00104 f2 0f 58 4d 98 addsd xmm1, QWORD PTR _coef10$[ebp] 00109 f2 0f 59 4d 80 mulsd xmm1, QWORD PTR _sqr$[ebp] 0010e f2 0f 58 4d a0 addsd xmm1, QWORD PTR _coef9$[ebp] 00113 f2 0f 59 4d 80 mulsd xmm1, QWORD PTR _sqr$[ebp] 00118 f2 0f 58 4d a8 addsd xmm1, QWORD PTR _coef8$[ebp] 0011d f2 0f 59 4d 80 mulsd xmm1, QWORD PTR _sqr$[ebp] 00122 f2 0f 58 4d b0 addsd xmm1, QWORD PTR _coef7$[ebp] 00127 f2 0f 59 4d 80 mulsd xmm1, QWORD PTR _sqr$[ebp] 0012c f2 0f 58 4d b8 addsd xmm1, QWORD PTR _coef6$[ebp] 00131 f2 0f 59 4d 80 mulsd xmm1, QWORD PTR _sqr$[ebp] 00136 f2 0f 58 4d c0 addsd xmm1, QWORD PTR _coef5$[ebp] 0013b f2 0f 59 4d 80 mulsd xmm1, QWORD PTR _sqr$[ebp] 00140 f2 0f 58 4d c8 addsd xmm1, QWORD PTR _coef4$[ebp] 00145 f2 0f 59 4d 80 mulsd xmm1, QWORD PTR _sqr$[ebp] 0014a f2 0f 58 4d d0 addsd xmm1, QWORD PTR _coef3$[ebp] 0014f f2 0f 59 4d 80 mulsd xmm1, QWORD PTR _sqr$[ebp] 00154 f2 0f 58 4d d8 addsd xmm1, QWORD PTR _coef2$[ebp] 00159 f2 0f 59 4d 80 mulsd xmm1, QWORD PTR _sqr$[ebp] 0015e f2 0f 58 4d e0 addsd xmm1, QWORD PTR _coef1$[ebp] 00163 f2 0f 59 c1 mulsd xmm0, xmm1 00167 f2 0f 58 45 88 addsd xmm0, QWORD PTR _rad$[ebp] 0016c f2 0f 11 45 f8 movsd QWORD PTR _sum$[ebp], xmm0 [/bash]

And look at this fastsin() prolog as you can see int 3 instructions are copied to the buffer.Afaik it is only done in pre-realease code.I think that this accounts for slower exec. speed when compared to java solution

[bash]00000 55 push ebp 00001 8b ec mov ebp, esp 00003 81 ec 80 00 00 00 sub esp, 128 ; 00000080H 00009 57 push edi 0000a 8d 7d 80 lea edi, DWORD PTR [ebp-128] 0000d b9 20 00 00 00 mov ecx, 32 ; 00000020H 00012 b8 cc cc cc cc mov eax, -858993460 ; ccccccccH 00017 f3 ab rep stosd[/bash]

bronxzv · ‎06-14-2012

concerning your prologue code I don't understand why you have it with an inlined call, maybe your "inline" statement is ignored ? do you have a stronger keyword available such as "_forceinline" ?but indeed starting a function with a useless "rep stosd" is not the best thing to doif you want fast code, it's probably a primary source of the poor performance you are enduring so far

anyway, *before startingany performance tuning* the first thing to do is to *compile in release mode*

your compiler keeps loading 'sqr' from the stack instead of using a register for it (when it's used by almost half the instructions, always read only), this looks like another potential source of the lackluster performance, on the other hand the java JIT may well be smarter and allocates a register for it provided thelow register pressure in this example

I see that the computation of 'sqr' is missing in your ASM dump and the useless store to 'rad' isn' shown too, so I suppose there is at least one mulsd + 2 avoidable store instructions not shown here, useless stores are typically bad for performance, next time please post a complete example i.e. the full loop like I do here: http://software.intel.com/en-us/forums/showpost.php?p=186968

all in all your compiler looks pretty weak for optimization and this is the most likely explanation for the java JIT compiled code faster than native code that you experience here

also note this is a perfect example for a JIT compiler since the compilation time of a short program is amortized over several millions iterations of the loop, the JIT compilation overhead is basically zero

SergeyKostrov · ‎06-14-2012

Quoting iliyapolak

>>...Read please this pdf written by W Kahan about awful performance of floating point java implementation
>>...www.cs.berkeley.edu/~wkahan/JAVAhurt.pdf

Hi everybody,

Iliya,

Thanks for the link but, unfortunately,I'mnot impressedwith W. Kahan's work. I think it makes sence
to discuss iton some Java forum, or on'Software Tuning, Performance Optimization & Platform Monitoring' forum.

Best regards,
Sergey

Bernard · ‎06-15-2012

>>anyway, *before startingany performance tuning* the first thing to do is to *compile in release mode*

Of course I did not compile it in release ,hence this pesky debug mode overhead was induced by compiler.

>>I see that the computation of 'sqr' is missing in your ASM dump and the useless store to 'rad' isn' shown too

Useless rad assignment was removed thanks for spotting it.For millions of iteration such a useless store can be costly.

>>all in all your compiler looks pretty weak for optimization and this is the most likely explanation for the java JIT compiled code faster than native code that you experience here

Now I post code which was compiledin release mode.

The results of 10million iterations for release code

running time of fastsin() release code is: 31 milisec
fastsin() is: 0.909297421962549370000000

Full code which includes also main() loop
main()'s for - loop fastsin() call fully inlined inside the loop

[bash]; 23 : int main(void){ 00000 55 push ebp 00001 8b ec mov ebp, esp 00003 83 e4 c0 and esp, -64 ; ffffffc0H 00006 83 ec 30 sub esp, 48 ; 00000030H ; 24 : double e1 = 0; ; 25 : double sine; ; 26 : sine = 0; ; 27 : double gam; ; 28 : gam = 0; ; 29 : double fastgam; ; 30 : fastgam = 0; ; 31 : double arg1; ; 32 : arg1 = 1.0f; 00009 f2 0f 10 05 00 00 00 00 movsd xmm0, QWORD PTR _one 00011 53 push ebx 00012 55 push ebp 00013 56 push esi 00014 57 push edi ; 33 : unsigned int start2,end2; ; 34 : start2 = GetTickCount(); 00015 8b 3d 00 00 00 00 mov edi, DWORD PTR __imp__GetTickCount@0 0001b f2 0f 11 44 24 30 movsd QWORD PTR _arg1$[esp+64], xmm0 00021 ff d7 call edi 00023 f2 0f 10 15 00 00 00 00 movsd xmm2, QWORD PTR __real@3e7ad7f2a0000000 0002b f2 0f 10 25 00 00 00 00 movsd xmm4, QWORD PTR __real@3b4761b41316381a 00033 f2 0f 10 2d 00 00 00 00 movsd xmm5, QWORD PTR __real@3bd71b8ef6dcf572 0003b f2 0f 10 35 00 00 00 00 movsd xmm6, QWORD PTR __real@3c62f49b46814157 00043 f2 0f 10 5c 24 30 movsd xmm3, QWORD PTR _arg1$[esp+64] 00049 8b f0 mov esi, eax 0004b b8 40 42 0f 00 mov eax, 1000000 ; 000f4240H $LL9@main: ; 35 : for(int i2 = 0;i2<10000000;i2++){ 00050 48 dec eax ; 36 : arg1 += 0.0000001f; 00051 f2 0f 58 da addsd xmm3, xmm2 00055 f2 0f 58 da addsd xmm3, xmm2 00059 f2 0f 58 da addsd xmm3, xmm2 0005d f2 0f 58 da addsd xmm3, xmm2 00061 f2 0f 58 da addsd xmm3, xmm2 00065 f2 0f 58 da addsd xmm3, xmm2 00069 f2 0f 58 da addsd xmm3, xmm2 0006d f2 0f 58 da addsd xmm3, xmm2 00071 f2 0f 58 da addsd xmm3, xmm2 00075 f2 0f 58 da addsd xmm3, xmm2 ; 37 : sine = fastsin(arg1); 00079 66 0f 28 cb movapd xmm1, xmm3 0007d f2 0f 59 cb mulsd xmm1, xmm3 00081 66 0f 28 f9 movapd xmm7, xmm1 00085 f2 0f 59 fc mulsd xmm7, xmm4 00089 66 0f 28 c5 movapd xmm0, xmm5 0008d f2 0f 5c c7 subsd xmm0, xmm7 00091 f2 0f 59 c1 mulsd xmm0, xmm1 00095 f2 0f 5c c6 subsd xmm0, xmm6 00099 f2 0f 59 c1 mulsd xmm0, xmm1 0009d f2 0f 58 05 00 00 00 00 addsd xmm0, QWORD PTR __real@3ce952c77030ad4a 000a5 f2 0f 59 c1 mulsd xmm0, xmm1 000a9 f2 0f 5c 05 00 00 00 00 subsd xmm0, QWORD PTR __real@3d6ae7f3e733b81f 000b1 f2 0f 59 c1 mulsd xmm0, xmm1 000b5 f2 0f 58 05 00 00 00 00 addsd xmm0, QWORD PTR __real@3de6124613a86d09 000bd f2 0f 59 c1 mulsd xmm0, xmm1 000c1 f2 0f 5c 05 00 00 00 00 subsd xmm0, QWORD PTR __real@3e5ae64567f544e4 000c9 f2 0f 59 c1 mulsd xmm0, xmm1 000cd f2 0f 58 05 00 00 00 00 addsd xmm0, QWORD PTR __real@3ec71de3a556c734 000d5 f2 0f 59 c1 mulsd xmm0, xmm1 000d9 f2 0f 5c 05 00 00 00 00 subsd xmm0, QWORD PTR __real@3f2a01a01a01a01a 000e1 f2 0f 59 c1 mulsd xmm0, xmm1 000e5 f2 0f 58 05 00 00 00 00 addsd xmm0, QWORD PTR __real@3f81111111111111 000ed f2 0f 59 c1 mulsd xmm0, xmm1 000f1 f2 0f 5c 05 00 00 00 00 subsd xmm0, QWORD PTR __real@3fc5555555555555 000f9 f2 0f 59 cb mulsd xmm1, xmm3 000fd f2 0f 59 c1 mulsd xmm0, xmm1 00101 f2 0f 58 c3 addsd xmm0, xmm3 00105 f2 0f 11 44 24 30 movsd QWORD PTR _sine$[esp+64], xmm0 0010b 0f 85 3f ff ff ff jne $LL9main
[/bash]

Bernard · ‎06-15-2012

Here is the code compiler choses to use x87 instructions

[bash] 00000 dd 44 24 04 fld QWORD PTR _x$[esp-4] 00004 d9 c0 fld ST(0) 00006 d8 c9 fmul ST(0), ST(1) ; 103 : ; 104 : sum = x+x*sqr*(coef1+sqr*(coef2+sqr*(coef3+sqr*(coef4+sqr*(coef5+sqr*(coef6+sqr*(coef7+sqr*(coef8+sqr*(coef9+sqr*(coef10+sqr*(coef11))))))))))); 00008 dd 05 00 00 00 00 fld QWORD PTR __real@3b4761b41316381a 0000e d8 c9 fmul ST(0), ST(1) 00010 dc 2d 00 00 00 00 fsubr QWORD PTR __real@3bd71b8ef6dcf572 00016 d8 c9 fmul ST(0), ST(1) 00018 dc 25 00 00 00 00 fsub QWORD PTR __real@3c62f49b46814157 0001e d8 c9 fmul ST(0), ST(1) 00020 dc 05 00 00 00 00 fadd QWORD PTR __real@3ce952c77030ad4a 00026 d8 c9 fmul ST(0), ST(1) 00028 dc 25 00 00 00 00 fsub QWORD PTR __real@3d6ae7f3e733b81f 0002e d8 c9 fmul ST(0), ST(1) 00030 dc 05 00 00 00 00 fadd QWORD PTR __real@3de6124613a86d09 00036 d8 c9 fmul ST(0), ST(1) 00038 dc 25 00 00 00 00 fsub QWORD PTR __real@3e5ae64567f544e4 0003e d8 c9 fmul ST(0), ST(1) 00040 dc 05 00 00 00 00 fadd QWORD PTR __real@3ec71de3a556c734 00046 d8 c9 fmul ST(0), ST(1) 00048 dc 25 00 00 00 00 fsub QWORD PTR __real@3f2a01a01a01a01a 0004e d8 c9 fmul ST(0), ST(1) 00050 dc 05 00 00 00 00 fadd QWORD PTR __real@3f81111111111111 00056 d8 c9 fmul ST(0), ST(1) 00058 dc 25 00 00 00 00 fsub QWORD PTR __real@3fc5555555555555 0005e d9 c9 fxch ST(1) 00060 d8 ca fmul ST(0), ST(2) 00062 de c9 fmulp ST(1), ST(0) 00064 de c1 faddp ST(1), ST(0) ; 105 : ; 106 : ; 107 : ; 108 : ; 109 : ; 110 : ; 111 : ; 112 : return sum; ; 113 : } 00066 c3 ret 0 ?fastsin@@YANN@Z ENDP ; fastsin[/bash]

bronxzv · ‎06-15-2012

we cansee that this code has still a lot of useless computations : all the addsd mm3,xmm2 at the start of the loop,it's a strange thing to basically multiply the value by 10, it's maybedue to the increment declaredas a floatand that's added to a double ?

31ms for 1e7 iterations, i.e. ~6 cycles per iteration (20xfaster than your previous reports) is too low, I don't see how it canmatch with the code you posted above, I suspect that you reported the timings of the 1e6 x case

bronxzv · ‎06-15-2012

it looks like your compiler do a better job at generating x87 code than SSE2

Bernard · ‎06-15-2012

>>it's maybedue to the increment declaredas a floatand that's added to a double

Yes.

>>31ms for 1e7 iterations, i.e. ~6 cycles per iteration (20xfaster than your previous reports) is too low

This isstrange because when I measured exec. time with 1e6 iterations the delta was 0.Bear in mind that only fastsin() is inlined inside the for-loop andmy other function (Exponential Integral polynomial approx)when compiled with release modeis called from the main andits result is 78 millisec per 1e6 itarations.

start val of fastsin() 19222349
end val of fastsin() 19222349
running time of fastsin() release code is: 0 millisec
fastsin() is: 0.891207360591512180000000

Compiler settings

Zi /nologo /W3 /WX- /O2 /Ob2 /Oi /Ot /Oy /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm- /EHsc /GS- /Gy /arch:SSE2 /fp:precise /Zc:wchar_t /Zc:forScope /Fp"Release\inline.c.pch" /FAcs /Fa"Release" /Fo"Release" /Fd"Release\vc100.pdb" /Gd /analyze- /errorReport:queue

Optimization of sine function's taylor expansion