While coding series expansions of various functions in assembly, I tried to optimize my code, mainly by using the rcpps instruction instead of divps. When testing code that calculates the sine function by Taylor series, I took a few measurements as explained here: "http://software.intel.com/en-us/forums/showthread.php?t=52482", and the result was between 10 and 12 cycles for the first term of the sine expansion (I used 14 terms).
I would like to ask how I can rewrite this code to improve its execution speed.
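[bash]
movups xmm0,argument    ;xmm0 = x
movups xmm1,argument    ;xmm1 = x
mulps xmm1,xmm1         ;xmm1 = x^2
mulps xmm1,xmm0         ;xmm1 = x^3
mov ebx,OFFSET coef     ;ebx = address of the coefficient table
movups xmm2,[ebx]       ;xmm2 = coefficients
rcpps xmm3,xmm2         ;xmm3 ~ 1/coef, rcpps used instead of divps
mulps xmm1,xmm3         ;xmm1 = x^3 * (1/coef)
subps xmm0,xmm1         ;xmm0 = x - x^3/coef
[/bash]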
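One common way to speed up code like this, sketched here in scalar C rather than assembly, is to avoid the reciprocal at run time altogether: keep the reciprocals of the factorials as constants and evaluate the series with Horner's scheme, so each term costs one multiply and one subtract. This is only an illustration under assumptions. The names `INV_FACT`, `TAYLOR_TERMS` and `taylor_sin`, and the choice of 8 terms, are hypothetical and not taken from the post, and the argument is assumed to be already reduced to about [-pi/2, pi/2].

```c
#include <assert.h>

/* Hypothetical table: reciprocals of 1!, 3!, 5!, ..., 15!.
   The divisions are constant expressions, so the compiler folds them;
   no division or rcpps is executed at run time. */
#define TAYLOR_TERMS 8
static const double INV_FACT[TAYLOR_TERMS] = {
    1.0,
    1.0 / 6.0,
    1.0 / 120.0,
    1.0 / 5040.0,
    1.0 / 362880.0,
    1.0 / 39916800.0,
    1.0 / 6227020800.0,
    1.0 / 1307674368000.0
};

/* sin(x) ~ x * (c0 - x^2 * (c1 - x^2 * (c2 - ...))), with ck = 1/(2k+1)!.
   Assumption: x has already been range-reduced to about [-pi/2, pi/2]. */
static double taylor_sin(double x)
{
    assert(x >= -1.6 && x <= 1.6);
    double x2 = x * x;
    double acc = 0.0;
    /* Horner's scheme, starting from the highest-order coefficient. */
    for (int k = TAYLOR_TERMS - 1; k >= 0; --k) {
        acc = INV_FACT[k] - x2 * acc;
    }
    return x * acc;
}
```

The same structure maps onto packed SSE code: the loop body becomes one mulps and one subps per term, with the constants loaded from memory. Note also that rcpps returns only an approximation (Intel documents a relative error of at most 1.5 * 2^-12), so exact precomputed reciprocals help accuracy as well as speed.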
So it's 63 ns per iteration, or ~120 clocks on your CPU. IIRC, that doesn't match your previous reports.
It calls fastsin() 1e6 times; the result is 63 milliseconds.
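The arithmetic behind those figures can be sketched as follows. The helper names `ns_per_call`, `ns_to_cycles` and `implied_ghz` are hypothetical. The CPU frequency is not stated in the thread; 63 ns and ~120 clocks together merely imply a core clock of roughly 1.9 GHz.

```c
#include <assert.h>

/* Average time of one call in nanoseconds, from a total in milliseconds. */
static double ns_per_call(double elapsed_ms, long calls)
{
    assert(calls > 0);
    return elapsed_ms * 1.0e6 / (double)calls;   /* 1 ms = 1e6 ns */
}

/* Clock cycles spent in a given time; a frequency in GHz is cycles per ns. */
static double ns_to_cycles(double ns, double ghz)
{
    return ns * ghz;
}

/* Core frequency in GHz implied by a (nanoseconds, cycles) pair. */
static double implied_ghz(double ns, double cycles)
{
    assert(ns > 0.0);
    return cycles / ns;
}
```

With the numbers from the thread, `ns_per_call(63.0, 1000000)` gives 63 ns per call, and `implied_ghz(63.0, 120.0)` gives about 1.9 GHz.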
I found this post: "http://software.intel.com/en-us/forums/showthread.php?t=74354". One of the Intel engineers stated that the compiler does not use x87 instructions, and you stated that the math libraries do not use x87 instructions either.
I would like to ask which approximation method can be used for high-precision, vectorizable code that approximates functions.
Sorry, but I am not familiar with the Chebyshev series expansion.