Solved: Optimization of sine function's taylor expansion - Page 10

Bernard · ‎05-24-2012

Hello!
While coding in assembly various series expansions of many functions i tried to optimize my code mainly by using rcpps instruction instead of divaps.When performing tests oncode which calculates sine function by taylor series i did a few measurements as explained here :"http://software.intel.com/en-us/forums/showthread.php?t=52482" and the result was between 10-12 cycles per first term of sine expansion(i used 14 terms).
I would like to ask you how can i rewrite this code in order to gain speed of execution improvment.
[bash]movups xmm0,argument movups xmm1,argument mulps xmm1,xmm1 mulps xmm1,xmm0 mov ebx,OFFSET coef movups xmm2,[ebx] rcpps xmm3,xmm2 ;rcpps used instead of divaps mulps xmm1,xmm3 subps xmm0,xmm1[/bash]

bronxzv · ‎06-08-2012

calls 1e6 times fastsin() the result in millisecond is 63

so it's 63 ns per iteration or ~ 120 clocks on your CPU, it does't match your previous reports IIRC

if you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer than mine

View solution in original post

bronxzv · ‎06-27-2012

once you have refactored your code to use basic building blockslike EvalPoly5() here http://software.intel.com/en-us/forums/showpost.php?p=186020
your code will be not only more readable but it will be very easy to test alternate evaluation methods like the one suggested by sirrida herehttp://software.intel.com/en-us/forums/showpost.php?p=188457

Bernard · ‎06-27-2012

Interesting what were the values which caused such a great variation in cpi of fsin.

Bernard · ‎06-27-2012

Thanks to this forum and you guys I know how to perform basic code optimization.Refactored code is a lot of faster and more readable and elegant when compared to my previous inefficient and ugly coding style. Off topic question.How is it possible that every newn member already registered has some points assigned to him.

SergeyKostrov · ‎06-27-2012

Quoting iliyapolak

This is a follow up on two Posts #117 and #114. I think you need to disable ALL optimizations in order to measure an overhead of
an empty 'for' statement. Intel C++ compiler could easily "remove" it. Since itdidn't andyour result was 0 something else
was wrong. I'll take a look at it some time this week
Yes something could be wrong with this measurement.I think that overhead in the worst-case scenario is a few cycles of compare to sth, inc counter and jmp to top_loop assembly instructions.In real world testing CPU should execute such a loop inparallel with inside the loop statements.

Hi Iliya,

I just completed a simple verification and when ALL compiler optimizations disabled that code works:

[bash] ... // Sub-Test 6 - Overhead of Empty For Statement { ///* CrtPrintf( RTU("Sub-Test 6 - [ Empty For Statement ]n") ); g_uiTicksStart = SysGetTickCount(); for( RTint t = 0; t < 100000000; t++ ) { ; } CrtPrintf( RTU("Sub-Test 6 - 100,000,000 iterations - %4ld msn"), ( RTint )( SysGetTickCount() - g_uiTicksStart ) ); //*/ } ... [/bash]

SergeyKostrov · ‎06-27-2012

Quoting iliyapolak

Thanks to this forum and you guys I know how to perform basic code optimization...

I recommend you to look at Intel manuals at:

http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html?wapkw=Manuals

You could find there a manual on optimization techniques.

Best regards,
Sergey

SergeyKostrov · ‎06-27-2012

Quoting iliyapolak

...How is it possible that every newn member already registered has some points assigned to him.

I think because they didn't opt-out from the Intel Black Belt Software Developer Program:

http://www.intel.com/software/blackbelt

and every time when a member submits a postthe system assignssome number of points.

Best regards,
Sergey

Bernard · ‎06-27-2012

and every time when a member submits a postthe system assignssome number of points

That means that I have lost many points.

Bernard · ‎06-27-2012

I recommend you to look at Intel manuals at

Thanks for the link.
I have this manual.I would also like to recommend you a very interesting book"Inner Loops",here is link http://www.amazon.com/Inner-Loops-Sourcebook-Software-Development/dp/0201479605/ref=sr_1_1?s=books&ie=UTF8&qid=1340854371&sr=1-1&keywords=inner+loops

The book is old (even very old)is still dealing with Pentium and Pentium Pro code optimization,but you can find there a lot of info regarding various code optimization techniques performed mainly in assembly.

Bernard · ‎06-27-2012

just completed a simple verification and when ALL compiler optimizations disabled that code

In my case I probably did not disable optimization hence the result was 0.
Can you post your result?

SergeyKostrov · ‎06-28-2012

Quoting iliyapolak

and every time when a member submits a postthe system assignssome number of points
That means that I have lost many points.

Yes, unfortunately. I see that you're already in the program!

Bernard · ‎06-28-2012

I could have had a brown belt.Now working hard to regain my lost points.

TimP · ‎06-28-2012

The rcpps is likely to be useful only where 12 bits precision may be adequate. It's advocated with iterative improvement for "throughput" optimization where you have independent calculations pending which would be stalled during the time the fpu is occupied by divide. Core I7-3 cuts that stall time in half, so you should be aware that you may be optimizing for a past architecture. We are still struggling with software implementation of divide.
I don't see enough context to understand how you would want rcpps for expansion of a series with known coefficients.
A speed improvement may be obtained by modifying the Horner scheme e.g.
x + x**3 * (a3 + a5 * x**2 + x**4 * (a7 + a9 * x**2))
so as to make fuller use of the instruction pipeline.
The Ivy Bridge "core I7-3" should speed up those minimax rational approximations, and fma hardware would help significantly with these polynomial evaluations.

Bernard · ‎06-28-2012

The rcpps is likely to be useful only where 12 bits precision may be adequate

rcpps was used only when I started to write in asm various functions expansions.I simply implemented these formulas almost "as is" i.e coefficients were not pre-calculatedand knowing that divaps is very slow I used rcppsto calculate the coefficients's reciprocals.I know that rcpps's precision is lacking so Istarted to pre-calculate taylor series coefficients in Mathematica 8 and implemented Horner scheme to speed up the running time of my library functions.For example I was able to achieve 4x improvment in my gamma stirling function when compared to pure iterative method.Here is thelink post #28 http://software.intel.com/en-us/forums/showthread.php?t=106032

>>don't see enough context to understand how you would want rcpps for expansion of a series with known coefficients

As I stated earlier in my post rcpps was used to calculate on-the-fly coefficients of various taylor expansions.It is clear that rcpps can not be used with pre-calculated coefficients.
So I decided tocompletely rewrite my asembly based implementations.

SergeyKostrov · ‎06-28-2012

Related to an issue with measuring an overhead ofempty 'for' statement.

>>...
>>It should not be more than few cycles per iteration when unoptimized by compiler...
>>...

In my tests 'RDTSC' instruction is mapped to a CRT-function 'CrtClock' using Hardware Run-Time Abstraction Layer.

[cpp]... // Sub-Test 6.2 - Overhead of Empty For Statement { ///* CrtPrintf( RTU("Sub-Test 6.2 - [ Empty For Statement ]n") ); RTclock_t ctClock1 = 0; RTclock_t ctClock2 = 0; ctClock1 = ( RTclock_t )CrtClock(); for( RTint t = 0; t < 1000000; t++ ) { ; } ctClock2 = ( RTclock_t )CrtClock(); CrtPrintf( RTU("Sub-Test 6.2 - 1,000,000 iterations - %4ld clock cyclesn"), ( RTint )( ( RTfloat )( ctClock2 - ctClock1 ) / 1000000 ) ); //*/ } // Sub-Test 6.3 - Overhead of Empty For Statement { ///* CrtPrintf( RTU("Sub-Test 6.3 - [ Empty For Statement ]n") ); RTclock_t ctClock1 = 0; RTclock_t ctClock2 = 0; ctClock1 = ( RTclock_t )CrtClock(); for( RTint t = 0; t < 10000000; t++ ) { ; } ctClock2 = ( RTclock_t )CrtClock(); CrtPrintf( RTU("Sub-Test 6.3 - 10,000,000 iterations - %4ld clock cyclesn"), ( RTint )( ( RTfloat )( ctClock2 - ctClock1 ) / 10000000 ) ); //*/ } ... [/cpp]

Output is as follows:

...
Sub-Test 6.2 - [ Empty For Statement ]
Sub-Test 6.2 - 1,000,000 iterations - 5 clock cycles
Sub-Test 6.3 - [ Empty For Statement ]
Sub-Test 6.3 - 10,000,000 iterations - 5 clock cycles
...

I used adifferent number of iterations up to 100,000,000.Results are consistent andalways 5 clock cycles.

Bernard · ‎06-28-2012

In my tests 'RDTSC' instruction is mapped to a CRT-function 'CrtClock' using Hardware Run-Time Abstraction Layer

Hi Sergey!

What is Hardware Run-Time Abstraction Layer?Does itis somehow relate to Windows HAL?
What are these types for example RTint ? Is it macro for int type?
When you perform all your tests you are using your wrapper library which wraps WIN API(as I understood properly) could such a layered approach add a significant overhead to the timing of various functions?

Bernard · ‎06-28-2012

used adifferent number of iterations up to 100,000,000.Results are consistent andalways 5 clock cycles

Thanks for testing!
The results are exactly as I predicted a few cycles of overhead.Bear in mind that in real-world scenario CPU logic will execute for-loop statement in parallel with inside the loop statements.
The interesting scenario will arise when you will implement a for-loop based on floating-point counter and inside-loop floating-point calculations.Here CPU logic I suppose will interleave the execution of fpinstruction beetwen Port0 and Port1.

Bernard · ‎06-28-2012

Tim

We are still struggling with software implementation of divide

What do you mean by saying this?Are you refering to division algorithms implemented in hardware?

TimP · ‎06-29-2012

The Intel compilers have default options -no-prec-div -no-prec-sqrt where, in some contexts, rather than using an IEEE accurate hardware instruction (or in some cases library function), there is explicit iteration in the generated code, starting with rcpps or rsqrtps. The Harpertown CPU made a big improvement in the efficiency of the IEEE accurate hardware instructions, and the Ivy Bridge core I7-3 is even better (aside from splitting the AVX-256 into sequenced 128-bit instructions). Yet we still have these iterative sequences.
I apologize for getting confused about the sequence of posts and reverting back to the rcpps subject where the thread started.

SergeyKostrov · ‎06-29-2012

Quoting iliyapolak

In my tests 'RDTSC' instruction is mapped to a CRT-function 'CrtClock' using Hardware Run-Time Abstraction Layer

Hi Sergey!

What is Hardware Run-Time Abstraction Layer?

[SergeyK] Please take a look at a thread:
http://software.intel.com/en-us/forums/showthread.php?t=106134&o=a&s=lr
Post #3

Does itis somehow relate to Windows HAL?

[SergeyK] No. It is an internal feature of the ScaLib project.

I will follow up on all the rest your questions later.

Best regards,
Sergey

Bernard · ‎06-29-2012

Sergey Could you explain how is implemented in software Run-Time HAL.I thin that the possible implementation would be to wrap inline assembly in custom library functions.

SergeyKostrov · ‎06-29-2012

Quoting iliyapolak

...
What are these types for example RTint ? Is it macro for int type?

[SergeyK]
No. RTint is declared asa typedef. Here is a simple fix for youbased onmacros:

...
#define RTint int
#define RTclock_t clock_t
#define CrtPrintf printf
#define RTU( text ) text
#define SysGetTickCount ::GetTickCount
...

When you perform all your tests you are using your wrapper library which wraps WIN API(as I understood properly) could such a layered approach add a significant overhead to the timing of various functions?

[SergeyK]
Yes and it is applicable toany function, not just to some Win32 API function. In almost 99.99% ofmy Test-Cases
I use a Win32 API function 'GetTickCount' to measure time intervals. It completely satisfies the project needs.