Hello!
While coding various series expansions of many functions in assembly, I tried to optimize my code mainly by using the rcpps instruction instead of divps. When testing code which calculates the sine function by its Taylor series, I did a few measurements as explained here: "http://software.intel.com/en-us/forums/showthread.php?t=52482", and the result was between 10 and 12 cycles for the first term of the sine expansion (I used 14 terms).
I would like to ask how I can rewrite this code to improve its execution speed.
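[bash]
movups xmm0,argument      ;xmm0 = x
movups xmm1,argument      ;xmm1 = x
mulps  xmm1,xmm1          ;xmm1 = x^2
mulps  xmm1,xmm0          ;xmm1 = x^3
mov    ebx,OFFSET coef
movups xmm2,[ebx]         ;xmm2 = coefficient (the divisor)
rcpps  xmm3,xmm2          ;rcpps used instead of divps
mulps  xmm1,xmm3          ;xmm1 = x^3 * (approximately 1/coef)
subps  xmm0,xmm1          ;xmm0 = x - x^3/coef
[/bash]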
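The listing above replaces a division by the coefficient with rcpps, which only returns an approximate reciprocal (roughly 12 bits). As a rough sketch, and not the poster's code, the same trade-off can be examined from C++ with SSE intrinsics. The helper names `recip_fast`, `recip_refined` and `max_rel_error` are made up for this illustration, an x86 target with SSE is assumed, and one Newton-Raphson step is shown as one common way to recover most of single precision.

```cpp
#include <cmath>
#include <xmmintrin.h>  // SSE intrinsics: _mm_rcp_ps, _mm_mul_ps, _mm_sub_ps, ...

// Approximate 1/d for four floats with RCPPS (roughly 12 bits of precision).
inline __m128 recip_fast(__m128 d) {
    return _mm_rcp_ps(d);
}

// RCPPS followed by one Newton-Raphson step: r1 = r0 * (2 - d * r0).
// Each step roughly doubles the number of correct bits.
inline __m128 recip_refined(__m128 d) {
    const __m128 r0  = _mm_rcp_ps(d);
    const __m128 two = _mm_set1_ps(2.0f);
    return _mm_mul_ps(r0, _mm_sub_ps(two, _mm_mul_ps(d, r0)));
}

// Hypothetical helper: largest relative error over the four lanes,
// measured against a true division. Assumes all lanes of d are positive.
inline float max_rel_error(__m128 approx, __m128 d) {
    float a[4];
    float x[4];
    _mm_storeu_ps(a, approx);
    _mm_storeu_ps(x, d);
    float worst = 0.0f;
    for (int i = 0; i < 4; ++i) {
        const float exact = 1.0f / x[i];
        const float err = std::fabs(a[i] - exact) / exact;
        if (err > worst) {
            worst = err;
        }
    }
    return worst;
}
```

Calling `max_rel_error(recip_fast(d), d)` and `max_rel_error(recip_refined(d), d)` on a vector of factorials such as 3!, 5!, 7!, 9! gives a quick feel for how much accuracy the raw rcpps result gives up.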
>>calls 1e6 times fastsin() the result in millisecond is 63
So it's 63 ns per iteration, or ~120 clocks on your CPU. IIRC, it doesn't match your previous reports.
If you keep only the polynomial (get rid of the strange domain check), you should begin to see timings nearer to mine.
Your code will be not only more readable, but it will also be very easy to test alternate evaluation methods like the one suggested by sirrida here: http://software.intel.com/en-us/forums/showpost.php?p=188457
>>This is a follow up on two Posts #117 and #114. I think you need to disable ALL optimizations in order to measure the overhead of an empty 'for' statement. The Intel C++ compiler could easily "remove" it. Since it didn't and your result was 0, something else was wrong. I'll take a look at it some time this week.
Yes, something could be wrong with this measurement. I think that the overhead in the worst-case scenario is a few cycles for the compare, increment-counter and jump-to-top-of-loop assembly instructions. In real-world testing the CPU should execute this loop control in parallel with the statements inside the loop.
Hi Iliya,
I just completed a simple verification, and with ALL compiler optimizations disabled that code works:
I recommend that you look at the Intel manuals at:
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html?wapkw=Manuals
You can find a manual on optimization techniques there.
Best regards,
Sergey
I think it is because they didn't opt out of the Intel Black Belt Software Developer Program:
http://www.intel.com/software/blackbelt
and every time a member submits a post, the system assigns some number of points.
Best regards,
Sergey
>>and every time when a member submits a post the system assigns some number of points
That means that I have lost many points.
>>I recommend you to look at Intel manuals at
Thanks for the link.
I have this manual. I would also like to recommend a very interesting book, "Inner Loops"; here is the link: http://www.amazon.com/Inner-Loops-Sourcebook-Software-Development/dp/0201479605/ref=sr_1_1?s=books&ie=UTF8&qid=1340854371&sr=1-1&keywords=inner+loops
The book is old (even very old) and still deals with Pentium and Pentium Pro code optimization, but you can find a lot of information there on various code optimization techniques, performed mainly in assembly.
>>just completed a simple verification and when ALL compiler optimizations disabled that code
In my case I probably did not disable optimization, hence the result was 0.
Can you post your result?
>>and every time when a member submits a post the system assigns some number of points
>>That means that I have lost many points.
Yes, unfortunately. I see that you're already in the program!
I don't see enough context to understand how you would want rcpps for expansion of a series with known coefficients.
A speed improvement may be obtained by modifying the Horner scheme e.g.
x + x**3 * (a3 + a5 * x**2 + x**4 * (a7 + a9 * x**2))
so as to make fuller use of the instruction pipeline.
The Ivy Bridge "core I7-3" should speed up those minimax rational approximations, and fma hardware would help significantly with these polynomial evaluations.
>>The rcpps is likely to be useful only where 12 bits precision may be adequate
rcpps was used only when I started to write various function expansions in asm. I simply implemented these formulas almost "as is", i.e. the coefficients were not pre-calculated, and knowing that divps is very slow I used rcpps to calculate the coefficients' reciprocals. I know that rcpps's precision is lacking, so I started to pre-calculate the Taylor series coefficients in Mathematica 8 and implemented the Horner scheme to speed up the running time of my library functions. For example, I was able to achieve a 4x improvement in my gamma Stirling function when compared to the pure iterative method. Here is the link, post #28: http://software.intel.com/en-us/forums/showthread.php?t=106032
>>don't see enough context to understand how you would want rcpps for expansion of a series with known coefficients
As I stated earlier in my post, rcpps was used to calculate the coefficients of various Taylor expansions on the fly. It is clear that rcpps is of no use with pre-calculated coefficients.
So I decided to completely rewrite my assembly-based implementations.
>>...
>>It should not be more than few cycles per iteration when unoptimized by compiler...
>>...
In my tests the 'RDTSC' instruction is mapped to a CRT-function 'CrtClock' using the Hardware Run-Time Abstraction Layer.
[cpp]
...
// Sub-Test 6.2 - Overhead of Empty For Statement
{
///*
	CrtPrintf( RTU("Sub-Test 6.2 - [ Empty For Statement ]\n") );

	RTclock_t ctClock1 = 0;
	RTclock_t ctClock2 = 0;

	ctClock1 = ( RTclock_t )CrtClock();
	for( RTint t = 0; t < 1000000; t++ )
	{
		;
	}
	ctClock2 = ( RTclock_t )CrtClock();

	CrtPrintf( RTU("Sub-Test 6.2 - 1,000,000 iterations - %4ld clock cycles\n"),
		( RTint )( ( RTfloat )( ctClock2 - ctClock1 ) / 1000000 ) );
//*/
}

// Sub-Test 6.3 - Overhead of Empty For Statement
{
///*
	CrtPrintf( RTU("Sub-Test 6.3 - [ Empty For Statement ]\n") );

	RTclock_t ctClock1 = 0;
	RTclock_t ctClock2 = 0;

	ctClock1 = ( RTclock_t )CrtClock();
	for( RTint t = 0; t < 10000000; t++ )
	{
		;
	}
	ctClock2 = ( RTclock_t )CrtClock();

	CrtPrintf( RTU("Sub-Test 6.3 - 10,000,000 iterations - %4ld clock cycles\n"),
		( RTint )( ( RTfloat )( ctClock2 - ctClock1 ) / 10000000 ) );
//*/
}
...
[/cpp]
Output is as follows:
...
Sub-Test 6.2 - [ Empty For Statement ]
Sub-Test 6.2 - 1,000,000 iterations - 5 clock cycles
Sub-Test 6.3 - [ Empty For Statement ]
Sub-Test 6.3 - 10,000,000 iterations - 5 clock cycles
...
I used different numbers of iterations, up to 100,000,000. Results are consistent and always 5 clock cycles.
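For readers without the ScaLib wrappers, a portable approximation of the same experiment can be sketched in standard C++. This is an assumption-laden stand-in, not the code above: it reports wall-clock nanoseconds from `std::chrono::steady_clock` rather than RDTSC cycles, the function name is made up, and the loop counter is made `volatile` so the loop survives even with optimizations enabled (which also adds a load and a store per iteration).

```cpp
#include <chrono>
#include <cstdint>

// Average wall-clock time per iteration of an empty loop, in nanoseconds.
// The volatile counter stops the compiler from deleting the loop, at the
// price of forcing a memory load and store on every iteration.
inline double empty_loop_ns_per_iteration(std::uint32_t iterations) {
    using clock = std::chrono::steady_clock;
    volatile std::uint32_t t = 0;
    const auto start = clock::now();
    for (t = 0; t < iterations; t = t + 1) {
        // intentionally empty
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Multiplying the result by the core clock in GHz gives a rough cycle count that can be compared with the figure reported above, keeping in mind that the volatile accesses make this loop somewhat heavier than a register-only counter.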
>>In my tests 'RDTSC' instruction is mapped to a CRT-function 'CrtClock' using Hardware Run-Time Abstraction Layer
Hi Sergey!
What is the Hardware Run-Time Abstraction Layer? Is it somehow related to the Windows HAL?
What are these types, for example RTint? Is it a macro for the int type?
When you perform all your tests you are using your wrapper library, which wraps the Win API (if I understood properly). Could such a layered approach add significant overhead to the timing of various functions?
>>used a different number of iterations up to 100,000,000. Results are consistent and always 5 clock cycles
Thanks for testing!
The results are exactly as I predicted: a few cycles of overhead. Bear in mind that in a real-world scenario the CPU logic will execute the for-loop statement in parallel with the statements inside the loop.
An interesting scenario will arise when you implement a for-loop based on a floating-point counter with floating-point calculations inside the loop. Here I suppose the CPU logic will interleave the execution of the fp instructions between Port0 and Port1.
>>[Tim] We are still struggling with software implementation of divide
What do you mean by saying this? Are you referring to division algorithms implemented in hardware?
I apologize for getting confused about the sequence of posts and reverting back to the rcpps subject where the thread started.
>>Hi Sergey!
>>In my tests 'RDTSC' instruction is mapped to a CRT-function 'CrtClock' using Hardware Run-Time Abstraction Layer
>>What is Hardware Run-Time Abstraction Layer?
[SergeyK] Please take a look at this thread:
http://software.intel.com/en-us/forums/showthread.php?t=106134&o=a&s=lr
Post #3
>>Does it somehow relate to Windows HAL?
[SergeyK] No. It is an internal feature of the ScaLib project.
I will follow up on the rest of your questions later.
Best regards,
Sergey
>>What are these types for example RTint ? Is it macro for int type?
[SergeyK]
No. RTint is declared as a typedef. Here is a simple fix for you based on macros:
...
#define RTint int
#define RTclock_t clock_t
#define CrtPrintf printf
#define RTU( text ) text
#define SysGetTickCount ::GetTickCount
...
>>When you perform all your tests you are using your wrapper library which wraps WIN API (as I understood properly) could such a layered approach add a significant overhead to the timing of various functions?
[SergeyK]
Yes, and it is applicable to any function, not just to some Win32 API function. In almost 99.99% of my Test-Cases
I use a Win32 API function 'GetTickCount' to measure time intervals. It completely satisfies the project needs.