Hello!
While coding various series expansions of functions in assembly, I tried to optimize my code mainly by using the rcpps instruction instead of divps. When performing tests on code which calculates the sine function by its Taylor series, I did a few measurements as explained here: http://software.intel.com/en-us/forums/showthread.php?t=52482 and the result was between 10-12 cycles per first term of the sine expansion (I used 14 terms).
I would like to ask how I can rewrite this code to improve its execution speed.
[bash]
movups xmm0, argument
movups xmm1, argument
mulps  xmm1, xmm1
mulps  xmm1, xmm0
mov    ebx, OFFSET coef
movups xmm2, [ebx]
rcpps  xmm3, xmm2    ; rcpps used instead of divps
mulps  xmm1, xmm3
subps  xmm0, xmm1
[/bash]
bronxzv: "please don't forget to tell us at the end how much faster the optimized version is than your first naive implementation"
Thanks for the PDF, it was a great read. On average, a few optimized methods (polynomial approximations and pre-calculated coefficients, in Java), when compared to the same methods based on a for-loop with library function calls and divisions, are 5-6 times faster and sometimes even 10 times faster. Now I have to optimize 4000 lines of code :).
Btw, when I have to write code for some esoteric function like the Kelvin function ker(x), it is not so easy to precalculate the coefficients, so a procedure must be written to automate this task based on the for-loop iterations.
with 4x unrolling it's down to 23.6 +/- 0.2 cycles
with vectorization and 4x unrolling we are at 6.0 +/- 0.1 cycles
ASM:
[plain]
.B51.6::                                        ; Preds .B51.6 .B51.5 ; Infreq
        vmovsd    xmm2, QWORD PTR [544+rbp+rax*8]                  ;480.47
        vmulsd    xmm0, xmm2, xmm2                                 ;480.35
        vmulsd    xmm3, xmm0, QWORD PTR [_2il0floatpacket.6991]    ;480.35
        vmulsd    xmm1, xmm2, xmm0                                 ;480.35
        vaddsd    xmm4, xmm3, QWORD PTR [_2il0floatpacket.6990]    ;480.35
        vmulsd    xmm5, xmm0, xmm4                                 ;480.35
        vaddsd    xmm11, xmm5, QWORD PTR [_2il0floatpacket.6989]   ;480.35
        vmulsd    xmm12, xmm0, xmm11                               ;480.35
        vaddsd    xmm13, xmm12, QWORD PTR [_2il0floatpacket.6988]  ;480.35
        vmulsd    xmm14, xmm0, xmm13                               ;480.35
        vaddsd    xmm15, xmm14, QWORD PTR [_2il0floatpacket.6987]  ;480.35
        vmulsd    xmm3, xmm0, xmm15                                ;480.35
        vaddsd    xmm4, xmm3, QWORD PTR [_2il0floatpacket.6986]    ;480.35
        vmulsd    xmm5, xmm0, xmm4                                 ;480.35
        vaddsd    xmm11, xmm5, QWORD PTR [_2il0floatpacket.6985]   ;480.35
        vmulsd    xmm12, xmm0, xmm11                               ;480.35
        vaddsd    xmm13, xmm12, QWORD PTR [_2il0floatpacket.6984]  ;480.35
        vmulsd    xmm14, xmm0, xmm13                               ;480.35
        vaddsd    xmm15, xmm14, QWORD PTR [_2il0floatpacket.6983]  ;480.35
        vmulsd    xmm3, xmm0, xmm15                                ;480.35
        vaddsd    xmm4, xmm3, QWORD PTR [_2il0floatpacket.6982]    ;480.35
        vmulsd    xmm0, xmm0, xmm4                                 ;480.35
        vaddsd    xmm0, xmm0, QWORD PTR [_2il0floatpacket.6981]    ;480.35
        vmulsd    xmm1, xmm1, xmm0                                 ;480.35
        vaddsd    xmm2, xmm2, xmm1                                 ;480.35
        vmovsd    QWORD PTR [2592+rbp+rax*8], xmm2                 ;480.49
        inc       rax                                              ;480.35
        cmp       rax, 256                                         ;480.35
        jl        .B51.6                        ; Prob 82%         ;480.35
[/plain]
Source code:
[cpp]
inline double fastsin(double x)
{
  const double coef1  = -0.16666666666666666666666666666667,
               coef2  =  0.00833333333333333333333333333333,
               coef3  = -1.984126984126984126984126984127e-4,
               coef4  =  2.7557319223985890652557319223986e-6,
               coef5  = -2.5052108385441718775052108385442e-8,
               coef6  =  1.6059043836821614599392377170155e-10,
               coef7  = -7.6471637318198164759011319857881e-13,
               coef8  =  2.8114572543455207631989455830103e-15,
               coef9  = -8.2206352466243297169559812368723e-18,
               coef10 =  1.9572941063391261230847574373505e-20,
               coef11 = -3.8681701706306840377169119315228e-23;
  const double sqr = x*x; // x^2
  return x+x*sqr*(coef1+sqr*(coef2+sqr*(coef3+sqr*(coef4+sqr*(coef5+sqr*(coef6+sqr*(coef7+sqr*(coef8+sqr*(coef9+sqr*(coef10+sqr*(coef11)))))))))));
}

void FastSinTest(const double a[], double b[], int size)
{
  for (int i=0; i<size; i++) b[i] = fastsin(a[i]);
}
[/cpp]
How is the function chrono.getTime() implemented? Does it use inline RDTSC or Win32 GetTickCount()?
The results with vectorization switched off are almost as expected, in the range of 20-30 cycles.
Chrono is based on the Win32 QueryPerformanceFrequency / QueryPerformanceCounter APIs.
Chrono::getTime() returns a time in [ms] with better than 1e-3 ms accuracy.
In this example the total run time is around 1 s, so any wall-clock-time-based stopwatch is OK.
You typically get better than single-clock accuracy when you can execute the studied code a lot of times; RDTSC is better suited for more complex cases, though.
bronxzv,
Today I ran another set of tests on my fastsin() function. The code between the blocks of RDTSC instructions was looped a million times and the delta was divided by the number of loop iterations.
On average the calculated result was 1660 cycles per iteration, which means 745.73 nanoseconds. When I measured only the polynomial approximation (coefficient assignment outside of the loop) I got 1075 cycles per iteration.
I also commented out the code and ran the tests; the result for the same settings was 117 cycles per iteration (probably the loop overhead itself). I performed those tests according to this post: http://software.intel.com/en-us/forums/showthread.php?t=52482.
My only explanation is that RDTSC and CPUID affect the accuracy of the time measurement for an arbitrary number of iterations. Probably the cumulative cost of these instructions is added to the polynomial's execution time and simply distorts the true value.
I ran another set of tests where fastsin() was called 1 million times from main(), and the delta measured by GetTickCount() was 78 milliseconds for 1 million iterations, so one iteration is 0.000078 milliseconds, which is 78 nanoseconds per function call. How were you guys able to achieve more accurate results?
Btw, here is the response from one of Intel's engineers:
"rdtsc
{a small number of instructions with a cumulative latency less than hundreds or thousands of clocks} // AS IN MY CASE
rdtsc
is very unlikely to yield a reliable result. The CPUID step is probably not needed either, although there are other opinions on that point.
The rdtsc instruction is serializing with respect to itself. It is not actually serializing."
A second answer from the same post:
"Measurements with RDTSC are most credible if the number of clocks between the pair of RDTSC instructions is at least a few thousand, preferably tens of thousands."
Without seeing your source code I can't point to the error(s) in your methodology.
Concerning RDTSC, the best thing would be to read the PDF I linked yesterday; it's more recent and far more insightful than the post you are referring to (btw, using RDTSC is plain overkill for such a simple case with a single routine under study).
Since I made the effort to test it and to post the full source code, I'll also suggest compiling my example in your environment; just replace the CPU frequency with yours and plug in your favorite stopwatch.
Here is the snippet of code which calls fastsin() from main().
The result of the calculation was 78 nanoseconds per iteration.
[bash]
double sine = 0;
unsigned int start,end,delta;
start = GetTickCount();
for(int i = 0;i<1000000;i++){
    sine = fastsin(0.45);
}
end = GetTickCount();
printf("start: %10d \n",start);
printf("end  : %10d \n", end);
printf("delta: %d \n", (end-start));
printf("sine : %0.24f \n" ,sine);
[/bash]
The result of the 1-million-iteration loop is 117.514276 cycles per iteration. The delta is 7012024 in hex, 117514276 in dec; the result is 117514276/1000000 = 117.514276.
[bash]
int start_lo,start_hi,time_lo,time_hi,counter;
counter = 1000000;
_asm{
    xor eax,eax
    xor edx,edx
    cpuid
    rdtsc
    mov start_lo,eax
    mov start_hi,edx
}
for(int i = 0;i<counter;i++){
    // measured code here (commented out for the loop-overhead run)
}
_asm{
    rdtsc
    mov time_lo,eax
    mov time_hi,edx
}
[/bash]
Since you call it with a constant, the Intel compiler will probably resolve it at compile time, so I can't use it as is. It will probably also not loop 1 million times, since the same variable is written over and over again and the compiler can see that fastsin() has no global side effects. I'll suggest basing future tests on my code instead.
[bash]
double fastsin(double x){
    double sum = 0;
    double half_pi,zero_arg;
    half_pi  = Pi/2;
    zero_arg = Zero;
    int const Iter = 1000000;
    if(x > half_pi){
        return (x-x)/(x-x);
    }else if (x < zero_arg){
        return (x-x)/(x-x);
    }else{
        int start_lo,start_hi,time_lo,time_hi;
        _asm{
            xor eax,eax
            xor edx,edx
            cpuid
            rdtsc
            mov start_lo,eax
            mov start_hi,edx
        }
        for(int i = 0;i<Iter;i++){
            // polynomial approximation block evaluated here
        }
        _asm{
            rdtsc
            mov time_lo,eax
            mov time_hi,edx
        }
        return sum;
    }
}
[/bash]
I'll adapt my code to use GetTickCount and post it shortly so that we have a common testbed.
[cpp]
inline double fastsin(double x)
{
  const double coef1  = -0.16666666666666666666666666666667,
               coef2  =  0.00833333333333333333333333333333,
               coef3  = -1.984126984126984126984126984127e-4,
               coef4  =  2.7557319223985890652557319223986e-6,
               coef5  = -2.5052108385441718775052108385442e-8,
               coef6  =  1.6059043836821614599392377170155e-10,
               coef7  = -7.6471637318198164759011319857881e-13,
               coef8  =  2.8114572543455207631989455830103e-15,
               coef9  = -8.2206352466243297169559812368723e-18,
               coef10 =  1.9572941063391261230847574373505e-20,
               coef11 = -3.8681701706306840377169119315228e-23;
  const double sqr = x*x; // x^2
  return x+x*sqr*(coef1+sqr*(coef2+sqr*(coef3+sqr*(coef4+sqr*(coef5+sqr*(coef6+sqr*(coef7+sqr*(coef8+sqr*(coef9+sqr*(coef10+sqr*(coef11)))))))))));
}

void FastSinTest(const double a[], double b[], int size)
{
  #pragma unroll(4)
  for (int i=0; i<size; i++) b[i] = fastsin(a[i]);
}

const int size = 256, nrOfTests = 1000000;
double a[size],b[size];

int main()
{
  for (int i=0; i<size; i++) a[i] = i*(1.0/size);      // sample the domain
  unsigned int start = GetTickCount();
  for (int i=0; i<nrOfTests; i++) FastSinTest(a,b,size);
  unsigned int delta = GetTickCount()-start;
  printf("delta: %u ms\n",delta);
  return 0;
}
[/cpp]
fastsin() is called once from main(), and inside the function the polynomial block is executed 1e6 times. I never coded it to be called 1e12 times. When I tested it from main(), the polynomial block was not inside the loop. I simply posted two different test methods and you wrongly assumed that it ran 1e12 times; that is not the case.
printf is called after the end of the _asm block, and the overhead of the full trip to kernel mode multiplied by 1e6 values to be displayed is a lot more than a few hundred cycles per iteration.
Here is the main() last test: I got 1700 cycles, which probably contains the RDTSC+CPUID overhead.
So, in other words, you have not posted a complete example using GetTickCount, so it's difficult to say what you are doing wrong. More generally, without full source code that all peers can compile we lose a lot of time in misunderstandings (read: I'm getting bored).
start value: 19839910
end value : 19839973
delta is : 63
sine is: 0.434965534111230230000000
Code snippet:
[bash]
int main(void){
    unsigned int const num_Iter = 1000000;
    unsigned int end,start,delta,i;
    double sine;
    sine = 0;
    start = GetTickCount();
    for(i = 0;i<num_Iter;i++){
        sine = fastsin(0.45);
    }
    end = GetTickCount();
    delta = end-start;
    printf("start value: %d \n",start);
    printf("end value  : %d \n",end);
    printf("delta is   : %d \n",delta);
    printf("sine is    : %0.24f \n",sine);
    return 0;
}
[/bash]
"calls fastsin() 1e6 times, the result in milliseconds is 63"
So it's 63 ns per iteration, or ~120 clocks on your CPU; it doesn't match your previous reports IIRC.
If you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer to mine.
Iliya, your measured time, that is 63 milliseconds, for the 'fastsine' function doesn't take into account some overhead for the 'for' loop.
Could you try the following scheme? This is pseudo code:
...
OverheadOfForStatement = 0
FastSineSpeedInMs = 0
Start = 0
End = 0
NumOfIterations = 2^22
ChangeProcessPriorityToRealTime()
Start = ::GetTickCount()
for( i = 0; i < NumOfIterations; i++ ) // Verify that a C++ compiler keeps the empty loop!
{
}
End = ::GetTickCount()
OverheadOfForStatement = (End - Start)
Start = ::GetTickCount()
for( i = 0;i < NumOfIterations; i++)
{
FastSine(...)
}
End = ::GetTickCount()
FastSineSpeedInMs = (End - Start - OverheadOfForStatement) / NumOfIterations
ChangeProcessPriorityToNormal()
...
Best regards,
Sergey
Tomorrow I will start posting more comprehensive results from my tests. I will compare Java methods for calculating various math functions against native-code implementations of the same functions.
"Iliya, your measured time, that is 63 milliseconds, for the 'fastsine' function doesn't take into account some overhead for the 'for' loop. Could you try the following scheme? This is pseudo code:"
Results of loop-overhead testing:
As you can see, I cannot measure the loop overhead; moreover, I also checked with the debugger that the empty for-loop is executed. Priority was set with the help of Process Explorer. The loop's assembly instructions can be counted, so the overhead is the sum of a few x86 instructions like add, cmp, and a conditional jump, and should be no more than a few cycles per iteration.
[bash]
start value of loop_overhead : 5600529
end value of loop_overhead   : 5600529
delta of loop_overhead is    : 0
start value of fastsin()     : 5600529
end value of fastsin()       : 5600607
delta of fastsin() and loop_overhead is : 78
sine is: 0.434965534111230230000000
[/bash]
Today I tested the execution time of the Java version of my fastsin() function. Two timing methods were used:
1. The polynomial approximation block's delta calculated directly with nanosecond resolution, as returned by System.nanoTime(); the result was 906 nanosec.
2. A one-million-iteration for-loop with the fastsin() call inside the loop, the delta measured by System.currentTimeMillis(); the result was 63 millisec, exactly the same as the native-code time delta.
The slow execution time of the first method is, I think, probably related to the possibility of System.nanoTime() being binary-translated by the JIT compiler to the RDTSC instruction.
How could it be possible that the native-code fastsin() and the Java fastsin() both have the same speed of execution? Even if the Java JIT recognizes and optimizes hot-spots by caching binary-translated code.
[bash]
double value55 = 0; // JAVA version test for fastsin() method
long start1,end1;
final int max_Iter = 1000000;
start1 = System.currentTimeMillis();
for(int k = 0; k < max_Iter;k++){
    value55 = func.fastsin(0.45);
}
end1 = System.currentTimeMillis();
System.out.println("running time of fastsin() is :" + (end1-start1) + "\tmillisec");
System.out.printf("%15s \n", "fastsin()");
System.out.printf("%.24f \n",value55);
[/bash]
RESULT:
running time of fastsin() is :63 millisec
fastsin()
0.434965534111230230000000
"How could it be possible that the native-code fastsin() and the Java fastsin() both have the same speed of execution? Even if the Java JIT recognizes and optimizes hot-spots by caching binary-translated code."
I'll not rule out the fact that the JIT is able to do well on such a simple example.
Now, you are using a stopwatch with 1 ms resolution, so it's not precise enough to compare timings of around 63 ms (~3% max potential error). To improve the situation:
1) increase the iteration count so that it runs for roughly 2 s, i.e. ~30x more iterations; you'll then have 0.1% max error
and/or
2) use a more precise stopwatch based on QueryPerformanceCounter
Also, as previously noted, I don't think that calling it with a constant is the best idea; instead I'll advise sampling the domain like in the example I provided the other day.