Intel® ISA Extensions

Optimization and execution-speed testing of a gamma Stirling approximation

Bernard
Valued Contributor I
Hi everybody!

My previous thread on sine function optimization grew very large, so Sergey Kostrov asked me to create new threads for my other function implementations.
I'm starting with the gamma Stirling approximation. The code is based on the formula taken from "Handbook of Mathematical Functions", p. 257, formula 6.1.37.
The function was tested against the book's table and later against Mathematica 8, which is used as the reference model for the accuracy comparison.

There are two implementations.
The first is based on 1-D arrays, library pow() calls, and division inside a for-loop
to calculate the coefficients and sum the series expansion. It uses 14 terms of the Stirling expansion; the second uses 10 terms. Regarding the slower implementation: even with more terms it was not able to increase the accuracy when compared to the Mathematica 8 reference result.
The second method is faster because filling and summing the coefficient arrays is eliminated,
and pow() library calls are not used. Instead the function relies on a simple calculation of
consecutive powers of the argument x multiplied by the constant denominator and divided by the numerator.
Pre-calculation of the coefficients was not possible because of the dependency of the denominator on the argument x.
The accuracy when compared to the Mathematica code is 15-16 decimal places.
Feel free to optimize and test this code, and please do not forget to post your opinion.

Code for the C implementation of the Stirling gamma function (slower version):


[bash]
inline double gamma(double x)
{
    if (x > gamma_huge || x < one)
        return (x - x) / (x - x); // NaN for arguments outside [one, gamma_huge]

    // Numerators and denominators of the Stirling-series terms,
    // "Handbook of Mathematical Functions", formula 6.1.37
    double coef1[] = { 1.0, 1.0, 139.0, 571.0, 163879.0,
                       5246819.0, 534703531.0, 4483131259.0,
                       432261921612371.0, 6232523202521089.0,
                       25834629665134204969.0, 1579029138854919086429.0,
                       746590869962651602203151.0, 1511513601028097903631961.0 };
    double coef2[] = { 12.0, 288.0, 51840.0, 2488320.0, 209018880.0,
                       75246796800.0, 902961561600.0, 86684309913600.0,
                       514904800886784000.0, 86504006548979712000.0,
                       13494625021640835072000.0, 9716130015581401251840000.0,
                       116593560186976815022080000.0, 2798245444487443560529920000.0 };
    int const size = 14;

    // The loops of the original post were cut off by the forum software;
    // this reconstruction sums the series directly.  The terms of 6.1.37
    // carry signs in the pattern +,+,-,-,... which the tables above omit,
    // so they are reapplied here.
    double sum = 1.0;
    for (int i = 0; i < size; i++) {
        double sign = ((i >> 1) & 1) ? -1.0 : 1.0;
        sum += sign * coef1[i] / (coef2[i] * pow(x, (double)(i + 1)));
    }

    double two_pi = 2 * Pi;
    return exp(-x) * pow(x, x - 0.5) * sqrt(two_pi) * sum;
}
[/bash]
Results of 1e6 iterations of gamma(), called with an argument incremented by 0.000001. Microsoft compiler, optimized:

gamma() start value is: 13214329
gamma() end value is: 13214688
execution time of gamma() 1e6 iterations is: 359 millisec
gamma 1.999999997453735200000000

Intel Compiler optimized for speed

gamma() start value is: 24295767
gamma() end value is: 24296016
execution time of gamma() 1e6 iterations is: 249 millisec
gamma 1.999999997453735200000000

Intel compiler settings

/Zi /nologo /W3 /MP /O2 /Ob2 /Oi /Ot /Oy /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /GS- /Gy /arch:SSE2 /fp:precise /QxHost /Zc:wchar_t /Zc:forScope /Fp"Release\\inline.c.pch" /FAcs /Fa"Release\\" /Fo"Release\\" /Fd"Release\\vc100.pdb" /Gd /TP

bronxzv
New Contributor II

Regarding precision: do you use double-precision floating-point values in your project? Because AFAIK the Windows 7 rendering stack and current display hardware cannot work with more than 10-12 bits per colour component.

mostly single precision, but double precision is needed in some cases; think of the traversal of a deep scene graph, for example: in a big model the camera movements will be jerky when close to a small detail with single precision
final LDR images (low dynamic range) aren't a good indicator of the precision required for shading; imagine a bright area behind a dark translucent surface. Thus all intermediate shading computations are done with HDR images, and only at the end do we do the final tone mapping. This is also the standard solution for modern GPU-based realtime renderers

How do you calculate more esoteric functions like BSDFs, which can be physically based and involve calculating an integral (as described in the "Physically Based Rendering" book)?
rational approximations and LUT-based methods (using look-up tables); AVX2 gather instructions will come in handy to speed up the LUT-based ones
Bernard
Valued Contributor I

final LDR images (low dynamic range) aren't a good indicator of the precision required for shading; imagine a bright area behind a dark translucent surface. Thus all intermediate shading computations are done with HDR images, and only at the end do we do the final tone mapping. This is also the standard solution for modern GPU-based realtime renderers


As I understand it, for internal processing you use double precision, which is converted to single precision for the final result displayed on 10-12 bit RGBA hardware.
bronxzv
New Contributor II

As I understand it, for internal processing you use double precision, which is converted to single precision for the final result displayed on 10-12 bit RGBA hardware.

no, all shading computations are in single precision (good enough quality and twice the throughput of double precision; note also that 16 bits per component is already considered HDR). Only some geometry transforms are in double precision. I'd say that for the renderer it's 99% float (SP) and 1% double (DP) overall

the tone mapping operator is also float to float; only the very final stage (gamma conversion + decimation) is float to int
Bernard
Valued Contributor I
So the duty, or perhaps role, of the GPU in your project KRIBI 3D is reduced to copying the rendering context to the device's frame buffer with the help of DirectX and displaying it?
bronxzv
New Contributor II
exactly, there is a cool picture of this pure software rendering approach at slide 62 of this seminal presentation:
http://graphics.cs.williams.edu/archive/SweeneyHPG2009/TimHPG2009.pdf
Bernard
Valued Contributor I
It was a great read, a very insightful article. I think that in the near future the pure software approach will take over from the hybrid software-hardware approach, which would mean the "death" of Nvidia.
The arrival of an architecture like MIC, which can operate much like a typical CPU and will probably be used as a large co-processor to offload vector-intensive floating-point calculations from the main CPU, together with Nvidia's inability to enter the x86 market and threaten Intel or even AMD, probably means that Nvidia will disappear like 3dfx.
What is your opinion on this?
bronxzv
New Contributor II
IIRC nVIDIA's roadmap for exascale has very ambitious designs with ARM cores and massive SIMD FP arrays, so it's a bit early to count them out. I was also among the many people predicting the unavoidable demise of Apple in the 90s; see how wrong we were...

provided that our engine is 100% based on high-level C++ code, there is no reason to keep it x86-only, depending on future market realities
Bernard
Valued Contributor I
Testing and implementation of fastgamma3(), which is based on a minimax approximation for the range 0.01 to 1.0 and a Horner-scheme Stirling series for the range 1.0 to 160.0.

Here is the code:

[bash]
inline double fastgamma3(double x)
{
    double result, sum, num, denom;
    result = 0;
    sum = 0;
    if (x >= 0.01f && x <= one) {
        // Minimax rational approximation calculated by Mathematica 8
        double const coef1 = 6.69569585833067770821885e+6;
        double const coef2 = 407735.985300921332020398;
        double const coef3 = 1.29142492667105836457693e+6;
        double const coef4 = 1.00000000000000000000000000e+00;
        double const coef5 = 6.69558099277749024219574e+6;
        double const coef6 = 4.27571696102861619139483e+6;
        double const coef7 = -2.89391642413453042503323e+6;
        double const coef8 = 317457.367152592609873458;
        num   = coef1 + x*(coef2 + x*(coef3));
        denom = coef4 + x*(coef5 + x*(coef6 + x*(coef7 + x*(coef8))));
        return num/denom;
    } else if (x >= one && x <= gamma_huge) {
        // Stirling-series coefficients, summed by Horner's scheme in 1/x
        double const coef_1  =  0.08333333333333333333333333;
        double const coef_2  =  0.00347222222222222222222222;
        double const coef_3  = -0.00268132716049382716049383;
        double const coef_4  = -0.000229472093621399176954733;
        double const coef_5  =  0.000784039221720066627474035;
        double const coef_6  =  0.0000697281375836585777429399;
        double const coef_7  = -0.000592166437353693882864836;
        double const coef_8  = -0.0000517179090826059219337058;
        double const coef_9  =  0.000839498720672087279993358;
        double const coef_10 =  0.0000720489541602001055908572;
        double two_pi  = 2*Pi;
        double invx    = 1/x;
        double expx    = exp(-x);          // e^-x factor
        double arg     = x - 0.5;
        double power   = pow(x, arg);      // x^(x-1/2)
        double pi_sqrt = sqrt(two_pi);
        sum = expx*power*pi_sqrt;
        result = one + invx*(coef_1 + invx*(coef_2 + invx*(coef_3 + invx*(coef_4
               + invx*(coef_5 + invx*(coef_6 + invx*(coef_7 + invx*(coef_8
               + invx*(coef_9 + invx*(coef_10))))))))));
    }
    return sum*result;                     // 0 for out-of-range arguments
}
[/bash]

Speed of execution for the first branch (minimax approximation), 1e6 iterations:

fastgamma3() start value is 25488363
fastgamma3() end value is 25488379
execution time of fastgamma3() 1e6 iterations is: 16 millisec
fastgamma3() is: 1.489191725597434100000000
Bernard
Valued Contributor I
In the range 0.01 to 1 I was able to achieve an accuracy of 6 decimal places when compared to the Mathematica 8 reference model. In the range 1 to 160 the accuracy is 15-16 decimal places when compared to the reference model.
Speed of execution has been improved by 4.5x when compared to my first implementation based on iterative calculations.
Bernard
Valued Contributor I

nVIDIA's roadmap for exascale has very ambitious designs with ARM cores

It also depends on the support from Microsoft.