My previous thread on sine function optimization grew very large, so Sergey Kostrov asked me to create new threads for my other function implementations.
I am starting with the Stirling approximation of the gamma function. The code is based on formula 6.1.37, p. 257, of the "Handbook of Mathematical Functions". The function was tested against the book's tables and later against Mathematica 8, which serves as the reference model for the accuracy comparison.
There are two implementations. The first is based on 1-D arrays, library pow() calls, and division inside a for-loop to calculate the coefficients and sum the series expansion; it uses 14 terms of the Stirling expansion, while the second uses 10 terms. Note that the slower implementation, despite its extra terms, was not able to increase the accuracy relative to the Mathematica 8 reference result.
The second method is faster because filling and summing the coefficient arrays is eliminated, and no pow() library calls are used. Instead, the function evaluates consecutive powers of the argument x scaled by constant coefficients (a Horner scheme). Pre-calculating whole terms was not possible because the denominators depend on the argument x.
Accuracy, compared to the Mathematica result, is 15-16 decimal digits.
Feel free to optimize and test this code, and please do not forget to post your opinion.
C implementation of the Stirling gamma function
slower version:
[bash]inline double gamma(double x){
    /* NOTE: the original post was cut off inside the for-loop; everything
       from the loop body down is a reconstruction of the A&S 6.1.37 series
       with the repeating + + - - sign pattern of the Stirling expansion. */
    if( x > gamma_huge ){
        return (x-x)/(x-x); /* NaN: argument too large */
    }else if( x < one ){
        return (x-x)/(x-x); /* NaN: argument out of range */
    }else{
        /* absolute values of the series numerators ... */
        static const double coef1[] = {1.0,1.0,139.0,571.0,163879.0,
            5246819.0,534703531.0,4483131259.0,
            432261921612371.0,6232523202521089.0,
            25834629665134204969.0,1579029138854919086429.0,
            746590869962651602203151.0,1511513601028097903631961.0};
        /* ... and the matching denominators */
        static const double coef2[] = {12.0,288.0,51840.0,2488320.0,209018880.0,
            75246796800.0,902961561600.0,86684309913600.0,
            514904800886784000.0,86504006548979712000.0,
            13494625021640835072000.0,9716130015581401251840000.0,
            116593560186976815022080000.0,2798245444487443560529920000.0};
        int const size = 14;
        double two_pi = 2*Pi;
        double sum = 0.0;
        int i;
        for(i = 0; i < size; i++){
            /* Stirling series signs repeat + + - - + + - - ... */
            double sign = (i & 2) ? -1.0 : 1.0;
            sum += sign * coef1[i] / (coef2[i] * pow(x,(double)(i+1)));
        }
        return exp(-x) * pow(x, x-0.5) * sqrt(two_pi) * (one + sum);
    }
}[/bash]
Mostly single precision, but double precision is needed in some cases. Think of the traversal of a deep scene graph, for example: in a big model the camera movements will be jerky when close to a small detail with single precision.

Regarding precision: do you use double-precision floating-point values in your project? Because AFAIK the Windows 7 rendering stack and current display hardware cannot work with more than 10-12 bits per colour component.
Final LDR (low dynamic range) images aren't a good indicator of the precision required for shading; imagine a bright area behind a dark translucent surface. Thus all intermediate shading computations are done with HDR images, and only at the end do we perform the final tone mapping. This is the standard solution also for modern GPU-based realtime renderers.
How do you calculate more esoteric functions like a BSDF, which can be physically based and involve the calculation of an integral (as described in the "Physically Based Rendering" book)?

Rational approximations and LUT-based ones (using look-up tables); AVX2 gather instructions will come in handy to speed up the LUT-based ones.
As I understood it, for internal processing you use double precision, which is converted to single precision for the final result displayed on 10-12-bit RGBA hardware.
No, all shading computations are in single precision (good enough quality and twice the throughput of double precision; also note that 16 bits per component is already considered HDR). Only some geometry transforms are in double precision. I'd say that for the renderer it's 99% float (SP) and 1% double (DP) overall.
The tone mapping operator is also float to float; only the very final stage (gamma conversion + decimation) is float to int.
http://graphics.cs.williams.edu/archive/SweeneyHPG2009/TimHPG2009.pdf
The arrival of an architecture like MIC, which is better able to operate as a typical CPU and will probably be used as a large co-processor to offload the main CPU from vector-intensive floating-point calculations, together with NVIDIA's inability to enter the x86 market and threaten Intel or even AMD, probably means that NVIDIA will disappear like 3dfx.
What is your opinion on this?
Provided that our engine is 100% based on high-level C++ code, there is no reason to keep it x86-only; it depends on future market realities.
Here is the code:
[bash]/* Assumed definitions (not shown in the original post): */
#define one        1.0
#define Pi         3.14159265358979323846
#define gamma_huge 171.0  /* approximate overflow threshold of Gamma(x) in double */

inline double fastgamma3(double x){
    double result = 0.0, sum = 0.0, num, denom;
    if( x >= 0.01f && x <= one ){
        /* MiniMax rational approximation calculated by Mathematica 8 */
        double const coef1 =  6.69569585833067770821885e+6;
        double const coef2 =  407735.985300921332020398;
        double const coef3 =  1.29142492667105836457693e+6;
        double const coef4 =  1.00000000000000000000000000e+00;
        double const coef5 =  6.69558099277749024219574e+6;
        double const coef6 =  4.27571696102861619139483e+6;
        double const coef7 = -2.89391642413453042503323e+6;
        double const coef8 =  317457.367152592609873458;
        num   = coef1 + x*(coef2 + x*coef3);
        denom = coef4 + x*(coef5 + x*(coef6 + x*(coef7 + x*coef8)));
        return num/denom;
    }else if( x >= one && x <= gamma_huge ){
        /* Stirling series in powers of 1/x, evaluated by Horner's scheme */
        double const coef_1  =  0.08333333333333333333333333;
        double const coef_2  =  0.00347222222222222222222222;
        double const coef_3  = -0.00268132716049382716049383;
        double const coef_4  = -0.000229472093621399176954733;
        double const coef_5  =  0.000784039221720066627474035;
        double const coef_6  =  0.0000697281375836585777429399;
        double const coef_7  = -0.000592166437353693882864836;
        double const coef_8  = -0.0000517179090826059219337058;
        double const coef_9  =  0.000839498720672087279993358;
        double const coef_10 =  0.0000720489541602001055908572;
        double two_pi  = 2*Pi;
        double invx    = 1/x;
        double ln      = exp(-x);       /* e^(-x)     */
        double arg     = x - 0.5;
        double power   = pow(x,arg);    /* x^(x-1/2)  */
        double pi_sqrt = sqrt(two_pi);  /* sqrt(2*pi) */
        sum    = ln*power*pi_sqrt;
        result = one + invx*(coef_1 + invx*(coef_2 + invx*(coef_3
               + invx*(coef_4 + invx*(coef_5 + invx*(coef_6
               + invx*(coef_7 + invx*(coef_8 + invx*(coef_9
               + invx*coef_10)))))))));
    }
    return sum*result; /* 0.0 for out-of-range arguments */
}

Speed of execution for the first branch (MiniMax approximation), 1e6 iterations:
fastgamma3() start value is 25488363
fastgamma3() end value is 25488379
execution time of fastgamma3() 1e6 iterations is: 16 millisec
fastgamma3() is: 1.489191725597434100000000
[/bash]
Speed of execution has been improved by 4.5x compared to my first implementation, which was based on iterative calculation of the coefficients.
It also depends on the support from Microsoft. NVIDIA's exascale roadmap has very ambitious designs with ARM cores.