Compare Fast Inverse Square Root Method to NumPy ufuncs, Numba JIT, and Cython — Which One Wins?
This article was originally published on medium.com.
Posted on behalf of: Bob Chesebrough, Solutions Architect, Intel Corporation
This article shows the efficient variations in the way we approach computing Reciprocal Sqrt and these approaches can perform 3 to 4 orders of magnitude faster than older methods.
The approach you choose depends on your need for accuracy, speed, reliability, maintainability. See which one scores highest in your needs assessment!
What is reciprocal sqrt?
It is a function composed of two operations:
- Reciprocal (Multiplicative Inverse)
- Square root of a value or vector
Where is it used?
Physics/ Engineering
- Partial derivative of distance formula with respect to one dimensions, as follows:
Special Relativity Lorentz transformation
- The gamma coefficient is used to compute time dilation or length contraction among other calculations in special relativity
3D Graphics or ML applications needing normalization in space
- Vector normalization in video games and is mostly used in calculations involved in 3D programming — This is why the Fast Reciprocal Sqrt was invented for Quake III
This is an interesting topic, as it intersects recent history of algorithms, hardware and software library implementations. Methods for computing this quickly even as an approximation at first outstripped the instructions sets on modern x86 architecture. Two Berkeley guys, William Kahan and K.C. Ng wrote an unpublished paper in May 1986 describing how to calculate the square root using bit-fiddling techniques followed by Newton iterations. This was picked up by Cleve Moller and company of MATLAB* fame, Cleve’s associate Greg Walsh devised the now-famous constant and fast inverse square root algorithm. Read the fascinating history on Wikipedia.
It is still common when searching on the web and you will find many articles and advise on using this algorithm today. This algorithm is an ingenious application of Newton-Rapson method to get fast approximations of this 1/sqrt().
Then the introduction by Intel of the Pentium III in 1999 saw a new Streaming SIMD Extensions (SSE) instruction which would compute reciprocal sqrt as part of SSE instruction set — a vectorized instruction!
What is the best algorithm now?
That all depends on perspective. The Fast Reciprocal Sqrt is a clever trick, and from what I see its accuracy is with about 1% of the actual value. SSE and Advanced Vector Extensions (AVX) are vectorized instructions that are based on IEEE floating point standards, but this is also an approximation. Its accuracy depends on the nature of the data type you choose but is typically far better than 1%. When dealing floating point calculations such as these — the order of operations and how you accumulate partial sums etc matter. So let me provide you the code and a little discussion of what I found, and you be the judge!
What I test in this notebook? (see GitHub link at the end)
In this workbook, we will test the following approaches:
PyTorch rsqrt:
- Use torch built in function rsqrt()
NumPy_Compose RecipSqrt
- Use NumPy np.reciprocal(np.sqrt())
NumPy_Simple
- Use NumPy implicit vectors: b = a**-.5
Cython_Simple
- Use Cython variant of Simple a**-.5
Numba_Simple
- Use Numba njit variant of Simple a**-.5
BruteForceLoopExact
- Brute force loop approach no vectorization at all
Fast Reciprocal Sqrt Newton-Raphson simple Loop
- Fast Reciprocal Sqrt using Newton Raphson and Quake III approach
Fast Reciprocal Sqrt Newton-Raphson Vectorized
- Fast Reciprocal Sqrt using Newton Raphson and Quake III approach vectorized with np.apply
Fast Reciprocal Sqrt Newton-Raphson Cython
- Fast Reciprocal Sqrt using Newton Raphson and Quake III approach in Cython.
Where did the results finish?
It depends on the platform
On the new Intel® Tiber™ Developer Cloud: Ubuntu 22, I see the following (Intel® Xeon® Platinum 8480L, 224 core, 503GB RAM)
Testing Various algorithms and Optimizations for Inverse Square Root
My results tend to align with the observation of Doug Woo’s article, and I see orders of magnitude speedup using built in functions in NumPy and PyTorch.
Get the code for this article and the rest of the series is located on GitHub.
Next Steps
Try out this code sample using the standard free Intel Developer Cloud account and the ready-made Jupyter Notebook.
We encourage you to also check out and incorporate Intel’s other AI/ML Framework optimizations and end-to-end portfolio of tools into your AI workflow and learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio to help you prepare, build, deploy, and scale your AI solutions.
Intel Developer Cloud System Configuration as tested:
x86_64, CPU op-mode(s): 32-bit, 64-bit, Address sizes: 52 bits physical, 57 bits virtual, Byte Order: Little Endian, CPU(s): 224, On-line CPU(s) list: 0–223, Vendor ID: GenuineIntel, Model name: Intel® Xeon® Platinum 8480+, CPU family: 6, Model: 143, Thread(s) per core: 2, Core(s) per socket: 56, Socket(s): 2, Stepping: 8, CPU max MHz: 3800.0000, CPU min MHz: 800.0000
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.