
Accelerate Numerical Calculations in NumPy with Intel® oneAPI Math Kernel Library (oneMKL)


by

Bob Chesebrough, Solutions Architect

Intel Corporation

 

In this article, I will describe why some kinds of numerically intensive tasks run very slowly in Python*, and explore steps to dramatically accelerate numerical calculations using the NumPy* package. But the magic is not simply using the NumPy ndarray; that is not really where the dramatic acceleration comes from. It comes from replacing Python loops with single NumPy function calls built on top of the Intel® oneAPI Math Kernel Library (oneMKL). These changes streamline the code and can make it:

  1. Simpler,
  2. More readable and maintainable,
  3. Run faster on existing hardware, and
  4. Ready for future hardware advances.

 

Why I am a big fan of Python

Over the years I have programmed professionally using Fortran, C/C++, Java, and Python. But for the last several years, Python has become my home base, my go-to language for crunching numbers, manipulating GPS data for dinosaur bone hunting, handling text-based chores such as finding commonalities in research papers, and more.  The reasons I love Python are:

  1. It is great for rapidly prototyping ideas.
  2. You can tackle almost every imaginable coding task in a few lines of code.
  3. Examples of code snippets are everywhere!
  4. It is easy and fast to code in: it is dynamically typed, which makes programming easy.
  5. It is an interpreted language, so there is no need to change gears to compile, link, and then execute.
  6. Pythonic code is very intuitive: even when I do not know the exact syntax, I can often guess a parameter name and it just works.
  7. For AI, it is fast to implement, with little to no porting across x86 (Intel or AMD) or even IBM* Power systems.

 

When and why Python is slow at massively repeated low-level tasks

Python is slow for things such as:

  • Repeated low-level tasks
  • Large loops
  • Nested loops
  • Large list comprehensions

Since Python is dynamically typed, every time we touch a new element of a list, for example, Python must determine what type of object is in that position in the list and which operations are valid for that object. It has to wrangle with reference counters for that object, and there is no guarantee that what we think should be consecutive objects in the list are actually consecutive in memory. In Figure 1 below, I am adding two lists element by element, one loop iteration at a time. Every read or write incurs this overhead.

Figure 1. Adding two lists in Python
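To make that overhead concrete, here is a minimal sketch of the kind of element-by-element list addition Figure 1 depicts (the names and sizes are illustrative, not from the article):

a = [1.0] * 1_000_000
b = [2.0] * 1_000_000

c = []
for i in range(len(a)):
    # every iteration pays the interpreter's per-object tax:
    # type checks, reference counting, and pointer chasing
    c.append(a[i] + b[i])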

 

Compare the efficiency of the above approach to that illustrated in Figure 2 below, where SIMD operations (utilizing oneMKL) accessible via NumPy functions and operators complete the same work 8 times faster.


Figure 2. Adding two NumPy arrays with built-in NumPy operators, using oneMKL SIMD vectorization.
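For contrast, the vectorized equivalent of the loop above collapses to a single line. This is a minimal sketch, again with illustrative names and sizes:

import numpy as np

a = np.ones(1_000_000, dtype=np.float32)
b = np.full(1_000_000, 2.0, dtype=np.float32)

# one vectorized call; the loop runs inside optimized, SIMD-friendly C code
c = a + b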

 

Since the data type is specified for the entire NumPy ndarray, it is naturally laid out in a "C"-like memory pattern. This allows fast referencing of the data and efficient use of the underlying cache lines and SIMD registers, which on modern x86 architectures (specifically, Intel® Xeon® Scalable processors) are 512 bits wide. If the data elements above are 32-bit floats, 16 concurrent floating point operations can be carried out per loop iteration (512 bits divided by 32 bits = 16 lanes); only 8 are shown above due to spatial limitations on a written page. This reduces the number of loop iterations I need by a factor of 16 as well. And this is NOT threading! This benefit occurs per core.
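To make the lane arithmetic concrete, here is a small sketch of that calculation, assuming the 512-bit AVX-512 register width mentioned above:

import numpy as np

simd_width_bits = 512  # AVX-512 register width on Intel® Xeon® Scalable processors
lanes = simd_width_bits // (np.dtype(np.float32).itemsize * 8)
print(lanes)  # 16: sixteen float32 operations per SIMD instruction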

Figure 3 shows a fictionalized memory layout of 8 integers stored in a Python list; the memory locations are likely scattered across memory in a random fashion.


Figure 3. Fictionalized memory layout of 8 integers in a Python list.

 

In Figure 4 below, we contrast the scattered memory of Figure 3 with an ndarray laid out in N contiguous memory locations.


Figure 4. An ndarray with a contiguous memory layout

 

Because memory is now contiguous, we can efficiently access these items with a single fetch from memory of 8 or 16 items, or however many fit in a 512-bit register. This is possible for each thread running on a CPU core!
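You can check this layout yourself. Here is a quick sketch, assuming a small int64 array:

import numpy as np

a = np.arange(8, dtype=np.int64)
print(a.flags['C_CONTIGUOUS'])  # True: all 8 integers sit side by side
print(a.strides)                # (8,): each element starts 8 bytes after the previous one

A Python list, by contrast, stores pointers to boxed objects, so no such guarantee of contiguity exists.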

Now consider what happens to cache utilization in the two cases. The cache story is analogous to the following:

Consider that you have several guests staying with you overnight. Each guest, in turn, comes out for breakfast and requests a single egg. You go to the refrigerator, grab a carton of eggs (representing what is known as a single cache line, containing say 16 elements), and fry one random egg from the carton.

The next guest drops in for breakfast and requests a single egg. But in your harried state of mind from having so many guests, you forget you already have a mostly unused carton of eggs on the countertop, and dutifully head back to the refrigerator to grab another carton, only to fry a single random egg from it! See Figure 5.


Figure 5. Showing a single egg being retrieved from each carton

 

You repeat this procedure for each guest until the countertop is full of mostly unused egg cartons and they crash to the floor, resulting in broken eggs. This is similar to a cache line eviction! See Figure 6 below.


Figure 6. Showing the cartons knocked off the countertop, analogous to a cache line eviction

 

Now consider how a more efficient chef might prepare the eggs. As each guest arrives and asks for an egg, you wisely consume all the eggs in one carton before going back to the refrigerator to grab another.


Figure 7. Showing a wise use of egg cartons analogous to a wise use of cache lines

 

The combined effect of efficient cache utilization and SIMD vectorization makes up the bulk of the acceleration provided by NumPy functions and operators.

The question now becomes: how do I coax my Python code, with all those loops, into the more efficient NumPy vectorized form? You do it by "loop replacement"! Find loops with large trip counts, or nested loops with a large overall trip count, and replace them with one or a few vectorized NumPy calls instead.

 

Search and replace loops with NumPy functions and operators

Consider the simple loop-centric approach to computing the sine of each element of a list of 10,000 elements, shown in Figure 8 below.

 

import time
import numpy as np

timing = {}

a = []
t1 = time.time()
for i in range(10_000):
    a.append(np.sin(i))
t2 = time.time()
print(f"With for loop and appending it took {t2 - t1} seconds")
timing['loop'] = t2 - t1

 

Figure 8. Computing the sine of each element of a list

 

One way, though not a recommended way, of using NumPy is to replace the code above with what is shown in Figure 9 below.

 

a = np.array([])
t1 = time.time()
for i in range(10_000):
    # np.append allocates a new array and copies every element, every iteration
    a = np.append(a, np.sin(i))
t2 = time.time()
print(f"With for loop and appending NumPy it took {t2 - t1} seconds")
timing['badNumPyLoop'] = t2 - t1

 

Figure 9. Naïve and slow approach to replacing the "functionality" of the code in Figure 8

 

In Figure 9, we merely replaced each line of code from Figure 8 with a NumPy equivalent. This is not the goal. Instead, the goal is to replace the loop.

On my system, the code in Figure 9 was more than 3 times slower than the original Python loop, because np.append copies the entire array on every iteration.

Now let us achieve the goal of replacing the loop with a SIMD vectorized friendly version of Figure 8.

In Figure 10 below, I show how to write more efficient NumPy code that replaces the original loop (well, hides it inside the optimized "C" oneMKL implementations underpinning NumPy).

 

a = np.linspace(0, 10_000, num=10_000 + 1)
t1 = time.time()
a = np.sin(a)  # a single vectorized ufunc call replaces the whole loop
t2 = time.time()
print(f"With linspace and ufunc it took {t2 - t1} seconds")
timing['numpy'] = t2 - t1

 

Figure 10. Removing the loop is more readable and faster

Try it yourself

Create two Jupyter notebook cells, one with the code from Figure 8 and the other with the code from Figure 10. Plot your acceleration (hint: you might want to use a semi-log plot to capture your performance gains). You will be happily surprised by your results.
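If you want a starting point for the plot, here is one possible sketch that reuses the timing dictionary built up in the figures above (matplotlib is my assumption here; any plotting library will do):

import matplotlib.pyplot as plt

# 'timing' is the dictionary populated by the cells from Figures 8 and 10
plt.bar(list(timing.keys()), list(timing.values()))
plt.yscale('log')  # semi-log scale keeps large speedups visible
plt.ylabel('elapsed time in seconds (log scale)')
plt.title('Python loop vs. NumPy ufunc')
plt.show()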

In future articles in this series, I will demonstrate the value of various NumPy functions and operators which accelerate and simplify code in a similar manner by taking advantage of Intel® oneMKL under the hood!

In the new series, I plan to discuss the following topics:

  1. NumPy UFUNCS
  2. NumPy Aggregations
  3. NumPy Broadcasting and Fancy Slicing
  4. NumPy Sorting
  5. NumPy Where and Select to vectorize loops with conditional logic
  6. Applications to Pandas Apply statements
  7. What to do when the loops get large and complex

Useful Resources

To dive deeper into oneMKL, we encourage you to check out the AI, HPC, and Rendering tools in Intel's oneAPI-powered software portfolio.
