Replacing Python loops: Fancy Slicing and Broadcasting

Ramya_Ravi · ‎12-27-2024

This article was originally published on medium.com.

Posted on behalf of: Bob Chesebrough, Solutions Architect, Intel Corporation

In this article, we will demonstrate how to perform fancy slicing and broadcasting.

Slicing — aka fancy slicing:

Slicing in python, means to target a specified subrange of an array or a subrange of a string. It is specified by using “:” in the square bracket index range.

a[start:stop]  # items start through stop-1
a[start:]      # items start through the rest of the array
a[:stop]       # items from the beginning through stop-1
a[:]           # a copy of the whole array

Slicing should be used rather than a loop to target or select a subrange for various operations. This slicing constructs are wired to handle cache optimization and SIMD vectorization in NumPy and PyTorch.

For machine learning , slicing is commonly used is for doing the necessary offsets for handling time series data.

For example, let’s consider stock price prediction example to perform slicing. It is common in time series data to predict a value x time units in the future. This is done by taking a label and shifting the indexes so that the future label aligns with features from the past.

Also, we know that copying the data as a loop is very inefficient, and for large arrays it is very time consuming.

Here a snippet of stock like data:

Table 1: stock like time series data

The goal here is to train a model to predict a stock price at time T+3 from all observation up to time T (so we want to predict three days into the future).

Table 2 shows how this offset label in time compares to the original times series data. Here we observe up to Dec 20 to predict the price 133.7 that was later observed on Dec 23. We hope to build a predictor that predict out 3 days in advance. We would like to ultimately predict the prices for Dec 29, 30, 31 for which we have data — before we commit our model into deployment.

Table 2: Show alignment of X features to the predicted future values under ‘Predict This’

A simple minded python loop centric way of creating this label column for y, is as follows:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv('Assets/StockTimeSeries.csv')
X = df['Stock'].to_numpy()
Toffset = 3
y = np.zeros([len(X)])
for i in range(len(X) - Toffset 
    y[i] = X[i+Toffset]
# predict for y the last three empty slots (NAN)
X,y

output:
(array([145.9933047, 142.333029 , 125.6291325, 133.715583 , 130.6002384,
        133.9623597, 139.7279166, 137.5097947, 135.1774904, 125.9596717,
        135.2106359, 142.6324476]),
 array([133.715583 , 130.6002384, 133.9623597, 139.7279166, 137.5097947,
        135.1774904, 125.9596717, 135.2106359, 142.6324476,   0.       ,
          0.       ,   0.       ]))

Next, we refactor the above code to use slicing

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv('Assets/StockTimeSeries.csv')
X = df['Stock'].to_numpy()
# Slicing for time series labels
y = X[3:] # train on X except last two, labels are shifting 3 days into the future
# predict for y the last three empty slots (NAN)
X,y

(array([145.9933047, 142.333029 , 125.6291325, 133.715583 , 130.6002384,
        133.9623597, 139.7279166, 137.5097947, 135.1774904, 125.9596717,
        135.2106359, 142.6324476]),
 array([133.715583 , 130.6002384, 133.9623597, 139.7279166, 137.5097947,
        135.1774904, 125.9596717, 135.2106359, 142.6324476]))

To test the potential speedups using slicing, let’s look at a different example - a naive Python loop:

# try naive loop
num = 100_000_001
a = np.linspace(0, num - 1, num=num)
b = np.ndarray(num-1)

timing = {}

t1=time.time()

for i in range(len(a)-1):
    b[i] = a[i+1]

t2=time.time()
print("shift b w loop {} secs".format(t2-t1))
timing['loop'] = (t2-t1)
b[:2], b[-2:]

output:
shift b w loop 12.460822582244873 secs
(array([1., 2.]), array([9.9999999e+07, 1.0000000e+08]))

Next, we will try to speed this up using a list comprehension:

# try list comprehension 
t1=time.time()

b = [a[i+1] for i in range(len(a)-1)] # shift b by 1

t2=time.time()
print("shift b {} secs".format(t2-t1))
timing['list comprehension'] = (t2-t1)
b[:2], b[-2:]

output:
shift b 8.01001787185669 secs
([1.0, 2.0], [99999999.0, 100000000.0])

Next, we will try using a couple of nested NumPy UFuncs (roll and delete):

# try ufunc np.roll
t1=time.time()

b = np.delete( np.roll(a,-1) , -1)  # numpy roll and delete last value

t2=time.time()
print("shift b {} secs".format(t2-t1))
timing['RollDelete'] = (t2-t1)
b[:2], b[-2:]

Now try slicing — fancy slicing!

# try fancy slicing
t1=time.time()
####### Insert corrected code below

b = a[1:]

##################################

t2=time.time()
print("shift b {} secs".format(t2-t1))
timing['slicing'] = (t2-t1)
b[:2], b[-2:]
output:
shift b 3.24249267578125e-05 secs
(array([1., 2.]), array([9.9999999e+07, 1.0000000e+08]))

Now, plot the results.

plt.figure(figsize=(10,6))
plt.title("Time taken to process {:,} records in seconds".format(num),fontsize=12)
plt.ylabel("Time in seconds",fontsize=12)
plt.yscale('log')
plt.xlabel("Various types of operations",fontsize=14)
plt.grid(True)
plt.xticks(rotation=-60)
plt.bar(x = list(timing.keys()), height= list(timing.values()), align='center',tick_label=list(timing.keys()))
print('Acceleration : {:4.0g} X'.format(timing['loop']/(timing['slicing'])))

Graph 1: Speedup of slicing in this time series example: over 1x10⁵ X speedup!

Broadcasting:

According to Broadcasting SciPy.org, the term broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes.

In a Linear Algebra class there is no such notion at broadcasting

In Linear Algebra:

Two matrices can be added or subtracted ONLY if their dimensions are exactly the same (let’s say (both MxN)
Two matrices can be multiplied ONLY if their inner dimensions are the same A(5,2) x B (2,7) = (5,7)

In NumPy and PyTorch, broadcasting is a way to speed up computation by essentially making the dimensions the correct size by copying data from the smaller dimension until the resulting array is suitable for Linear Algebra addition or multiplication.

The one exception: Outer product is the same in linear Algebra and NumPy, at least in the case of (Nx1) x (1xN) = (NxN)

Broadcasting rules and conditions:

A set of arrays can be considered ‘broadcastable’, if the below set of conditions are met:

The shape of the arrays are the same or the ending dimensions for both arrays match
If an array has a different number of dimension to another array, then the array with the lesser dimension is extended by 1 until the number of dimension of both arrays are equal
If two arrays are to be broadcastable, then the array with dimension size of 1 is to behave in a manner that enables the array with lesser dimension to be expanded by duplication to the length of the second array. (eg. a (3,1) is promoted to (3,3) by copying to fill dimension)

For better intuition about broadcasting — watch this excellent presentation by: Jake Vanderplas Losing your Loops Fast Numerical Computing with NumPy.

Example of normalizing California housing dataset

This example combines NumPy aggregations with broadcasting

But the NumPy vectorized code is MUCH more readable and easier to debug and explain. Here’s how the broadcasting works here for the Standardization method:

X is two dimensional array
X.mean(axis = 0) is a vector (1,3)
X.std(axis = 0) is a vector (1,3)
both small vectors (1,3) are broadcast up to the dimension of X (3,3)
in Linear algebra class you have to add/subtract arrays of same dimensions A is two dimensional array

Note: the scikit-learn standardize is about 2x faster that these methods. Always look to use optimized libraries such as these in deference to your own loops but also in deference to your own NumPy rolled one liners — but let performance and readability be your guide here.

from sklearn.datasets import fetch_california_housing

california_housing = fetch_california_housing(as_frame=True)
X = california_housing.data.to_numpy()

def standardizeLoop(X): # about 15 lines of code, more to go wrong
    C = X.shape[-1]
    L = len(X)
    tmp = np.zeros(X.shape)

    for col in range(C):
        mean = 0
        total = 0
        for row in range(len(X)):
            total += X[row, col]
        mean = total/L
        diff2 = 0
        for row in range(L):
            d = X[row,col] - mean
            diff2 += d*d
            tmp[row, col] = X[row, col] - mean
        for row in range(L):
            tmp[row, col] = tmp[row, col]/np.sqrt(diff2/(L))
    return tmp

def standardize(A): # one line of code
#     A is two dimensional array
#     A.mean(axis = 0) is a vector (1,3)
#     A.std(axis = 0) is a vector (1,3)
#     both scalars are broadcast up to the dimension of A (3,3)
#     in Linear algebra Class you have to add/subtract arrays of same dimensions
    #return (A - A.mean(axis=0))/A.std(axis=0)
    return (A - np.mean(A, axis=0))/np.std(A, axis=0)

print("standardize Loop")
print(standardizeLoop(X))

print("standardize NumPy")
print(standardize(X))



timing = {}
t1 = time.time()
standardizeLoop(X)
timing['loop'] = time.time() -t1

t1 = time.time()
standardize(X)
timing['numpy'] = time.time() -t1
print(timing)
print(f"Acceleration {timing['loop']/timing['numpy']:4.1f} X")

output:
{'loop': 0.1822032928466797, 'numpy': 0.0007750988006591797}
Acceleration 235.1 X

In the above code, the broadcasting occurs in this line:

return (A - np.mean(A, axis=0))/np.std(A, axis=0)

A is an array of shape (10000,), while np.mean(A, axis=0) is a scalar with dimension ().

We are combining shapes of (10000,) and () which requires that the scalar be stretched to match the dimensions of (1000,) in order to perform the operation.

By using broadcasting, we speed up our code over naive loop method by over 200X on this particular Intel® Xeon® node.

Get the code for this article (Jupyter notebook: 08_04_NumPy_Broadcasting.ipynb) and the rest of the series on GitHub.

Next Steps

Try out this code sample using the standard free Intel® Tiber™ AI Cloud account and the ready-made Jupyter Notebook.

We encourage you to also check out and incorporate Intel’s other AI/ML Framework optimizations and tools into your AI workflow and learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio to help you prepare, build, deploy, and scale your AI solutions.

Intel Tiber AI Cloud System Configuration as tested:

x86_64, CPU op-mode(s): 32-bit, 64-bit, Address sizes: 52 bits physical, 57 bits virtual, Byte Order: Little Endian, CPU(s): 224, On-line CPU(s) list: 0–223, Vendor ID: GenuineIntel, Model name: Intel® Xeon® Platinum 8480+, CPU family: 6, Model: 143, Thread(s) per core: 2, Core(s) per socket: 56, Socket(s): 2
Stepping: 8, CPU max MHz: 3800.0000, CPU min MHz: 800.0000