Not your SQL Select clause — Using Where and Select to vectorize conditional logic
This article was originally published on medium.com.
Posted on behalf of: Bob Chesebrough, Solutions Architect, Intel Corporation
This article demonstrates how to vectorize a loop even though it contains tricky conditional logic.
The steps required are:
- Find a loop with a large number of iterations
- If it contains conditional logic, consider torch.where, np.where or np.select
- Otherwise, try to find a NumPy or PyTorch replacement using ufuncs, aggregations, etc.
NumPy.where & PyTorch.where
Figure 1. Where statement is for more simple logic conditions
One thing that can prevent a loop from vectorizing effectively is conditional logic: if/then/else statements inside the loop body. NumPy and PyTorch where let us handle such loops in a fast, vectorized way, applying the conditional logic to a whole array to create a new column or update the contents of an existing one.
Syntax:
- numpy.where(condition, [x, y, ]/)
- Return elements chosen from x or y depending on condition.
Let’s look at a simple example to see what NumPy where does: it adds 50 to all elements greater than 5.
import numpy as np
a = np.arange(10)
np.where(a > 5, a + 50, a )
# if a > 5 then return a + 50
# else return a
output:
array([ 0, 1, 2, 3, 4, 5, 56, 57, 58, 59])
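To make the correspondence with a loop concrete, here is a minimal sketch that writes out the element-by-element logic and checks it against the np.where call. The variable names `looped` and `vectorized` are illustrative only. Note that np.where evaluates both branches (`a + 50` and `a`) for the whole array and then picks per element, so it is a selection rather than a short-circuiting if.

```python
import numpy as np

a = np.arange(10)

# Explicit loop: the same rule np.where applies to every element at once.
looped = np.empty_like(a)
for i in range(len(a)):
    looped[i] = a[i] + 50 if a[i] > 5 else a[i]

# Vectorized version of the same rule.
vectorized = np.where(a > 5, a + 50, a)

# Both approaches should agree element by element.
print(np.array_equal(looped, vectorized))
```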
The above pattern comes in handy in many AI applications. Take labeling data as an example: there may be better ways to discretize data, but here is a simple example of converting continuous data into categorical values.
arr = np.array([11, 1.2, 12, 13, 14, 7.3, 5.4, 12.5])
Let’s say all values 10 and above represent a medical parameter threshold that indicates further testing and values below 10 indicate normal range.
So, the output words will be something like [‘More Testing’, ‘Normal’, ‘More Testing’, ‘More Testing’, …].
arr = np.array([11, 1.2, 12, 13, 14, 7.3, 5.4, 12.5])
np.where(arr < 10, 'Normal', 'More Testing')
output:
array(['More Testing', 'Normal', 'More Testing', 'More Testing',
'More Testing', 'Normal', 'Normal', 'More Testing'], dtype='<U12')
Now, let’s discretize the data into integer bins for use in a classifier:
# Simple NumPy Binarizer Discretizer
# convert continuous data to discrete integer bins
arr = np.array([11, 1.2, 12, 13, 14, 7.3, 5.4, 12.5])
print(np.where(arr < 6, 0, np.where(arr < 12, 1, 2)))
output:
[1 0 2 2 2 1 0 2]
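Nested where calls become hard to read as the number of bins grows. As an alternative sketch (not part of the original notebook), np.digitize can express the same binning with a list of bin edges. The edges 6 and 12 are taken from the example above.

```python
import numpy as np

arr = np.array([11, 1.2, 12, 13, 14, 7.3, 5.4, 12.5])

# Nested np.where, as in the article.
nested = np.where(arr < 6, 0, np.where(arr < 12, 1, 2))

# With bins=[6, 12]: 0 for x < 6, 1 for 6 <= x < 12, 2 for x >= 12.
binned = np.digitize(arr, bins=[6, 12])

print(nested)
print(binned)
```

Both calls should produce the same bin labels for this data. Adding a bin then only means adding another edge to the list.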
NumPy where can be used to create index masks so that you can select or update masked items from an array.
In this example, I want to find all (row, col) positions holding multiples of 12 or multiples of 9 in a 10x10 multiplication table and set all other values to 0, while preserving the first row and first column as readable indexes for the table:
## one solution - preserves the indexing edges for easy checking
# Assumed setup (not shown in the original): a 10x10 multiplication table
MultiplicationTable = np.outer(np.arange(1, 11), np.arange(1, 11))
res = np.where((MultiplicationTable%12 == 0) | (MultiplicationTable%9 == 0), MultiplicationTable, 0)
res[0,:] = MultiplicationTable[0,:]
res[:,0] = MultiplicationTable[:,0]
res
output:
array([[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
[ 2, 0, 0, 0, 0, 12, 0, 0, 18, 0],
[ 3, 0, 9, 12, 0, 18, 0, 24, 27, 0],
[ 4, 0, 12, 0, 0, 24, 0, 0, 36, 0],
[ 5, 0, 0, 0, 0, 0, 0, 0, 45, 0],
[ 6, 12, 18, 24, 0, 36, 0, 48, 54, 60],
[ 7, 0, 0, 0, 0, 0, 0, 0, 63, 0],
[ 8, 0, 24, 0, 0, 48, 0, 0, 72, 0],
[ 9, 18, 27, 36, 45, 54, 63, 72, 81, 90],
[10, 0, 0, 0, 0, 60, 0, 0, 90, 0]])
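When the goal is the positions themselves rather than a masked copy, np.where can be called with only a condition, in which case it returns one index array per dimension. The sketch below assumes the table is built with np.outer, since the article does not show its construction. The names `table`, `cond` and `positions` are illustrative.

```python
import numpy as np

# Assumed construction of the 10x10 multiplication table.
table = np.outer(np.arange(1, 11), np.arange(1, 11))

cond = (table % 12 == 0) | (table % 9 == 0)

# With only a condition, np.where returns one index array per dimension.
rows, cols = np.where(cond)

# Pair them up as (row, col) positions and show the first few.
positions = list(zip(rows.tolist(), cols.tolist()))
print(positions[:5])

# The same index arrays can be used to read the matching values directly.
print(table[rows, cols][:5])
```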
NumPy Where applied to California Housing data
In an AI context, this could mean applying a categorical classifier to otherwise continuous values. For example, in the California housing dataset, the target price variable is continuous. In the following fictitious scenario, a new stimulus package is being considered whereby new house buyers receive a coupon worth 50,000 off the purchase of houses in California whose price (prior to the coupon) is between 250,000 and 350,000. Other prices are unaffected. The goal is to generate an array with the adjusted targets.
Let’s compare the naive Python loop versus the NumPy Where clause — examine for readability, maintainability, speed, etc.
# Fictitious scenario:
import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
california_housing = fetch_california_housing(as_frame=True)
X = california_housing.data.to_numpy()
# Targets are in units of $100,000, so scale the thresholds to match
buyerPriceRangeLo = 250_000/100_000
buyerPriceRangeHi = 350_000/100_000
T = california_housing.target.to_numpy()
t1 = time.time()
timing = {}
New = np.empty_like(T)
for i in range(len(T)):
    if (T[i] < buyerPriceRangeHi) & (T[i] >= buyerPriceRangeLo):
        New[i] = T[i] - 50_000/100_000
    else:
        New[i] = T[i]
t2 = time.time()
plt.title( "California Housing Dataset - conditional Logic Applied")
plt.scatter(T, New, color = 'b')
plt.grid()
print("time elapsed: ", t2-t1)
timing['Loop'] = t2-t1
Figure 2. Naive loop implementation of house price coupon
Next, examine the NumPy Where approach:
t1 = time.time()
#############################################################################
### Exercise: Add one modification in the code below to compute same results as above loop
New = np.where((T < buyerPriceRangeHi) & (T >= buyerPriceRangeLo), T - 50_000/100_000, T )
##############################################################################
t2 = time.time()
plt.scatter(T, New, color = 'r')
plt.grid()
print("time elapsed: ", t2-t1)
timing['np.where'] = t2-t1
print("Speedup: {:4.1f}X".format( timing['Loop']/timing['np.where']))
Figure 3. NumPy Where implementation of house price coupon
We generated the same data with NumPy where as with the original loop, but did so about 13X faster (the exact speedup will vary from run to run and between systems).
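A single time.time() measurement can be noisy, especially for the fast vectorized call. The sketch below is one way to check both claims (same results, faster execution) more carefully. It uses synthetic data as a hypothetical stand-in for the housing targets so that it runs without scikit-learn, and the function names `with_loop` and `with_where` are illustrative.

```python
import timeit
import numpy as np

# Synthetic stand-in for the housing targets (units of $100,000); hypothetical data.
rng = np.random.default_rng(0)
T = rng.uniform(0.15, 5.0, size=20_000)
lo, hi, coupon = 2.5, 3.5, 0.5

def with_loop(T):
    out = np.empty_like(T)
    for i in range(len(T)):
        out[i] = T[i] - coupon if lo <= T[i] < hi else T[i]
    return out

def with_where(T):
    return np.where((T >= lo) & (T < hi), T - coupon, T)

# Confirm both produce the same values before comparing speed.
same = np.array_equal(with_loop(T), with_where(T))
print("same results:", same)

# Take the best of several repeats to reduce timing noise.
loop_time = min(timeit.repeat(lambda: with_loop(T), number=1, repeat=3))
where_time = min(timeit.repeat(lambda: with_where(T), number=10, repeat=3)) / 10
print(f"loop: {loop_time:.4f}s  where: {where_time:.6f}s  speedup: {loop_time / where_time:.1f}X")
```

The measured ratio depends on array size, hardware and NumPy version, so it should not be expected to match the figure quoted above.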
NumPy Select statement:
Figure 4. Select statement is for more complex logic conditions
The select statement is available in NumPy but not yet in PyTorch, although we will provide a code snippet which can emulate the select clause using PyTorch. The select statement handles much more complex logic conditions than the where statement.
Like where, it applies conditional logic to an array to create a new column or update the contents of an existing column.
Syntax:
- numpy.select(condlist, choicelist, default=0)
- Return an array drawn from elements in choicelist, depending on conditions.
This is a very useful function for handling conditionals that would otherwise slow down a map or apply, or else make the code harder to read.
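Where only a where function is available, one possible way to emulate select is to chain where calls, folding from the last condition back to the first so that earlier conditions take priority. The sketch below does this in NumPy so the result can be checked against np.select. The helper name `select_via_where` is hypothetical, and carrying the same pattern over to torch.where is an assumption that is not tested here.

```python
import numpy as np

def select_via_where(condlist, choicelist, default=0):
    """Hypothetical helper: emulate np.select with chained np.where calls."""
    result = default
    # Fold from the last condition to the first so earlier conditions win,
    # matching np.select's first-match behavior.
    for cond, choice in zip(reversed(condlist), reversed(choicelist)):
        result = np.where(cond, choice, result)
    return result

x = np.arange(8)
conds = [x < 3, x < 6]
choices = [x * 10, x + 100]

emulated = select_via_where(conds, choices, default=-1)
reference = np.select(conds, choices, default=-1)

print(emulated)
print(reference)
```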
In the below example, we will create some new data:
import numpy as np
import time
BIG = 10_000_000
np.random.seed(2022)
A = np.random.randint(0, 11, size=(BIG, 6))
Then, apply some crazy logic to update various columns of the array.
timing = {}
t1 = time.time()
for i in range(BIG):
    if A[i,4] == 10:
        A[i,5] = A[i,2] * A[i,3]
    elif (A[i,4] < 10) and (A[i,4] >=5):
        A[i,5] = A[i,2] + A[i,3]
    elif A[i,4] < 5:
        A[i,5] = A[i,0] + A[i,1]
t2 = time.time()
baseTime = t2- t1
print(A[:5,:])
print("time: ", baseTime)
timing['Naive Loop'] = t2 - t1
output:
[[ 0 1 1 0 7 1]
[ 2 8 0 5 9 5]
[ 3 8 0 3 6 3]
[ 0 10 10 1 2 10]
[ 5 7 5 1 7 6]]
time: 5.937685012817383
Try Vectorizing with masks
Here, remove the references to i, remove the loop, and create a boolean mask for each condition.
# Try Vectorizing simply
t1 = time.time()
mask1 = A[:,4] == 10
A[mask1,5] = A[mask1,2] * A[mask1,3]
# Use elementwise & on the boolean arrays; .any() and `and` would collapse them to one value
mask2 = (A[:,4] < 10) & (A[:,4] >= 5)
A[mask2,5] = A[mask2,2] + A[mask2,3]
mask3 = A[:,4] < 5
A[mask3,5] = A[mask3,0] + A[mask3,1]
t2 = time.time()
print(A[:5,:])
print("time :", t2-t1)
fastest_time = t2-t1
Speedup = baseTime / fastest_time
print("Speed up: {:4.0f} X".format(Speedup))
timing['Vector Masks'] = t2 - t1
output:
[[ 0 1 1 0 7 1]
[ 2 8 0 5 9 5]
[ 3 8 0 3 6 3]
[ 0 10 10 1 2 10]
[ 5 7 5 1 7 6]]
time : 0.23482632637023926
Speed up: 25 X
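A row mask has to hold one True/False value per row, which is why compound conditions on arrays are combined with the elementwise `&` operator rather than Python's `and`, and why reductions such as .any() are not suitable for building masks. The small array below is a made-up example to illustrate the difference.

```python
import numpy as np

col = np.array([10, 7, 3, 5, 9])

# Elementwise: one True/False per row, usable as a row mask.
mask = (col < 10) & (col >= 5)
print(mask)

# Python's `and` needs a single truth value, which a multi-element array cannot give.
try:
    bad = (col < 10) and (col >= 5)
except ValueError as err:
    print("ValueError:", err)

# .any() collapses the whole column to one boolean, so it cannot select rows.
collapsed = col.any()
print(collapsed)
```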
Next, try NumPy.select — Much cleaner logic
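Here, we put the conditions in one list, the corresponding choices in a second list of the same length, and choose a default for rows that match no condition.
t1 = time.time()
condition = [ (A[:,4] < 10) & (A[:,4] >= 5),
              ( A[:,4] < 5)]
choice = [ (A[:,2] + A[:,3]),
           (A[:,0] + A[:,1] ) ]
default = A[:,2] * A[:,3]  # applies where A[:,4] == 10
A[:,5] = np.select(condition, choice, default= default )
t2 = time.time()
print(A[:5,:])
print("time :", t2-t1)
print("Speed up: {:4.0f} X".format(baseTime / (t2-t1)))
timing['np.select'] = t2 - t1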
output:
[[ 0 1 1 0 7 1]
[ 2 8 0 5 9 5]
[ 3 8 0 3 6 3]
[ 0 10 10 1 2 10]
[ 5 7 5 1 7 6]]
time : 0.4723508358001709
Speed up: 13 X
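One detail to keep in mind when building the condition list: where several conditions are true for the same element, np.select takes the choice for the first one in the list. The conditions in the example above do not overlap, so their order does not matter there. The made-up sketch below shows what happens when they do overlap.

```python
import numpy as np

x = np.array([2, 7, 10])

# Overlapping conditions: every x < 5 is also x < 10, so order decides the result.
narrow_first = np.select([x < 5, x < 10], [0, 1], default=2)
broad_first = np.select([x < 10, x < 5], [1, 0], default=2)

print(narrow_first)
print(broad_first)
```

With the broad condition listed first, the narrower `x < 5` branch is never reached, much like an elif that follows a more general if.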
plt.figure(figsize=(10,6))
plt.title("Time taken to process {:,} records in seconds".format(BIG),fontsize=12)
plt.ylabel("Time in seconds",fontsize=12)
plt.xlabel("Various types of operations",fontsize=14)
plt.grid(True)
plt.xticks(rotation=-60)
plt.bar(x = list(timing.keys()), height= list(timing.values()), align='center',tick_label=list(timing.keys()))
From the above figure, we can observe a 13X speedup over the naive Python loop when using NumPy select in this simple example.
Get the code for this article (Jupyter notebook: 08_05_NumPy_Where_Select.ipynb) and the rest of the series on GitHub.
Next Steps
Try out this code sample using the standard free Intel® Tiber™ AI Cloud account and the ready-made Jupyter Notebook.
We encourage you to also check out and incorporate Intel’s other AI/ML Framework optimizations and tools into your AI workflow and learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio to help you prepare, build, deploy, and scale your AI solutions.
Intel Tiber AI Cloud System Configuration as tested:
x86_64, CPU op-mode(s): 32-bit, 64-bit, Address sizes: 52 bits physical, 57 bits virtual, Byte Order: Little Endian, CPU(s): 224, On-line CPU(s) list: 0–223, Vendor ID: GenuineIntel, Model name: Intel® Xeon® Platinum 8480+, CPU family: 6, Model: 143, Thread(s) per core: 2, Core(s) per socket: 56, Socket(s): 2
Stepping: 8, CPU max MHz: 3800.0000, CPU min MHz: 800.0000