This article was originally published on medium.com.
Posted on behalf of: Bob Chesebrough
This article shows how to apply the NumPy select tricks explained below to accelerate a Pandas apply() statement that is hindered by conditional logic. Code for this article is available on GitHub.
Here, we will cover the following topics:
- Apply WHERE or SELECT in NumPy powered by oneAPI to dramatically speed up certain common Pandas bottlenecks
- Achieve good performance using NumPy SELECT on a Pandas data frame
- Achieve even better performance by converting the data frame to NumPy arrays
It is important to know how to speed up Pandas natively via its dependence on NumPy. Pandas is powered by oneAPI via NumPy!
When the opportunity arises, it is often highly profitable to solve a Pandas apply() performance issue the NumPy way. Given the size of many data frames, it is frequently worth the effort to find a NumPy formulation instead.
While not yet a part of this guide, AI Tools from Intel includes a component called Modin*, a drop-in replacement for Pandas that can dramatically speed up Pandas operations. Modin can handle problems larger than will fit in your laptop's memory, for example, and can distribute computations across a cluster of nodes. Our aim is to include Modin as a training component in future articles.
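As a minimal sketch of what "drop-in" means (Modin is not benchmarked in this article), adoption typically amounts to changing only the import line; the rest of the Pandas code stays the same. The file name below is hypothetical:

import modin.pandas as pd   # instead of: import pandas as pd

# The familiar Pandas API now runs on Modin's parallel engine
df = pd.read_csv("my_large_file.csv")   # hypothetical file
df["new"] = df["c"] * df["d"]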
Let’s start with a contrived example of weird conditional logic applied to the columns of a Pandas data frame, with an expensive log function thrown in for good measure.
import numpy as np

def my_function(x):
    return np.log(1 + x)

def func(a, b, c, d, e):
    if e == 10:
        return c * d
    elif (e < 10) and (e >= 7):
        return my_function(c + d)
    elif e < 7:
        return my_function(a + b + 100)
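The timing cells below assume a data frame df with numeric columns a through e and a timing dictionary for collecting results. A minimal setup sketch, with a hypothetical row count and random data, might look like this:

import time
import numpy as np
import pandas as pd

np.random.seed(42)                        # reproducible synthetic data
N = 1_000_000                             # hypothetical row count
df = pd.DataFrame({
    'a': np.random.rand(N),
    'b': np.random.rand(N),
    'c': np.random.rand(N),
    'd': np.random.rand(N),
    'e': np.random.randint(0, 11, N),     # integers 0..10 so every branch of func() is exercised
})
timing = {}                               # collects elapsed times for the final plot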
Now we use Pandas apply() to establish a timing baseline.
%%time
# naive row-wise method using Pandas apply()
# each row requires a Python-level call to func(), and that per-row interpretation overhead adds up
t1 = time.time()
df['new'] = df.apply(lambda x: func(x['a'], x['b'], x['c'], x['d'], x['e']), axis=1)
t2 = time.time()
print("time : {:5.2f}".format(t2 - t1))
baseTime = t2 - t1
timing['Pandas Apply'] = t2 - t1
df.head()
output:
time : 8.37
CPU times: user 8.36 s, sys: 15.9 ms, total: 8.37 s
Wall time: 8.38 s
Can we achieve better performance? Let's apply another trick called masking. We express the conditional logic as a boolean index, or mask, over the data frame and use a different mask for each condition.
# masked approach: compute the default for every row, then overwrite the masked rows
t1 = time.time()
df['new'] = df['c'] * df['d']   # default case e == 10
mask = (df['e'] < 10) & (df['e'] >= 7)
df.loc[mask, 'new'] = (df['c'] + df['d']).apply(lambda x: my_function(x))
mask = df['e'] < 7
df.loc[mask, 'new'] = (df['a'] + df['b']).apply(lambda x: my_function(x + 100))
t2 = time.time()
print("time :", t2 - t1)
fastest_time = t2 - t1
Speedup = baseTime / fastest_time
print("Speed up: {:4.0f} X".format(Speedup))
timing['unrolled with masks on df'] = t2 - t1
df.head()
output:
time : 1.6978118419647217
Speed up: 5 X
WOW! Masking to the rescue!
Can we accelerate this further? Let's try NumPy's select(). Besides being fast, np.select() cleans the code up considerably:
- Create a list containing your conditions (see the minimal sketch below).
- Create another list containing the choice to use for each condition.
- Call np.select(condlist, choicelist, default=0).
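As a minimal illustration of the API on small, made-up arrays (independent of our data frame), np.select() returns, for each element, the choice belonging to the first condition that is True, or the default when none match:

import numpy as np

e = np.array([10, 9, 3, 7, 2])
condlist   = [e < 7, (e < 10) & (e >= 7)]
choicelist = [e * 100, e + 1]
print(np.select(condlist, choicelist, default=0))
# -> [  0  10 300   8 200]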
# np.select(condlist, choicelist, default=0)
t1 = time.time()
condition = [(df['e'] < 10) & (df['e'] >= 7),
             (df['e'] < 7)]
choice = [(df['c'] + df['d']).apply(lambda x: my_function(x)),
          (df['a'] + df['b']).apply(lambda x: my_function(x + 100))]
default = (df['c'] * df['d'])
df['new'] = np.select(condition, choice, default=default)
t2 = time.time()
print("time :", t2 - t1)
timing['Numpy Select on Pandas df'] = t2 - t1
df.head()
output:
time : 1.684462308883667
Using np.select with Pandas data frame operations yields a speedup of about 5X!
Could we speed it up more if we drop Pandas and go completely to NumPy? Let's try masking again, this time on the underlying NumPy arrays.
# Convert Pandas to NumPy entirely
t1 = time.time()
npArr = df.to_numpy()   # convert the data frame to a NumPy array
idx = {}                # initialize a column-name -> column-index dictionary
for index, value in enumerate(df.columns):
    idx[value] = index
df.loc[:, 'new'] = npArr[:, idx['c']] * npArr[:, idx['d']]   # default case e == 10
mask = (npArr[:, idx['e']] < 10) & (npArr[:, idx['e']] >= 7)
df.loc[mask, 'new'] = my_function(npArr[mask, idx['c']] + npArr[mask, idx['d']])
mask = (npArr[:, idx['e']] < 7)
df.loc[mask, 'new'] = my_function(npArr[mask, idx['a']] + npArr[mask, idx['b']] + 100)
t2 = time.time()
print("time :", t2 - t1)
fastest_time = t2 - t1
Speedup = baseTime / fastest_time
print("Speed up: {:4.0f} X".format(Speedup))
timing['unrolled with masks on NumPy'] = t2 - t1
df.head()
output:
time : 0.06982827186584473
Speed up: 120 X
WOW!! Now we are talking — something over 120X speedup!
The code looks a little messy, though. Can we try the np.select trick again, this time on the NumPy arrays?
# np.select(condlist, choicelist, default=0)
# Convert Pandas to NumPy entirely
t1 = time.time()
npArr = df.to_numpy()   # convert to NumPy
condition = [(npArr[:, idx['e']] < 10) & (npArr[:, idx['e']] >= 7),
             (npArr[:, idx['e']] < 7)]
choice = [(my_function(npArr[:, idx['c']] + npArr[:, idx['d']])),
          (my_function(npArr[:, idx['a']] + npArr[:, idx['b']] + 100))]
tmp = np.select(condition, choice, default=(npArr[:, idx['c']] * npArr[:, idx['d']]))
df.loc[:, 'new'] = tmp
t2 = time.time()
print("time :", t2 - t1)
fastest_time = t2 - t1
Speedup = baseTime / fastest_time
print("Speed up: {:4.0f} X".format(Speedup))
timing['Numpy Select Pure'] = t2 - t1
df.head()
output:
time : 0.041852712631225586
Speed up: 200 X
200X speedup! Wow! This is hundreds of times faster than the Pandas apply() baseline, and the code is cleaner.
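Whenever we rewrite logic for speed, it is worth confirming that the fast path produces the same answers as the slow one. A quick sanity-check sketch (not part of the original benchmark) re-runs the apply() baseline once and compares it to the NumPy result:

# Sketch: re-run the slow baseline once and compare with the fast NumPy result
baseline = df.apply(lambda x: func(x['a'], x['b'], x['c'], x['d'], x['e']), axis=1).to_numpy()
assert np.allclose(baseline, df['new'].to_numpy())   # results agree to numerical precision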
Let’s plot the results.
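A minimal sketch of how the timing dictionary collected above could be turned into a speedup chart (matplotlib is assumed; the bar labels are simply the keys recorded in the cells above):

import matplotlib.pyplot as plt

# Speedup of each approach relative to the Pandas Apply baseline
labels = list(timing.keys())
speedups = [timing['Pandas Apply'] / t for t in timing.values()]

plt.figure(figsize=(10, 4))
plt.bar(labels, speedups)
plt.ylabel('Speedup over Pandas Apply (x)')
plt.xticks(rotation=20, ha='right')
plt.title('Accelerating conditional logic: apply vs. masking vs. np.select')
plt.tight_layout()
plt.show()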
Figure: Speedups for each approach, measured on the Intel Xeon system described below
Next Steps
Try out this code sample using a standard free Intel® Tiber™ AI Cloud account; the code for this article is available on GitHub.
We also encourage you to check out and incorporate Intel's other AI/ML framework optimizations and end-to-end portfolio of tools into your AI workflow, and to learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel's AI Software Portfolio, helping you prepare, build, deploy, and scale your AI solutions.
Intel® Tiber™ AI Cloud System Configuration as tested:
x86_64, CPU op-mode(s): 32-bit, 64-bit, Address sizes: 52 bits physical, 57 bits virtual, Byte Order: Little Endian, CPU(s): 224, On-line CPU(s) list: 0–223, Vendor ID: GenuineIntel, Model name: Intel® Xeon® Platinum 8480+, CPU family: 6, Model: 143, Thread(s) per core: 2, Core(s) per socket: 56, Socket(s): 2, Stepping: 8, CPU max MHz: 3800.0000, CPU min MHz: 800.0000