Python* Loop Replacement: NumPy* & PyTorch* Optimizations and Other Tensor Topics

Ramya_Ravi · ‎07-08-2024

Simple Stuff — N Dimensional (ND) array creation using NumPy, PyTorch, Data Parallel Control (dpctl)

This article was originally published on medium.com.

Posted on behalf of: Bob Chesebrough, Solutions Architect, Intel Corporation

This article demonstrates how and why users might want to convert to and from NumPy ND arrays.

To start, ND arrays are a gateway into fast optimizations provided under the under the hood via Intel® oneAPI Math Kernel Library (oneMKL), through NumPy ufuncs, aggregations, broadcasting, fancy slicing, sorting and so on. If you are not using these methods with NumPy arrays or equivalent PyTorch ones your progress is limited much like that of a person “running” in chest deep water. The goal is to show you how to get dry with NumPy arrays in preparation for running with NumPy ufuncs, aggregations and the like in the rest of these articles.

ND array from a python list

Below is a quick way to populate a ND array with a series of numbers from a pre-populated python list. It is likely that much of your legacy code is repeat with python lists. Here, I assume to preserve the random values generated by Python Random Uniform for testing purposes. The below code snippet shows how to convert them.

import random
 N = 100_000_000 # Number of records to process
 random.seed(42)
 L = [random.uniform(0,100) for _ in range(N)]
 …
 a = np.array(L)


 output: [63.942679845788376, 2.5010755222666936, 27.502931836911927, 
 22.321073814882276,…]

I urge you to strive for converting python lists early in the code and use the NumPy array through the rest of your code.

ND random array from scratch

Observe that the whole process of creating a list and then converting it can be a bit slow. How can I create a random ND array of the right size immediately?

import numpy as np
 np.random.seed(42)
 a = np.random.uniform(low=0, high=100, size=(N,))

 output: array([37.45401188, 95.07143064, 73.19939418, …, 37.52065796,
 59.14032368, 35.31581811])

Notice that even though I specified the same seed value, 42, for random as well as np.random, I get different values assigned!

Create a zero filled ND array of a specified size

One of the key reasons NumPy is faster than numerical approaches in Python is that a data type is assigned at array creation time, and that data type can be specified (for example: float, int32).

np.empty([N, 2])
# or
np.ndarray(shape=(N,2), dtype=float)
# or
np.zeros([N, 2])

  output: array([[0., 0.],
  [0., 0.],
  [0., 0.],
  …,
  [0., 0.],
  [0., 0.],
  [0., 0.]])

Create a filled ND array of a specified size

Similar to the above, you can create an array and fill it with ones or fill it easily with any constant using extensions. Also, you can create an “Identity matrix” very easily.

np.ones([N, 2])

 output: array([[1., 1.],
 [1., 1.],
 [1., 1.], 
 …,
 [1., 1.],
 [1., 1.],
 [1., 1.]])

 np.ones([N,2])*np.pi

 output: array([[3.14159265, 3.14159265],
 [3.14159265, 3.14159265],
 [3.14159265, 3.14159265],
 …,
 [3.14159265, 3.14159265],
 [3.14159265, 3.14159265],
 [3.14159265, 3.14159265]])

 np.eye(100)

 output: array([[1., 0., 0., …, 0., 0., 0.],
 [0., 1., 0., …, 0., 0., 0.],
 [0., 0., 1., …, 0., 0., 0.],
 …,
 [0., 0., 0., …, 1., 0., 0.],
 [0., 0., 0., …, 0., 1., 0.],
 [0., 0., 0., …, 0., 0., 1.]])

Fill an array with a sequence of values

Now, we use the np.linspace to fill a vector with a sequence of values.

np.linspace(0, 10, num=10 + 1) # create a list os floats 0.0, 1.0, 2.0 …

 output: array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])

 # prepare sine wave, one period, 13 points
 N =12
 np.sin(2* np.pi * np.linspace(0, N, num=N + 1)/N) 

 output: array([ 0.00000000e+00, 5.00000000e-01, 8.66025404e-01, 1.00000000e+00,
 8.66025404e-01, 5.00000000e-01, 1.22464680e-16, -5.00000000e-01,
 -8.66025404e-01, -1.00000000e+00, -8.66025404e-01, -5.00000000e-01,
 -2.44929360e-16])

 # prepare a vector of values sampled within a given function
 def f(x):
 return 2*x**2 -5*x
 N = 100_000_000
 f( np.linspace(0, N, num=2*N + 1))

Replace loops with NumPy equivalent functions

Beware — replace the functionality of the **loop** not the functionality of the functions within the loop.

Replace the following loop and create a vector of sin values of a range of 10,000 elements.

# Simple minded brute force loop method
 a = []
 for i in range(10_000):
 a.append(a, np.sin(i))

 # SILLY — just replace the python append with NumPy append: BAD IDEA!
 for i in range(10_000): # oops did NOT replace the loop!
 a = np.append(a, np.sin(i)) #SLOWER than code above

 # Best: replace the entire loop!
 t = np.linspace(0, 10_000, num=10_000 + 1) 
 a= np.sin(t) # more readible, MANY times faster

Reshape and transpose arrays

Reshaping and transposing a matrix can be applied without using loops.

L = [1,2,3,4,5,6,7,8] # original list
 print(“Original List\n”,L)
 a = np.array(L).reshape((4,2)) # transformed into 4x2 ND array
 print(“reshape array tp have 4 rows, 2 columns\n”,a) # 4 rows x 2 columns
 print(“Compute Transpose\n”,a.T) # 2 columns x 4 rows

Expanding dimensions and squeezing dimensions helps to make the arrays multiply or add together properly.

np.array(L).shape
 print(“Dimensions of a\n”, np.array(L).shape) # dimensions of a
 print(“Expand dimensions\n”, np.expand_dims(np.array(L), axis = 1).shape) #epand
dimensions
 print(“squeeze empty dimensions\n”, np.expand_dims(np.array(L), axis =
1).squeeze().shape) #squeeze dimensions

Convert to and from PyTorch tensor

a = [1, 2, 3]
b = torch.Tensor(a) # better for neural nets

 output: tensor([1., 2., 3.])



a = [1, 2, 3]
b = torch.FloatTensor(a) # better for neural nets

 output: tensor([1., 2., 3.])


a = [1, 2, 3]
b = torch.IntTensor(a)

 output: tensor([1, 2, 3], dtype=torch.int32)


a = [1, 2, 3]
b = torch.Tensor(a).to(torch.int32)

 output: tensor([1, 2, 3], dtype=torch.int32)

a = [1, 2, 3]
b = torch.Tensor(a).to(torch.half)

 outputL tensor([1., 2., 3.], dtype=torch.float16)


a = [1, 2, 3]
b = torch.Tensor(a).to(torch.bfloat16) # Best used on HW that supports it natively

 tensor([1., 2., 3.], dtype=torch.bfloat16)

Convert a PyTorch tensor on an XPU to a NumPy ND array on the CPU

x_np = x.detach().cpu() # detach gets rid of autograd, # cpu specifies destination

Convert to and from Intel dpctl tensor

The dpctl library is a library from Intel used to create a dpctl tensor that can be used with with Data Parallel Python*. Use this for heterogeneous programming using Data Parallel Extension for Numba* for AI and HPC and compute “follows data” paradigm with certain optimized algorithms in Intel® Extension for Scikit-learn* accelerate these algorithms on Intel XPUs.

The dpctl conversion to and from NumPy (or from a python list) is easy.

 import dpctl
 X = np.array([[1, 2], [1, 4], [1, 0],
 [10, 2], [10, 4], [10, 0]]) 
 device = dpctl.select_default_device() # set device to XPU if one is available
 x_device = dpctl.tensor.asarray(X, device = device) # data now on XPU in variable X_device
 #calculation done on x_device are done on XPU

To convert dpctl tensor back to ND array is simple

#… calculations done with Intel Extensions for scikit-learn 
# assume function returns dpctl tensor called predictedGPU
dpctl.tensor.to_numpy(predictedGPU) # converts tensor to NumPy ND array

The above code snippets shows the basics of NumPy arrays, how to create them and manipulate them, convert them to and from PyTorch and dpctl tensors for use in Data Parallel Extension for Numba* and Intel Extension for Scikit-learn.

Check out the code for this article and the rest of the series is located on GitHub.

Next Steps

Try out this code sample using the standard free Intel® Tiber™ Developer Cloud account and the ready-made Jupyter Notebook.

We encourage you to also check out and incorporate Intel’s other AI/ML Framework optimizations and end-to-end portfolio of tools into your AI workflow and learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio to help you prepare, build, deploy, and scale your AI solutions.

Intel Developer Cloud System Configuration as tested:

x86_64, CPU op-mode(s): 32-bit, 64-bit, Address sizes: 52 bits physical, 57 bits virtual, Byte Order: Little Endian, CPU(s): 224, On-line CPU(s) list: 0–223, Vendor ID: GenuineIntel, Model name: Intel(R) Xeon(R) Platinum 8480+, CPU family: 6, Model: 143, Thread(s) per core: 2, Core(s) per socket: 56, Socket(s): 2
Stepping: 8, CPU max MHz: 3800.0000, CPU min MHz: 800.0000