
Introduction to NumPy* Universal Functions (ufuncs)

Ramya_Ravi
Employee

How I learned to stop worrying and let smart developers help me

This article was originally published on medium.com.

Posted on behalf of:

Bob Chesebrough, Solutions Architect, Intel Corporation

What are Universal Functions?

In NumPy, Universal Functions (ufuncs) are functions that operate on NumPy ND arrays in a conceptually element-by-element fashion, supporting array broadcasting, type casting, and several other standard features. In reality, many ufuncs are accelerated via a mechanism called vectorization: the application of single instruction, multiple data (SIMD) registers and hardware-level instructions, perhaps in combination with high-level memory optimizations provided by a library or vectorizing C compiler, all baked into a pre-built stock version of NumPy*. Read my article “Accelerate Numerical Calculations in NumPy With Intel oneAPI Math Kernel Library” for more detail on this.
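As a quick illustration (a minimal sketch of my own, with made-up values), here is a ufunc operating element by element on an array, and the "+" operator dispatching to a ufunc with a scalar broadcast across the array:

```python
import numpy as np

a = np.array([1.0, 4.0, 9.0])

b = np.sqrt(a)   # ufunc applied element by element
c = a + 10.0     # "+" dispatches to np.add, broadcasting the scalar across a

print(b)  # [1. 2. 3.]
print(c)  # [11. 14. 19.]
```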

Let’s Explore Some Examples of “ufuncs”

Universal Functions span a wide variety of commonly used functions: math “transcendental” functions such as sine, cosine, and tangent; their hyperbolic counterparts; inverse trigonometric and inverse hyperbolic functions; and logarithm- and exponential-related functions, just to get started.
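For example, a few of these transcendental ufuncs applied element by element (my own small sketch, with values chosen to make the answers easy to check):

```python
import numpy as np

x = np.array([0.0, np.pi / 2, np.pi])
sines = np.sin(x)                     # element-wise sine
exps = np.exp(np.array([0.0, 1.0]))   # element-wise e**x
logs = np.log(np.array([1.0, np.e]))  # element-wise natural log

print(sines)  # sin(pi) comes out as ~1e-16 in floating point, not exactly 0
print(exps)   # [1.         2.71828183]
print(logs)   # [0. 1.]
```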


Math operations are included too, such as reciprocal, matmul, and many, many more. Some of these functions can be invoked as “operators”, such as “+” invoking addition or “-” invoking subtraction, and so on. Functions can be daisy chained too, such as:

np.reciprocal(np.sqrt(a))

Many bit-twiddling functions are included to make it fast and efficient to perform bitwise operations on one or two vectors: for example, bitwise AND, bitwise OR, bit shifts, exclusive OR, and so on.
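A short sketch of these bit-twiddling ufuncs, with values of my own choosing for illustration:

```python
import numpy as np

x = np.array([0b1100, 0b1010], dtype=np.uint8)  # [12, 10]
y = np.array([0b1010, 0b0110], dtype=np.uint8)  # [10, 6]

anded = np.bitwise_and(x, y)   # element-wise AND: [ 8  2]
ored = np.bitwise_or(x, y)     # element-wise OR:  [14 14]
xored = np.bitwise_xor(x, y)   # element-wise XOR: [ 6 12]
shifted = np.left_shift(x, 1)  # each element shifted left one bit: [24 20]

print(anded, ored, xored, shifted)
```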

Similarly, logical functions compare two vectors element by element to identify the positions where vector A is greater than or less than vector B, where the two vectors are equal, or to apply a logical NOT to a single vector.

Floating point tests are included as well, such as checking which elements of a vector are finite, which are “Not a Number” (NaN), and which are infinite (inf).
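The comparison and floating point test ufuncs just described can be sketched like this (my own example values):

```python
import numpy as np

a = np.array([1.0, 5.0, np.nan, np.inf])
b = np.array([1.0, 3.0, 2.0, 1.0])

gt = np.greater(a, b)    # where a > b (any comparison involving NaN is False)
eq = np.equal(a, b)      # where a == b
nans = np.isnan(a)       # where the NaNs are located
finite = np.isfinite(a)  # where values are neither NaN nor infinite

print(gt)      # [False  True False  True]
print(eq)      # [ True False False False]
print(nans)    # [False False  True False]
print(finite)  # [ True  True False False]
```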

Table 1 shows a partial summary of functions included as ufuncs.

Table 1. Partial summary of Universal Functions

In the code below, we see some simple-minded code to round each element of a Python list of floats:

import random
random.seed(42)
num = 10_000_000
a = []
for i in range(num):  # assume we are given a random list, so don't time this
    a.append(random.random()*100)  # don't time this

b = []
for i in range(num):
    b.append(round(a[i], 5))

Compare the above code to the faster and more succinct list comprehension below:

b = [round(x,5) for x in a] # assume we are given a random list

Note that in all of the following code examples, I convert the Python list to a NumPy array. In actual practice, this conversion happens one time near the top of the code, and then we freely reuse the NumPy arrays for all manipulations after that; the speedups attained are even more impressive when we take this into consideration. Now compare to the NumPy equivalent below:

b = np.round(np.array(a), 5)  # assume we start with a python list

The acceleration gained from using NumPy to replace the loop is remarkable, and the readability of the NumPy code is on par with that of the list comprehension. See Figure 1.
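A hypothetical timing harness along these lines (a sketch of my own; the list size and the exact timings will differ from those measured in the figures):

```python
import random
import time

import numpy as np

random.seed(42)
a = [random.random() * 100 for _ in range(1_000_000)]

t0 = time.perf_counter()
b_list = [round(x, 5) for x in a]       # list comprehension version
t_list = time.perf_counter() - t0

a_np = np.array(a)  # one-time conversion; done once near the top in practice
t0 = time.perf_counter()
b_np = np.round(a_np, 5)                # ufunc version
t_np = time.perf_counter() - t0

print(f"list comprehension: {t_list:.4f} s, NumPy round: {t_np:.4f} s")
```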

Figure 1. Comparison of wall clock time for different methods of rounding a list of values. Measured on a shared node on the Intel® Developer Cloud, Ubuntu 22 (Intel® Xeon® Platinum 8480L, 224 CPUs, 503 GB RAM).

Next, I’ll examine composition of functions, also called nesting or daisy chaining. Consider computing the reciprocal of the square root of a value, a common operation in physics and engineering. Compare the following three approaches for readability and speed:

Naive Python Loop:

import math
b = []
for i in range(len(a)):
    b.append(1.0/math.sqrt(a[i]))

Python List Comprehension:

b = [1.0/math.sqrt(x) for x in a]

NumPy composed functions of reciprocal and sqrt:

b = np.reciprocal(np.sqrt(a))
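One caveat worth noting (my addition, not from the original article): np.reciprocal performs integer arithmetic on integer arrays, so the input should be floating point. The composed form above is safe because np.sqrt already returns a float array:

```python
import numpy as np

ints = np.array([2, 4])
r_int = np.reciprocal(ints)                  # integer dtype truncates: [0 0]
r_float = np.reciprocal(ints.astype(float))  # [0.5  0.25]
r_comp = np.reciprocal(np.sqrt(ints))        # sqrt promotes to float64 first

print(r_int, r_float, r_comp)
```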

In Figure 2, we see the acceleration provided by combining NumPy ufuncs over the naive loop and list comprehension methods.

Figure 2. Nested NumPy ufuncs are faster than the naive loop and list comprehension methods. Measured on a shared node on the Intel Developer Cloud, Ubuntu 22 (Intel® Xeon® Platinum 8480L, 224 CPUs, 503 GB RAM).

Computing AI-focused operations such as mean and std

Below is code to compute the mean and std of an array by rolling my own Python loops. First, the naive loop:

S = 0
for i in range(len(a)):  # compute the sum
    S += a[i]
mean = S/len(a)  # compute the mean
std = 0
for i in range(len(a)):  # accumulate squared deviations from the mean
    d = a[i] - mean
    std += d*d
std = np.sqrt(std/len(a))  # compute the std

Below is the NumPy approach. Compare how much more readable and maintainable the NumPy code is. It is also faster on existing hardware and will likely be faster on future hardware or software library updates.

print(a.mean())
print(a.std())
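To sanity-check the two approaches against each other, here is a small worked example of my own, with values chosen so that the answers come out exact:

```python
import numpy as np

a = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# hand-rolled loop version
S = 0.0
for x in a:
    S += x
mean = S / len(a)          # 40 / 8 = 5.0
acc = 0.0
for x in a:
    d = x - mean
    acc += d * d           # sum of squared deviations = 32.0
std = np.sqrt(acc / len(a))  # sqrt(4.0) = 2.0

print(mean, std)           # 5.0 2.0
print(a.mean(), a.std())   # 5.0 2.0, matching the loop version
```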

Figure 3. Comparison of computing mean and std using a naive loop versus using NumPy. Measured on a shared node on the Intel Developer Cloud, Ubuntu 22 (Intel® Xeon® Platinum 8480L, 224 CPUs, 503 GB RAM).

How can you determine if the Intel® oneAPI Math Kernel Library (oneMKL) is accelerating your NumPy?

First of all, the stock version of NumPy has the Intel oneMKL optimizations built in already. But if you wish to double-check, run this Python command:

np.show_config()

You should see output such as this — revealing the oneAPI libraries referenced by NumPy:

lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/glob/development-tools/versions/oneapi/2023.1.2/oneapi/pytorch/1.13.10.2/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/glob/development-tools/versions/oneapi/2023.1.2/oneapi/pytorch/1.13.10.2/include']

Many more such examples could be provided, but my hope is that I have motivated the use of ufuncs sufficiently for you to try them.

Get the code for this article (Jupyter notebook: 08_02_NumPy_UFUNCS.ipynb) and the rest of the series on GitHub.

Next Steps

Try out this code sample using the standard free Intel Developer Cloud account and the ready-made Jupyter Notebook.

We encourage you to also check out and incorporate Intel’s other AI/ML Framework optimizations and end-to-end portfolio of tools into your AI workflow and learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio to help you prepare, build, deploy, and scale your AI solutions. 

Intel Developer Cloud System Configuration as tested:

x86_64, CPU op-mode(s): 32-bit, 64-bit, Address sizes: 52 bits physical, 57 bits virtual, Byte Order: Little Endian, CPU(s): 224, On-line CPU(s) list: 0–223, Vendor ID: GenuineIntel, Model name: Intel® Xeon® Platinum 8480+, CPU family: 6, Model: 143, Thread(s) per core: 2
Core(s) per socket: 56, Socket(s): 2, Stepping: 8, CPU max MHz: 3800.0000, CPU min MHz: 800.0000

About the Author
Product Marketing Engineer bringing cutting edge AI/ML solutions and tools from Intel to developers.