
Clustering Time Series with PCA and DBSCAN

Ramya_Ravi

Accelerating These Algorithms Using Intel® Extension for Scikit-learn*

 

This article was originally published on medium.com.

Posted on behalf of: Bob Chesebrough

 

In this article, we’ll explore the clustering of time series data using Principal Component Analysis (PCA) for dimensionality reduction and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) for clustering. This technique identifies patterns in time series data, such as traffic flow in a city, without requiring labeled data. We’ll be using Intel Extension for Scikit-learn to accelerate performance.

Time series data often exhibit repetitive patterns due to human behavior, machinery, or other measurable sources. Identifying these patterns manually can be challenging. Unsupervised learning approaches like PCA and DBSCAN enable us to discover these patterns.

 

Methodology

 

Data Generation

 

We generate synthetic waveform data to simulate time series patterns. The data consists of three distinct waveforms, each with added noise to simulate real-world variability. We’ll use the scikit-learn agglomerative clustering example authored by Gael Varoquaux (Figure 1). It is available under BSD 3-Clause or CC-0 licenses.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
n_features = 2000
t = np.pi * np.linspace(0, 1, n_features)

def sqr(x):
    # Square wave with the same zero crossings as cos(x)
    return np.sign(np.cos(x))

X = []
y = []
# Three waveform classes, each defined by a (phase, amplitude) pair;
# generate 30 noisy examples of each
for i, (phi, a) in enumerate([(0.5, 0.15), (0.5, 0.6), (0.3, 0.2)]):
    for _ in range(30):
        phase_noise = 0.01 * np.random.normal()
        amplitude_noise = 0.04 * np.random.normal()
        # Sparse spike noise: zero out all but the largest-magnitude values
        additional_noise = 1 - 2 * np.random.rand(n_features)
        additional_noise[np.abs(additional_noise) < 0.997] = 0
        X.append(12 * ((a + amplitude_noise) * sqr(6 * (t + phi + phase_noise)) + additional_noise))
        y.append(i)

X = np.array(X)
y = np.array(y)

plt.figure()
plt.axes([0, 0, 1, 1])
colors = ["#f7bd01", "#377eb8", "#f781bf"]
# Label only the first line in each group so the legend shows one entry
# per waveform rather than one entry for each of the 30 lines
for l, color in zip(range(3), colors):
    lines = plt.plot(X[y == l].T, c=color, alpha=0.5)
    lines[0].set_label(f'Waveform {l + 1}')
plt.legend(loc='best')
plt.title('Unlabeled Data')
plt.show()

 


 

Figure 1. Code and plot generated by the author from the scikit-learn agglomerative clustering example developed by Gael Varoquaux

 

Accelerating PCA and DBSCAN Using Intel Extension for Scikit-learn

 

Both PCA and DBSCAN can be accelerated via a patching scheme using Intel Extension for Scikit-learn. Scikit-learn (often referred to as sklearn) is a Python module for machine learning (ML). Intel Extension for Scikit-learn is one of Intel’s AI tools that seamlessly accelerates scikit-learn applications on Intel CPUs and GPUs in single- and multi-node configurations. It dynamically patches scikit-learn estimators to improve ML training and inference by up to 100x while maintaining equivalent mathematical accuracy (Figure 2).

 


Figure 2. Intel Extension for Scikit-learn GitHub Repository

 

Intel Extension for Scikit-learn uses the scikit-learn API and can be enabled from the command line or by modifying a couple of lines of your Python application before scikit-learn is imported:

from sklearnex import patch_sklearn
patch_sklearn()  # must run before the scikit-learn modules you want to accelerate are imported
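
The extension can also be enabled from the command line, with no source changes, by launching your script through the sklearnex module (my_application.py below is a placeholder name):

python -m sklearnex my_application.py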

 

Dimensionality Reduction Using PCA

 

Before attempting to cluster 90 samples, each containing 2,000 features, we use PCA to reduce the dimensionality to four principal components, which retain about 99% of the variance in the dataset (as the explained variance ratio printed below confirms):

from sklearn.decomposition import PCA

# Project the 90 x 2,000 waveform matrix onto its first four principal components
pca = PCA(n_components=4)
XPC = pca.fit_transform(X)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Singular values:", pca.singular_values_)
print("Shape of XPC:", XPC.shape)  # (90, 4)

We use a pairplot to look for visible clusters in the reduced data (Figure 3):

import pandas as pd
import seaborn as sns

df = pd.DataFrame(XPC, columns=['PC1', 'PC2', 'PC3', 'PC4'])
sns.pairplot(df)
plt.show()

 

 


 

Figure 3. Looking for clusters in the data after dimensionality reduction

 

Clustering with DBSCAN

 

Based on the pairplot, PC1 and PC2 seem to separate the clusters well, so we’ll use these two components for DBSCAN clustering. The pairplot also gives us an estimate of DBSCAN’s eps parameter: the PC1 vs. PC2 panel suggests that 50 is a reasonable separation distance between the observed clusters:

from sklearn.cluster import DBSCAN

# Cluster on the first two principal components only
clustering = DBSCAN(eps=50, min_samples=3).fit(XPC[:, [0, 1]])
labels = clustering.labels_  # -1 would mark noise points; 0, 1, 2 are clusters

print("Cluster labels:", labels)

We can plot the clustered data to see how well DBSCAN has identified the clusters (Figure 4):

plt.figure()
plt.axes([0, 0, 1, 1])
colors = ["#f7bd01", "#377eb8", "#f781bf"]
# Label only the first line in each cluster so the legend shows one
# entry per cluster rather than one per waveform
for l, color in zip(range(3), colors):
    lines = plt.plot(X[labels == l].T, c=color, alpha=0.5)
    lines[0].set_label(f'Cluster {l + 1}')
plt.legend(loc='best')
plt.title('PCA + DBSCAN')
plt.show()

 


Figure 4. Plot of clustered data generated using the previous code example

 

Comparing with Ground Truth

 

As Figure 4 shows, DBSCAN does a good job of finding plausible clusters, and they compare well with the ground truth in Figure 1: in this case, the clustering recovered the underlying patterns used to generate the data exactly. By using PCA for dimensionality reduction and DBSCAN for clustering, we can effectively identify and label patterns in time series data, discovering the underlying structure without the need for labeled samples.
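
Because the data are synthetic, the generating labels y are still available, so the visual comparison can be backed by a quantitative one. A minimal sketch using scikit-learn’s adjusted Rand index, where a score of 1.0 means the clustering matches the ground truth up to a relabeling of cluster ids:

from sklearn.metrics import adjusted_rand_score

# Invariant to how the cluster ids are numbered
print("Adjusted Rand index:", adjusted_rand_score(y, labels))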

 

What’s Next 

 

We encourage you to check out Intel’s other AI/ML framework optimizations and tools and incorporate them into your AI workflow. Also learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio, which helps you prepare, build, deploy, and scale your AI solutions.

 


 

About the Author
Product Marketing Engineer bringing cutting-edge AI/ML solutions and tools from Intel to developers.