
Clustering Time Series with PCA and DBSCAN

Ramya_Ravi

Accelerating These Algorithms Using Intel® Extension for Scikit-learn*

 

This article was originally published on medium.com.

Posted on behalf of: Bob Chesebrough

 

In this article, we’ll explore the clustering of time series data using Principal Component Analysis (PCA) for dimensionality reduction and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) for clustering. This technique identifies patterns in time series data, such as traffic flow in a city, without requiring labeled data. We’ll be using Intel Extension for Scikit-learn to accelerate performance.

Time series data often exhibit repetitive patterns due to human behavior, machinery, or other measurable sources. Identifying these patterns manually can be challenging. Unsupervised learning approaches like PCA and DBSCAN enable us to discover these patterns.

 

Methodology

 

Data Generation

 

We generate synthetic waveform data to simulate time series patterns. The data consists of three distinct waveforms, each with added noise to simulate real-world variability. We’ll use the scikit-learn agglomerative clustering example authored by Gael Varoquaux (Figure 1). It is available under BSD 3-Clause or CC-0 licenses.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
n_features = 2000
t = np.pi * np.linspace(0, 1, n_features)

def sqr(x):
    # Square wave with the same zero crossings as cos(x)
    return np.sign(np.cos(x))

X = []
y = []
# Three waveform classes, each defined by a (phase, amplitude) pair;
# generate 30 noisy examples of each
for i, (phi, a) in enumerate([(0.5, 0.15), (0.5, 0.6), (0.3, 0.2)]):
    for _ in range(30):
        phase_noise = 0.01 * np.random.normal()
        amplitude_noise = 0.04 * np.random.normal()
        # Sparse spike noise: zero out all but the largest-magnitude values
        additional_noise = 1 - 2 * np.random.rand(n_features)
        additional_noise[np.abs(additional_noise) < 0.997] = 0
        X.append(12 * ((a + amplitude_noise) * sqr(6 * (t + phi + phase_noise)) + additional_noise))
        y.append(i)

X = np.array(X)
y = np.array(y)

plt.figure()
plt.axes([0, 0, 1, 1])
colors = ["#f7bd01", "#377eb8", "#f781bf"]
# Label only the first line in each group so the legend shows one entry
# per waveform rather than one entry for each of the 30 lines
for l, color in zip(range(3), colors):
    lines = plt.plot(X[y == l].T, c=color, alpha=0.5)
    lines[0].set_label(f'Waveform {l + 1}')
plt.legend(loc='best')
plt.title('Unlabeled Data')
plt.show()

 


 

Figure 1. Code and plot generated by the author from the scikit-learn agglomerative clustering example developed by Gael Varoquaux

 

Accelerating PCA and DBSCAN Using Intel Extension for Scikit-learn

 

Both PCA and DBSCAN can be accelerated via a patching scheme using Intel Extension for Scikit-learn. Scikit-learn (often referred to as sklearn) is a Python module for machine learning (ML). Intel Extension for Scikit-learn is one of Intel’s AI tools that seamlessly accelerates scikit-learn applications on Intel CPUs and GPUs in single- and multi-node configurations. It dynamically patches scikit-learn estimators to improve ML training and inference by up to 100x while maintaining equivalent mathematical accuracy (Figure 2).

 


Figure 2. Intel Extension for Scikit-learn GitHub Repository

 

Intel Extension for Scikit-learn uses the scikit-learn API and can be enabled from the command line or by modifying a couple of lines of your Python application before scikit-learn is imported:

from sklearnex import patch_sklearn
patch_sklearn()  # must run before the scikit-learn modules you want to accelerate are imported
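
The extension can also be enabled from the command line, with no source changes, by launching your script through the sklearnex module (my_application.py below is a placeholder name):

python -m sklearnex my_application.py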

 

Dimensionality Reduction Using PCA

 

Before attempting to cluster 90 samples, each containing 2,000 features, we use PCA to reduce the dimensionality to four principal components, which retain about 99% of the variance in the dataset (as the explained variance ratio printed below confirms):

from sklearn.decomposition import PCA

# Project the 90 x 2,000 waveform matrix onto its first four principal components
pca = PCA(n_components=4)
XPC = pca.fit_transform(X)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Singular values:", pca.singular_values_)
print("Shape of XPC:", XPC.shape)  # (90, 4)

We use a pairplot to look for visible clusters in the reduced data (Figure 3):

import pandas as pd
import seaborn as sns

df = pd.DataFrame(XPC, columns=['PC1', 'PC2', 'PC3', 'PC4'])
sns.pairplot(df)
plt.show()

 

 


 

Figure 3. Looking for clusters in the data after dimensionality reduction

 

Clustering with DBSCAN

 

Based on the pairplot, PC1 and PC2 seem to separate the clusters well, so we’ll use these two components for DBSCAN clustering. The pairplot also gives us an estimate of DBSCAN’s eps parameter: the PC1 vs. PC2 panel suggests that 50 is a reasonable separation distance between the observed clusters:

from sklearn.cluster import DBSCAN

# Cluster on the first two principal components only
clustering = DBSCAN(eps=50, min_samples=3).fit(XPC[:, [0, 1]])
labels = clustering.labels_  # -1 would mark noise points; 0, 1, 2 are clusters

print("Cluster labels:", labels)

We can plot the clustered data to see how well DBSCAN has identified the clusters (Figure 4):

plt.figure()
plt.axes([0, 0, 1, 1])
colors = ["#f7bd01", "#377eb8", "#f781bf"]
# Label only the first line in each cluster so the legend shows one
# entry per cluster rather than one per waveform
for l, color in zip(range(3), colors):
    lines = plt.plot(X[labels == l].T, c=color, alpha=0.5)
    lines[0].set_label(f'Cluster {l + 1}')
plt.legend(loc='best')
plt.title('PCA + DBSCAN')
plt.show()

 


Figure 4. Plot of clustered data generated using the previous code example

 

Comparing with Ground Truth

 

As Figure 4 shows, DBSCAN does a good job of finding plausible clusters, and they compare well with the ground truth in Figure 1: in this case, the clustering recovered the underlying patterns used to generate the data exactly. By using PCA for dimensionality reduction and DBSCAN for clustering, we can effectively identify and label patterns in time series data, discovering the underlying structure without the need for labeled samples.
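
Because the data are synthetic, the generating labels y are still available, so the visual comparison can be backed by a quantitative one. A minimal sketch using scikit-learn’s adjusted Rand index, where a score of 1.0 means the clustering matches the ground truth up to a relabeling of cluster ids:

from sklearn.metrics import adjusted_rand_score

# Invariant to how the cluster ids are numbered
print("Adjusted Rand index:", adjusted_rand_score(y, labels))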

 

What’s Next 

 

We encourage you to check out Intel’s other AI/ML framework optimizations and tools and incorporate them into your AI workflow. Also learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio, which helps you prepare, build, deploy, and scale your AI solutions.

 


 

About the Author
Product Marketing Engineer bringing cutting-edge AI/ML solutions and tools from Intel to developers.