Intel Labs Accelerates Single-cell RNA-Seq Analysis

Sanchit_Misra · ‎06-07-2022

Sanchit Misra is a senior research scientist and leads the efforts in computational biology/HPC research at Intel Labs. This article is co-authored by Narendra Chaudhary who is also a research scientist and conducts computational biology/HPC research at Intel Labs.

Highlights:

Single-cell RNA-Seq (scRNA-Seq) analysis is an advanced approach for single-cell analysis that enables genetic insights at cellular levels by decoding the gene expression of a single cell.
Intel Labs has accelerated a Scanpy-based pipeline for scRNA-Seq analysis by nearly 40 times over the CPU baseline to achieve analysis of 1.3 million mouse cells in just seven and a half minutes on a single CPU instance on GCP. This is nearly 1.5 times faster than the performance of a single A100 GPU. This has been accomplished via parallel algorithms, several architecture-specific optimizations and enhancements to underlying libraries, such as oneDAL and KatanaGraph, in close collaboration with the respective teams.

A dramatic increase in the resolution of measurement has always revolutionized fields; consider the incredible scientific impact of the invention of the microscope and the telescope. Single-cell analysis is a prime example of a similar revolution unfolding in biology. The human body is made up of nearly 40 trillion cells. Historically, these cells have been examined in bulk, sometimes millions of cells at a time, which cannot capture the differences across cells. Single-cell analysis is a field devoted to the study of the individuality of cells. It is beginning to unravel the mystery of cell differentiation by identifying novel cell types, revealing mechanisms that make these cells different from each other, and demonstrating how cells respond to certain diseases or drugs. This relatively new field is already showing immense potential for biological discoveries with applications ranging from cancer to Covid-19 related research.

The amount of single-cell data is increasing at a rapid pace thanks to the advancement of data measurement technologies. The size of individual datasets is increasing at a similar rate. Analysis of this data typically involves running a data science pipeline. Because the steps of the pipeline are often repeated with changes in parameters, it helps to have an interactive pipeline that can run in near real time.

scRNA-seq Analysis of 1.3 Million Mouse Cells in Just Seven and a Half Minutes on a Single CPU
There are many kinds of single-cell analyses studying various aspects of cell differentiation. Single-cell RNA-seq (scRNA-seq) analysis studies the differences in gene expression profiles across cells. It relies on single-cell RNA sequencing, which is an advanced technique that enables measurement of the gene expression of individual cells.

A typical scRNA-seq analysis workflow begins with a matrix that consists of the expression levels of the genes in each cell. In the data preprocessing steps, noise is filtered out and the data is normalized to obtain the activity of every gene in each individual cell of the dataset. During this step, machine learning is often utilized to correct artifacts from data collection. Subsequently, dimensionality reduction is performed, followed by clustering to group cells with similar genetic activity and visualization of the clusters. With over 800,000 downloads, Scanpy is one of the most widely used toolkits for this analysis.
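As a rough illustration of the filtering and normalization part of this workflow, the sketch below uses plain Python on a tiny made-up count matrix. The function names, the `min_genes` threshold and the `target_sum` value are illustrative assumptions and are not taken from the pipeline described in this article; Scanpy offers comparable operations as `sc.pp.filter_cells`, `sc.pp.normalize_total` and `sc.pp.log1p`.

```python
import math

def filter_cells(counts, min_genes=2):
    """Keep cells (rows) that express at least `min_genes` genes."""
    return [row for row in counts if sum(1 for c in row if c > 0) >= min_genes]

def normalize_log1p(counts, target_sum=10_000.0):
    """Scale each cell to `target_sum` total counts, then apply log(1 + x)."""
    normalized = []
    for row in counts:
        total = sum(row)
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(c * scale) for c in row])
    return normalized

# Toy cells-by-genes matrix with made-up values: 3 cells, 4 genes.
raw = [
    [10, 0, 5, 5],  # cell 0
    [0, 0, 1, 0],   # cell 1: only one expressed gene, so it is filtered out
    [2, 2, 0, 0],   # cell 2
]

kept = filter_cells(raw)           # noise filtering (hypothetical threshold)
processed = normalize_log1p(kept)  # per-cell normalization + log transform
print(len(kept), "cells kept")
```

After this step every remaining cell has the same total count before the log transform, so expression values become comparable across cells regardless of how deeply each cell was sequenced.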

Figure 1: Pipeline showing the steps in analysis of single-cell RNA sequencing data starting from gene activity matrix to visualization of different cell clusters.

For a dataset consisting of 1.3 million mouse brain cells, the pipeline depicted above in Figure 1 would normally take nearly 5 hours on a single CPU instance (n1-highmem-64) on GCP using the off-the-shelf (baseline) Scanpy implementation. For the same pipeline, Nvidia has reported an end-to-end runtime of 686 seconds on a single A100 GPU using Nvidia RAPIDS.

At Intel Labs, we collaborated with the Intel® oneDAL team and Katana Graph to accelerate the pipeline by using better parallel algorithms and tuning the performance to the underlying architecture. While this is still a work in progress, the table and chart below report our current performance and cloud usage costs. These results were recently presented at Intel Investor Day 2022. The whole pipeline can now be finished on the same single CPU instance (n1-highmem-64) on GCP in just 626 seconds. This performance only gets better with the newer n2 instance types running 3rd Generation Intel® Xeon® Scalable Processors (Ice Lake). We also reduced the memory requirement of the pipeline so that we can use the low-memory n2-highcpu-64 instances instead of the high-memory n2-highmem-64 instances. On a single instance of n2-highcpu-64 on GCP, the whole pipeline finishes in just 459 seconds (7.65 minutes). This is nearly 40 times faster than the 5-hour CPU baseline that we started with. This is also nearly 1.5 times faster than Nvidia A100 performance.
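The headline ratios in this paragraph follow directly from the reported runtimes. The short sketch below redoes that arithmetic; it assumes the "nearly 5 hours" baseline is exactly 5 hours, so the first ratio is approximate, and the variable names are only illustrative.

```python
# Runtimes in seconds, as reported in the text above.
baseline_cpu_s = 5 * 3600  # "nearly 5 hours" on n1-highmem-64, assumed to be exactly 5 h
a100_gpu_s = 686           # Nvidia RAPIDS on a single A100 GPU
optimized_n1_s = 626       # optimized pipeline on n1-highmem-64
optimized_n2_s = 459       # optimized pipeline on n2-highcpu-64

def speedup(reference_s, new_s):
    """Return how many times faster `new_s` is than `reference_s`."""
    return reference_s / new_s

speedup_vs_baseline = speedup(baseline_cpu_s, optimized_n2_s)
speedup_vs_a100 = speedup(a100_gpu_s, optimized_n2_s)
minutes_n2 = optimized_n2_s / 60

print(f"n2-highcpu-64 runtime: {minutes_n2:.2f} min")
print(f"Speedup over CPU baseline: {speedup_vs_baseline:.1f}x")
print(f"Speedup over A100: {speedup_vs_a100:.2f}x")
```

With these inputs the ratios come out at roughly 39.2x over the baseline and 1.49x over the A100, which is where the "nearly 40 times" and "nearly 1.5 times" figures come from.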

The speedup and the reduction in memory requirement have resulted in a significant reduction in cloud costs. As seen in the table, running the pipeline on the n2-highcpu-64 instance on GCP costs only $0.29. This is nearly 66 times cheaper than n1-highmem-64 running baseline Scanpy and 2.4 times cheaper than the Nvidia A100 GPU.
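The cost of a run is the instance's hourly price multiplied by the runtime in hours. The sketch below works backwards from the rounded figures quoted in this paragraph, so the derived numbers are only implied approximations and are not the values from Table 1 or from the GCP price list; `run_cost` is a hypothetical helper.

```python
def run_cost(price_per_hour, runtime_s):
    """Cost in dollars of one run lasting `runtime_s` seconds."""
    return price_per_hour * runtime_s / 3600.0

# Figures quoted in the text.
n2_cost = 0.29        # dollars for one run on n2-highcpu-64
n2_runtime_s = 459    # seconds on n2-highcpu-64

# Values implied by the rounded figures above (approximations only).
implied_hourly_price = n2_cost * 3600 / n2_runtime_s
implied_baseline_cost = n2_cost * 66   # "nearly 66 times cheaper"
implied_a100_cost = n2_cost * 2.4      # "2.4 times cheaper"

print(f"Implied n2-highcpu-64 price: ${implied_hourly_price:.2f}/hour")
print(f"Implied baseline run cost:   ${implied_baseline_cost:.2f}")
print(f"Implied A100 run cost:       ${implied_a100_cost:.2f}")
```

Because cost scales linearly with runtime, a shorter run on a cheaper low-memory instance compounds into the large cost ratios reported here.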

Table 1: Execution time and cloud costs for scRNA-seq analysis of 1.3 million mouse brain cells on various GCP instances. The first two columns report published execution time and cloud costs of baseline Scanpy on a single CPU instance (n1-highmem-64) and GPU-accelerated Scanpy on a single GPU instance (a2-highgpu-1g). The last three columns report measured** execution time and cloud costs of CPU-accelerated Scanpy on single instances of two generations of CPU instance types (n1-highmem-64, n2-highmem-64 and n2-highcpu-64).

Figure 2: Execution time and speedup for scRNA-seq analysis of 1.3 million mouse brain cells on various GCP instances. The chart uses (1) published execution time of baseline Scanpy on a single CPU instance (n1-highmem-64) and GPU-accelerated Scanpy on a single GPU instance (a2-highgpu-1g), and (2) measured** execution time of CPU-accelerated Scanpy on single instances of two generations of CPU instance types (n1-highmem-64, n2-highmem-64 and n2-highcpu-64). In addition, the line graph shows the speedup over baseline Scanpy running on an n1-highmem-64 instance.

*As mentioned on this link on May 15, 2022: https://cloud.google.com/compute/vm-instance-pricing
**Test by Intel as of May 25, 2022

How Was the Data Science Pipeline Accelerated?

Detailed below is a brief summary of the steps we took to improve the performance of this pipeline.

To increase the efficiency of data preprocessing, we used a warm file cache and multithreaded the code using Numba, a just-in-time (JIT) compiler. This improved the baseline preprocessing performance by more than 70 times.
We also used the Intel extension for scikit-learn that has efficient implementations of K-means clustering, KNN (K Nearest Neighbor) and PCA (Principal Component Analysis).
Scanpy originally used scikit-learn’s tSNE (t-distributed Stochastic Neighbor Embedding) implementation that was inefficient for Xeon. We achieved nearly 40 times speedup of tSNE by building an efficient implementation through:
- A shared memory parallel implementation of the Barnes-Hut algorithm
- Parallelization of quadtree building, sorting, and summarization steps using Morton codes
Continuing our efforts, we optimized UMAP (Uniform Manifold Approximation and Projection) by:
- Converting the Python source code to C++
- Creating an efficient AVX512/AVX2-based implementation of the pseudorandom number generator
- Using Intel oneAPI Math Kernel Library (MKL) for the eigenvalue computation step
As part of our collaboration, the Katana Graph team built efficient implementations of Louvain and Leiden algorithms which were integrated into the pipeline.
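
The Morton codes mentioned in the tSNE step above can be illustrated with a small sketch. A Morton (Z-order) code interleaves the bits of a point's grid coordinates, so sorting points by this code places points that fall in the same quadtree cell next to each other; contiguous ranges of the sorted array can then be handled independently, which is what makes the tree construction amenable to parallelization. The code below is a minimal single-threaded illustration in plain Python with hypothetical function names, not the implementation used in the accelerated pipeline.

```python
def interleave_bits(x, y, bits=16):
    """Return the Morton (Z-order) code of non-negative integers x and y."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return code

def morton_sort(points, bits=16):
    """Sort 2-D points in [0, 1) x [0, 1) by the Morton code of their grid cell."""
    cells = 1 << bits

    def key(point):
        # Map each coordinate to an integer grid index, clamped to the grid.
        gx = min(int(point[0] * cells), cells - 1)
        gy = min(int(point[1] * cells), cells - 1)
        return interleave_bits(gx, gy, bits)

    return sorted(points, key=key)

# One point per quadrant of the unit square (made-up coordinates).
points = [(0.9, 0.9), (0.1, 0.1), (0.9, 0.1), (0.1, 0.9)]
ordered = morton_sort(points)
print(ordered)
```

The top two bits of each code identify the quadrant at the root of the quadtree, the next two bits identify the sub-quadrant, and so on, so each level of the tree corresponds to a prefix of the sorted codes.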

These developments significantly reduce the time it takes to analyze large datasets, allowing researchers to complete their work nearly 40 times faster than with the CPU baseline and nearly 1.5 times faster than on an Nvidia A100 GPU.

Conclusions

Single-cell analysis has applications in many areas: oncology, microbiology, neurology, reproduction, immunology, digestive and urinary systems. Hopefully, reduced working time will allow for a much deeper understanding of different cells, paving the way for medical advances that could have great collective benefits. We are working on further refining the scRNA-seq analysis pipeline. Specifically, our efforts are focused on making further improvements in tSNE, UMAP, and Leiden.

Configuration Details

GCP n1-highmem-64: 1-instance GCP n1-highmem-64: 64 vCPUs (Skylake), 416 GB total memory, bios: Google, ucode: 0x1, Ubuntu 20.04, 5.13.0-1024-gcp

GCP n2-highmem-64: 1-instance GCP n2-highmem-64: 64 vCPUs (Ice Lake), 512 GB total memory, bios: Google, ucode: 0x1, Ubuntu 20.04, 5.13.0-1024-gcp

GCP n2-highcpu-64: 1-instance GCP n2-highcpu-64: 64 vCPUs (Ice Lake), 64 GB total memory, bios: Google, ucode: 0x1, Ubuntu 20.04, 5.13.0-1024-gcp