Intel Xeon is all you need for AI inference: Performance Leadership on Real World Applications

Sanchit_Misra · 07-19-2023

Sanchit Misra is a senior research scientist and leads the efforts in computational biology/HPC research at Intel Labs.

Highlights:

Intel is democratizing AI inference by delivering better price and performance for real-world use cases on the 4th gen Intel® Xeon® Scalable Processors, formerly codenamed Sapphire Rapids. In this article, Intel® CPU refers to 4th gen Intel® Xeon® Scalable Processors.
For protein folding of a set of proteins of lengths less than a thousand, using the end-to-end inference pipeline based on DeepMind’s AlphaFold2, a dual-socket Intel® CPU node delivers 30% better performance than an Intel® CPU with an A100 offload.
For Google’s DeepVariant pipeline, 8 dual-socket Intel® CPU nodes outperform a DGX A100 system with 8 A100 GPUs running Nvidia Clara Parabricks by nearly 1.9 times.
Intel’s versions of the above two pipelines are available through the Intel Open Omics Acceleration Framework for the community to replicate these results.

Performance benchmarks are often used to measure progress in a field. However, they do not always tell the full story; a single artificial intelligence benchmark captures only a small part of what matters in real-world use cases. For example, DeepMind’s AlphaFold2 has achieved near-experimental accuracy in predicting protein 3D structures. Prior to AlphaFold2, the structures of only about 190 thousand proteins were known, but thanks to AlphaFold2, DeepMind has released the structures of nearly 200 million proteins. This represents one of the most revolutionary advances in applying AI to real-world problems for better human health, and it paves the way for the use of generative AI in protein design and drug discovery. The deep learning component of AlphaFold2 is essential for its prediction accuracy, but it takes up only a small fraction of the total execution time of the end-to-end protein folding pipeline.

Much like a decathlete needs to excel at a wide range of skills, not just the 100m dash, to perform competitively in a decathlon, all parts of an application need to be accelerated to deliver end-to-end performance and value. Real-world AI inference is done through end-to-end pipelines that typically start with the data on disk, perform domain-specific processing to prepare the data for the deep learning module, run the deep learning module, perform further domain-specific processing, and then write the output to disk. These pipelines therefore comprise deep learning (DL) based compute and non-DL compute with varied computational characteristics. Recently, the GPU has been positioned as the platform of choice for deep learning, with performance demonstrated on benchmarks. However, when it comes to AI inference in end-to-end pipelines, CPUs perform well on a much wider range of compute characteristics, resulting in better overall performance.
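One way to see where such a pipeline spends its time is to time each stage separately. The following is a minimal Python sketch of that idea, not code from any of the pipelines discussed here; the stage names and the placeholder stage functions are hypothetical stand-ins for real file IO, domain-specific processing, and model inference.

```python
import time
from typing import Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[object], object]]


def run_pipeline(stages: List[Stage], data: object) -> Tuple[object, Dict[str, float]]:
    """Run the stages in order; return the final output and per-stage wall time (seconds)."""
    timings: Dict[str, float] = {}
    for name, fn in stages:
        start = time.perf_counter()
        data = fn(data)
        timings[name] = time.perf_counter() - start
    return data, timings


def time_fractions(timings: Dict[str, float]) -> Dict[str, float]:
    """Express each stage's time as a fraction of the total run time."""
    total = sum(timings.values())
    if total == 0:
        return {name: 0.0 for name in timings}
    return {name: t / total for name, t in timings.items()}


# Hypothetical placeholder stages (assumptions for illustration only).
stages: List[Stage] = [
    ("read_input", lambda x: list(x)),               # stands in for reading data from disk
    ("preprocess", lambda xs: [v * 2 for v in xs]),  # stands in for non-DL processing
    ("dl_inference", lambda xs: [v + 1 for v in xs]),  # stands in for the DL module
    ("postprocess", lambda xs: sum(xs)),             # stands in for output processing
]

result, timings = run_pipeline(stages, range(5))
for name, fraction in time_fractions(timings).items():
    print(f"{name}: {fraction:.1%} of run time")
```

With real stages plugged in, the per-stage fractions show how much of the run time is DL compute and how much is everything else.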

Applying this to the real world, Intel is disrupting the industry and democratizing AI by delivering better performance at a lower price on CPUs. The 4th gen Intel® Xeon® Scalable Processor, formerly codenamed Sapphire Rapids, is a more balanced platform for AI inference with its 1) larger cache that helps with data locality, 2) higher core frequency, multiple scalar ports and out-of-order execution that help accelerate compute that is single-threaded, or multi-threaded but scalar, 3) Intel® Advanced Vector Extensions 512 (Intel® AVX-512) that helps with non-DL vector compute, 4) Intel® Advanced Matrix Extensions (Intel® AMX) that provides built-in hardware support for AI acceleration, and 5) large memory capacity that allows larger problems to be solved. Through this processor, Intel delivers the AI of tomorrow on the platform that you have today by enhancing deep learning inference pipelines end to end. We demonstrate this by setting new industry records for two of the most popular end-to-end AI-inference pipelines in digital biology: AlphaFold2 for protein folding and DeepVariant for variant calling.
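To check whether a given Linux machine exposes the AVX-512 and AMX features named above, one option is to look at the CPU flags the kernel reports. The sketch below assumes a recent Linux kernel that lists the flag names avx512f, amx_tile, amx_bf16 and amx_int8 in /proc/cpuinfo; the helper function names are our own and are not part of any Intel tool.

```python
import os
from typing import Dict, Iterable, Set

# Flag names as reported by recent Linux kernels in /proc/cpuinfo (assumption:
# older kernels may not report the amx_* flags even on capable hardware).
FEATURE_FLAGS = ("avx512f", "amx_tile", "amx_bf16", "amx_int8")


def parse_cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Return the flags listed on the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def check_features(flags: Set[str], wanted: Iterable[str] = FEATURE_FLAGS) -> Dict[str, bool]:
    """Map each wanted feature name to whether it appears in the flag set."""
    return {name: name in flags for name in wanted}


if os.path.exists("/proc/cpuinfo"):  # only present on Linux
    with open("/proc/cpuinfo") as f:
        print(check_features(parse_cpu_flags(f.read())))
```

A result with every entry set to True suggests the hardware and kernel both expose AMX with bfloat16 support; whether a framework actually uses it depends on the software stack.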

Protein Folding with AlphaFold2

The protein folding problem is considered a holy grail problem in biology, a task that entails predicting the 3D structure of a protein from its amino acid sequence. Accurate protein structure prediction is vital in biology as a protein’s structure governs its function. It also has significant implications for drug discovery, biotechnology, and understanding the mechanisms of diseases.

The protein folding pipeline using AlphaFold2 consists of two parts: i) preprocessing, which includes database search using file IO and alignment of multiple protein sequences, and ii) model inference, which uses a transformer-based deep learning model. We accelerated this pipeline on a 4th gen Intel Xeon Scalable processor using Intel AMX with bfloat16 precision for DL compute, Intel AVX-512 for non-DL compute, and cache optimizations. Our accelerated implementation is open sourced through the Intel Open Omics Acceleration Framework.
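For readers unfamiliar with bfloat16: it keeps the sign bit and 8 exponent bits of a 32-bit float but only 7 mantissa bits, so it is effectively the top 16 bits of a float32. The following pure-Python sketch illustrates the format only; it is not how AMX or any framework performs the conversion, the function names are our own, and it handles finite values only.

```python
import struct


def float_to_bfloat16_bits(x: float) -> int:
    """Round a finite float to bfloat16 and return its 16-bit pattern."""
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 low bits that are dropped.
    rounding_bias = 0x7FFF + ((bits32 >> 16) & 1)
    return ((bits32 + rounding_bias) >> 16) & 0xFFFF


def bfloat16_bits_to_float(bits16: int) -> float:
    """Expand a 16-bit bfloat16 pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))[0]


def round_trip(x: float) -> float:
    """Value of x after being stored in bfloat16."""
    return bfloat16_bits_to_float(float_to_bfloat16_bits(x))


print(hex(float_to_bfloat16_bits(1.0)))  # 0x3f80
print(round_trip(3.14159))               # 3.140625
```

The second line of output shows the precision trade-off: the value keeps the full float32 range but only about two to three significant decimal digits.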

To demonstrate the performance benefits of Intel CPUs over GPUs, we benchmarked our Open Omics version of the pipeline against the fastest known GPU implementation, FastFold v. 0.2.0, on the four platforms mentioned in the following table.

Platform Name	CPU	GPU
1 GCP a2-highgpu-1g	6 cores (12 vCPUs)	1 Nvidia 40 GB A100 GPU
1 CPU socket	56 core Intel® Xeon® Platinum 8480+ processor	None
1 CPU socket + 1 GPU	56 core Intel® Xeon® Platinum 8480+ processor	1 Nvidia 40 GB A100 GPU
2 CPU sockets	2 sockets of 56 core Intel® Xeon® Platinum 8480+ processor	None

Table 1: The platforms used for benchmarking.

FastFold uses tensor cores with bfloat16 precision. In all public implementations of AlphaFold2 (including FastFold), preprocessing runs on the CPU even when model inference runs on a GPU.

Figure 1: The two charts compare the performance of the four platforms mentioned in Table 1 for a set of C. elegans proteins of length less than 1000 (top chart) and proteins of length 1000-3800 (bottom chart). All the experiments use bfloat16 for model inference. Each chart has four stacked bars showing execution time: i) measured[1] FastFold model inference on the A100 of the GCP A100 instance and FastFold preprocessing on the CPU host (12 vCPUs) of the same instance; ii) measured[1] Open Omics preprocessing and model inference on 1 CPU socket; iii) obtained by adding the model inference time of bar 1 to the preprocessing time of bar 2, that is, FastFold model inference on the A100 plus the faster Open Omics preprocessing on 1 CPU socket, representing the best-case performance using 1 CPU socket and 1 GPU; iv) measured[1] Open Omics preprocessing and model inference on 2 CPU sockets (1 dual-socket CPU node). Results may vary. See backup for configurations.

As shown in the chart above, model inference consumes only a fraction of the execution time of the protein folding pipeline on the GCP A100 instance. For a set of proteins of lengths less than a thousand, Open Omics AlphaFold2 on a single Intel CPU socket is 9.5 times faster than GPU-based FastFold on an A100 GCP instance. This is mainly because preprocessing consumes the majority of the time, and it runs on only 6 cores on the GCP instance compared to 56 cores on the CPU socket.

Intel’s AlphaFold2 on a dual-socket Intel CPU node beats the best-case CPU-GPU configuration of 1 A100 + 1 Intel CPU socket by 30%. This is because adding one A100 to a CPU socket scales only model inference, whereas adding another CPU socket scales both model inference and preprocessing.
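The scaling argument can be made concrete with a toy timing model. The sketch below is an idealization, not a reproduction of the measurements in Figure 1: it assumes both stages scale perfectly across two sockets, it adds stage times sequentially as bar iii of Figure 1 does, and the numbers passed in are made-up values in arbitrary units.

```python
def cpu_plus_gpu_time(preprocess: float, gpu_inference: float) -> float:
    """One CPU socket preprocesses, then one GPU runs model inference."""
    return preprocess + gpu_inference


def dual_socket_time(preprocess: float, cpu_inference: float) -> float:
    """Two CPU sockets share both stages; assumes ideal 2x scaling (an idealization)."""
    return (preprocess + cpu_inference) / 2.0


def dual_socket_wins(preprocess: float, cpu_inference: float, gpu_inference: float) -> bool:
    """True when the dual-socket node finishes before the CPU + GPU configuration.

    Algebraically this holds when cpu_inference < preprocess + 2 * gpu_inference.
    """
    return dual_socket_time(preprocess, cpu_inference) < cpu_plus_gpu_time(preprocess, gpu_inference)


# Hypothetical single-socket stage times in arbitrary units (not measured data).
print(dual_socket_wins(preprocess=100.0, cpu_inference=40.0, gpu_inference=10.0))  # True
print(dual_socket_wins(preprocess=10.0, cpu_inference=100.0, gpu_inference=5.0))   # False
```

Under these assumptions the second socket wins whenever preprocessing dominates, and the GPU wins only when model inference on the CPU is far longer than preprocessing.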

Even though most proteins are shorter than a thousand, we also benchmarked large proteins (lengths up to 3800), and the CPU's throughput advantage continues to hold. For proteins of length larger than 3800, the 40 GB A100 GPU ran out of memory, while the Intel Xeon Scalable processor successfully folded proteins of length 9000 and above. These results provide a strong argument in favor of establishing the Intel CPU as the platform of choice for end-to-end deep learning-based inference pipelines.

DeepVariant for Variant Calling

The rate of DNA sequencing data generation is advancing at a dramatic pace; however, the rapid rise in sequencing throughput demands a commensurate increase in the rate of analysis of the sequence data. Variant Calling (VC) is a fundamental task in sequence analysis. Given the sequencing reads from an individual’s genome, VC identifies different kinds of variations in the genome against a given reference genome. In 2017, Google proposed DeepVariant, a deep learning-based germline variant caller, which immediately established itself as a highly accurate state-of-the-art variant caller, leading to high popularity and production use in many genomics studies.

The variant calling pipeline using DeepVariant consists of the following two parts: i) model inference using the Inception V3 deep learning model, and ii) the rest, which includes file IO, data processing tasks such as indexing, searching, aligning, and sorting to prepare data for inference, and postprocessing of the inference output. We accelerated this pipeline on 4th gen Intel Xeon Scalable processors using Intel AMX with bfloat16 precision for DL compute, Intel AVX-512 for non-DL compute, cache optimizations, and scaling to multiple CPU sockets. Our accelerated implementation is open sourced through the Intel Open Omics Acceleration Framework.

We benchmarked the Open Omics version of the pipeline to demonstrate the performance benefits of CPUs over GPUs. We use the standard 30x coverage whole genome sequencing short reads dataset that is typically used for such benchmarking exercises. For GPU performance, we use the best reported run time, recently claimed by Nvidia: Nvidia Clara Parabricks on a DGX A100 system with 8 A100 GPUs can perform variant calling on the standard dataset in just 16 mins.

As illustrated in the following figure, Open Omics takes just 109 mins on a single socket of the Intel CPU. Moreover, it takes just 8.5 mins on 8 dual-socket Intel CPU nodes (16 sockets), which is nearly 1.9 times faster than a DGX A100 system with 8 A100 GPUs. On further scaling to 32 dual-socket Intel CPU nodes (64 sockets), Open Omics takes just 3 mins, significantly pushing the envelope on run time with 5.3 times better latency than any published performance claim for this task to date.
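The ratios quoted above follow directly from the run times in the text. The short Python sketch below reproduces that arithmetic; it uses only the numbers stated in this article, and the scaling-efficiency figure is our own derived quantity (speedup over one socket divided by socket count), not a number reported by Intel or Nvidia.

```python
PARABRICKS_DGX_A100_MIN = 16.0  # run time reported by Nvidia, as cited in this article

# Open Omics run times from this article: number of CPU sockets -> minutes.
OPEN_OMICS_MIN = {1: 109.0, 16: 8.5, 64: 3.0}


def speedup(baseline_min: float, new_min: float) -> float:
    """How many times faster new_min is than baseline_min."""
    return baseline_min / new_min


def scaling_efficiency(one_socket_min: float, n_socket_min: float, sockets: int) -> float:
    """Speedup over one socket divided by the number of sockets (1.0 = perfect scaling)."""
    return speedup(one_socket_min, n_socket_min) / sockets


for sockets in (16, 64):
    t = OPEN_OMICS_MIN[sockets]
    vs_gpu = speedup(PARABRICKS_DGX_A100_MIN, t)
    eff = scaling_efficiency(OPEN_OMICS_MIN[1], t, sockets)
    print(f"{sockets} sockets: {vs_gpu:.2f}x vs DGX A100, scaling efficiency {eff:.2f}")
# 16 sockets: 1.88x vs DGX A100, scaling efficiency 0.80
# 64 sockets: 5.33x vs DGX A100, scaling efficiency 0.57
```

The 1.88x and 5.33x values match the "nearly 1.9 times" and "5.3 times" figures in the text.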

Figure 2. Execution time of three implementations of the DeepVariant germline pipeline on the WGS 30x human (HG001) short read dataset. CPU socket refers to one Intel® Xeon® Platinum 8480+ processor. The first bar shows the measured[2] performance of the baseline CPU implementation on a single CPU socket. The next few bars show the measured[3] performance of the Open Omics optimized implementation as we increase the number of CPU sockets. The last (green) bar shows the performance reported by Nvidia[4]. Results may vary. See backup for configurations.

The results demonstrate that computing on 8 CPU nodes takes less time than on the DGX A100 system with 8 A100 GPUs. This is also yet another example of an end-to-end pipeline in which the dense GPU-friendly compute is limited. Only about 25% of the pipeline's run time is spent on deep learning-based dense compute, while the rest consists of irregular compute and memory access, making the pipeline more amenable to the balanced CPU architecture.
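Amdahl's law gives a quick way to see why a 25% DL share limits what a DL-only accelerator can achieve. The sketch below applies the standard formula; it assumes the 25% figure is the DL share of the run time being accelerated, which the text does not specify in detail, and the 10x accelerator factor is a made-up example value.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when only `accelerated_fraction` of run time is sped up by `factor`."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


DL_FRACTION = 0.25  # "about 25%" from the text above

# Hypothetical case: the DL part alone becomes 10x faster.
print(f"{amdahl_speedup(DL_FRACTION, 10.0):.2f}x overall")          # 1.29x overall
# Upper bound: the DL part takes no time at all.
print(f"{amdahl_speedup(DL_FRACTION, float('inf')):.2f}x overall")  # 1.33x overall
```

Under this assumption, even an infinitely fast DL engine speeds up the whole pipeline by only about a third, so further gains have to come from the non-DL 75%.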

Looking to the Future

With the open-source release of AlphaFold2 and DeepVariant optimized for Intel Xeon Scalable Processors, we have demonstrated that infusing AI inference into real-world applications requires taking a holistic end-to-end view of the overall application. Performance on a single benchmark may therefore not be indicative of performance on entire applications. Accelerating such applications requires a balanced compute platform that accelerates both DL and non-DL compute. Furthermore, with our Intel Open Omics Acceleration Framework, we seek to provide the global community with an easy-to-use, high-performance framework to bring the benefits of AI for Life Sciences to every person on the planet.

Configuration Details

For AlphaFold2 pipeline:

BASELINE (GPU) on GCP A100 instance (GCP a2-highgpu-1g): Test by Intel as of 05/21/23. 1-instance, 12 vCPUs (Cascade Lake), 85 GB total memory, 40 GB A100 GPU, bios: Google, ucode: 0x1, Ubuntu 20.04, 5.13.0-1024-gcp, PyTorch - v1.12.1, Fastfold - v 0.2.0, Hmmer - v3.3.2, hh-suite - v3.3.0, Kalign2 – v2.04, framework version: PyTorch - v1.12.1, model name & version: AlphaFold2

Open Omics on 1 CPU socket: Test by Intel as of 05/21/23. 1-socket, 1x Intel® Xeon® Platinum 8480+, 56 cores, HT On, Turbo On, Total Memory 1024 GB (16 slots/ 64 GB/ DDR5 4800 MT/s [4800 MT/s]), bios: SE5C7411.86B.9525.D13.2302071332, ucode version: 0x2b000190, OS Version: Rocky Linux 8.7 (Green Obsidian), kernel version: 4.18.0-372.32.1.el8_6.crt2.x86_64, compiler version: g++ 9.4.0, workload version: Intel-python - 2022.1.0 JAX - v0.4.8, AlphaFold2, - v2.0, Hmmer (Our optimizations over v3.3.2), hh-suite (Our optimizations over v3.3.0), Kalign2 – v2.04, framework version: PyTorch - v1.11.0, model name & version: AlphaFold2

Open Omics on 2 CPU sockets: Test by Intel as of 05/21/23. 1-node, 2x Intel® Xeon® Platinum 8480+, 56 cores, HT On, Turbo On, Total Memory 1024 GB (16 slots/ 64 GB/ DDR5 4800 MT/s [4800 MT/s]), bios: SE5C7411.86B.9525.D13.2302071332, ucode version: 0x2b000190, OS Version: Rocky Linux 8.7 (Green Obsidian), kernel version: 4.18.0-372.32.1.el8_6.crt2.x86_64, compiler version: g++ 9.4.0, workload version: Intel-python - 2022.1.0 JAX - v0.4.8, AlphaFold2, - v2.0, Hmmer (Our optimizations over v3.3.2), hh-suite (Our optimizations over v3.3.0), Kalign2 – v2.04, framework version: PyTorch - v1.11.0, model name & version: AlphaFold2

Open Omics Preprocessing on 1 CPU socket + FastFold Model inference on 1 A100 for best CPU-GPU performance: Since preprocessing and model inference can be run independently, the time for this case was obtained by adding preprocessing on 1 CPU socket using Open Omics with model inference on GCP A100 instance using FastFold.

For DeepVariant pipeline:

BASELINE on CPU: Test by Intel as of 06/15/23. 1-socket, 1x Intel® Xeon® Platinum 8480+, 56 cores, HT On, Turbo On, Total Memory 256 GB (8 slots/ 32 GB/ DDR5 4800 MT/s [4800 MT/s]), bios: SE5C7411.86B.9525.D13.2302071332, ucode version: 0x2b000190, OS Version: Rocky Linux 8.7 (Green Obsidian), kernel version: 4.18.0-372.32.1.el8_6.crt2.x86_64, compiler version: g++ 9.4.0, workload version: bwa-mem v0.7.17, Samtools v. 1.16.1, DeepVariant v1.5, framework version: Intel-tensorflow 2.11.0, model name & version: Inception V3

Open Omics on CPU: Test by Intel as of 06/10/23. 1-node (1,2 socket), 2/4/8/16/32-nodes (4/8/16/32/64 sockets), Each socket is 1x Intel® Xeon® Platinum 8480+, 56 cores, HT On, Turbo On, Total Memory 256 GB (8 slots/ 32 GB/ DDR5 4800 MT/s [4800 MT/s]), bios: SE5C7411.86B.9525.D13.2302071332, ucode version: 0x2b000190, OS Version: Rocky Linux 8.7 (Green Obsidian), kernel version: 4.18.0-372.32.1.el8_6.crt2.x86_64, compiler version: g++ 9.4.0, workload version: bwa-mem2 v2.2.1, Samtools v. 1.16.1, Our optimized version of DeepVariant v1.5, framework version: Intel-tensorflow 2.11.0, model name & version: Inception V3

Nvidia Clara Parabricks on DGX A100 box: Performance reported by Nvidia on 03/21/2023 on a DGX A100 box running Nvidia Clara Parabricks; reported here: https://developer.nvidia.com/blog/long-read-sequencing-workflows-and-higher-throughputs-in-nvidia-parabricks-4-1/

Notices & Disclaimers

Performance varies by use, configuration and other factors. Learn more on the Performance Index site.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.

Your costs and results may vary.

Intel technologies may require enabled hardware, software or service activation.

[1] Test by Intel as of May 21, 2023

[2] Test by Intel as of June 15, 2023

[3] Test by Intel as of June 10, 2023

[4] As mentioned on this link on Mar 21, 2023: https://developer.nvidia.com/blog/long-read-sequencing-workflows-and-higher-throughputs-in-nvidia-parabricks-4-1/

BenOlopa · 07-21-2023

This comparison is incorrect.

Your number for Parabricks (16 mins) is the runtime for their germline pipeline: BWA-MEM, sort, markdups, BQSR, HaplotypeCaller.

DeepVariant runs in Parabricks in a couple of minutes on a DGX A100.

Sanchit_Misra · 07-21-2023

@BenOlopa : Our results are also for the germline pipeline: BWA-MEM => sort => DeepVariant. Please note the caption of Figure 2. These are not results of just the DeepVariant step.