Part 2: Supercharge AI/ML with Intel AMX on AWS EC2 M7i Instances for PyTorch

Mohan_Potheri · ‎10-19-2023

Intel M7i instances on Amazon EC2:

In part 1 of this series, we looked at why Intel Xeon processors are excellent compute engines for AI/ML. In this part, we look at the performance of Amazon EC2 M7i instances, based on 4th generation Intel Xeon processors, for AI inference and training with PyTorch.

Amazon Elastic Compute Cloud (Amazon EC2) M7i-flex and M7i instances[i] represent the forefront of general-purpose computing in the cloud. They are equipped with cutting-edge 4th Generation Intel Xeon Scalable processors, code-named Sapphire Rapids, and boast an impressive 4:1 ratio of memory to virtual CPU (vCPU).

M7i instances deliver strong price performance for workloads that need large instance sizes, scaling up to 192 vCPUs and 768 GiB of memory. They are well suited to workloads with sustained high CPU utilization, such as large-scale application servers, databases, gaming servers, CPU-intensive machine learning tasks, and video streaming. Compared with M6i[ii] instances, M7i instances deliver up to 15% better price performance.

In this paper, we compare the performance of Amazon EC2 M7i instances against the previous-generation Amazon EC2 M6i instances for PyTorch-based training and inference on a general-purpose AI/ML use case. PyTorch 2.0.x combined with Intel Extension for PyTorch (IPEX) 2.0.x was used to measure training and inference performance on a general-purpose dataset.

PyTorch 2.x:

Over the past several years, the PyTorch team has steadily innovated, progressing from PyTorch 1.0 to PyTorch 1.13. The project has also moved to the newly established PyTorch Foundation, which is part of the Linux Foundation.

This latest iteration, PyTorch 2[iii], is a potential game-changer for machine learning (ML) training and development. It maintains backward compatibility while offering a significant boost in performance. A small change to your code can yield noticeably faster execution.

The key objectives for PyTorch 2.0 were:

Achieving a 30% or greater enhancement in training speed while reducing memory usage without necessitating any alterations to existing code or workflows.
Streamlining the backend for PyTorch to make it easier to write and manage, reducing the number of operators from over 2000 to around 250.
Delivering state-of-the-art distributed computing capabilities.
Shifting a significant portion of PyTorch's codebase from C++ to Python.

This release is designed not only to improve performance but also to introduce support for Dynamic Shapes, enabling the use of tensors of varying sizes without triggering recompilation. These improvements make PyTorch 2 more adaptable and easier to customize, and they lower the entry barrier for developers and vendors.

Intel® Extension for PyTorch:

The Intel® Extension for PyTorch*[iv] enhances PyTorch* by incorporating the latest features and optimizations to deliver enhanced performance specifically tailored for Intel hardware. These optimizations leverage cutting-edge technologies like AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs, as well as Intel Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs. Furthermore, by utilizing the PyTorch* XPU device, the Intel® Extension for PyTorch* facilitates seamless GPU acceleration for Intel discrete GPUs within the PyTorch* framework.

This extension offers optimization support for both eager mode and graph mode in PyTorch*. However, it's worth noting that in PyTorch*, graph mode typically outperforms eager mode when it comes to optimization techniques like operation fusion. The Intel® Extension for PyTorch* takes this advantage even further by implementing more comprehensive graph optimizations, further amplifying the overall performance benefits.
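As a rough illustration of how these optimizations are typically switched on for CPU inference, the sketch below combines the channels-last memory format, BF16 autocast, and `ipex.optimize`. The helper names `prepare_model` and `predict` and the toy 67-class network are hypothetical stand-ins, not the code used in the tests described in this article. The extension import is treated as optional, so the sketch falls back to stock PyTorch when the extension is not installed.

```python
import torch
import torch.nn as nn

try:
    # Optional dependency; the sketch still runs on stock PyTorch without it.
    import intel_extension_for_pytorch as ipex
except ImportError:
    ipex = None


def prepare_model(model, bfloat16=False, channels_last=False, extension=False):
    """Apply the requested inference optimizations to an eager-mode model."""
    model = model.eval()
    if channels_last:
        # NHWC layout for 4D weights, which CPU convolution kernels often prefer.
        model = model.to(memory_format=torch.channels_last)
    if extension and ipex is not None:
        dtype = torch.bfloat16 if bfloat16 else torch.float32
        model = ipex.optimize(model, dtype=dtype)
    return model


def predict(model, images, bfloat16=False, channels_last=False):
    """Run one inference batch, optionally under CPU BF16 autocast."""
    if channels_last:
        images = images.contiguous(memory_format=torch.channels_last)
    with torch.no_grad(), torch.autocast(
        device_type="cpu", dtype=torch.bfloat16, enabled=bfloat16
    ):
        return model(images)


# Hypothetical stand-in for an image classifier with 67 output classes.
toy_model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 67),
)
model = prepare_model(toy_model, bfloat16=True, channels_last=True, extension=True)
logits = predict(model, torch.randn(2, 3, 64, 64), bfloat16=True, channels_last=True)
```

On processors without AMX the same code still runs, but BF16 is unlikely to show the speedups discussed in this article.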

Real World Dataset used for Testing:

The MIT Indoor scenes dataset was used for the testing. Recognizing indoor scenes poses a formidable challenge within the realm of high-level computer vision. While numerous scene recognition models excel at identifying outdoor environments, their performance tends to falter when applied to indoor settings. The primary hurdle lies in the diverse nature of indoor scenes. For instance, certain indoor scenes like corridors lend themselves to characterization based on overarching spatial attributes, whereas others, such as bookstores, are better defined by the objects they contain. In essence, addressing the indoor scene recognition problem necessitates the development of a model capable of harnessing both local and global discriminative information.

The dataset comprises 67 indoor categories with 15,620 images in total. The number of images varies by category, but each category contains at least 100 images. All images are provided in JPG format and are intended exclusively for research purposes.

BF16 versus FP32 Data Type:

BF16 can be thought of as a condensed version of FP32: it keeps the sign bit and the 8 exponent bits of FP32 and shortens the mantissa from 23 bits to 7. With minimal code adjustments, it can take the place of FP32 in existing code. Because it preserves the FP32 dynamic range, it does not require techniques such as loss scaling, which FP16 training uses to address underflow. With half as many bits to move per value, BF16 needs less memory bandwidth, and its multipliers need less silicon area than FP32 multipliers. Consequently, BF16 lets data scientists increase batch sizes or train larger and more complex neural networks. It has become a widely adopted floating-point data type in the data science community.
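The relationship between the two formats can be made concrete with a few lines of standard-library Python. The sketch below converts FP32 to BF16 by keeping only the upper 16 bits of the FP32 bit pattern. This is simple truncation, chosen for clarity; real hardware and frameworks generally round to nearest instead, so their results can differ in the last bit. The function names are illustrative.

```python
import struct


def fp32_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def to_bf16_bits(x: float) -> int:
    """Keep the upper 16 bits: 1 sign bit, 8 exponent bits, 7 mantissa bits."""
    return fp32_bits(x) >> 16


def from_bf16_bits(bits: int) -> float:
    """Expand a BF16 pattern back to FP32 by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def bf16_round_trip(x: float) -> float:
    """Value of x after being stored as (truncated) BF16."""
    return from_bf16_bits(to_bf16_bits(x))


# Precision drops: pi keeps only about three significant decimal digits.
approx_pi = bf16_round_trip(3.14159265)  # 3.140625

# Range is preserved: 1e-20 is far below the smallest positive FP16 value
# (about 6e-8) yet remains representable, which is why loss scaling is
# not needed with BF16.
tiny = bf16_round_trip(1e-20)
```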

4th Gen Intel Xeon Scalable processors support BF16 for training and inference. We use this capability on the Amazon EC2 M7i instances and compare it against FP32 on the prior generation while holding accuracy constant. The accuracy goal for training on the dataset was set to greater than 70% to ensure an apples-to-apples comparison.

Testing Infrastructure:

The details of the infrastructure and the components used in the testing are shown below. All aspects of the infrastructure were identical except for the instance types, which were Amazon EC2 M6i and M7i representing Intel Xeon Scalable 3rd and 4th Gen processor categories.

Category	Attribute	m6i	m7i
Run Info	Cumulus Run ID	N/A	N/A
	Benchmark	PyTorch 2.0.1 Training and Inference	PyTorch 2.0.1 Training and Inference
	Intel Extension for PyTorch (IPEX)	2.0.100	2.0.100
	Date	Aug 3-15, 2023	Aug 3-15, 2023
	Test by	Intel	Intel
CSP and VM Config
	Cloud	AWS	AWS
	Region	us-east-1	us-east-1
	Instance Type	m6i.4xlarge	m7i.4xlarge
	CPU(s)	16	16
	Microarchitecture	AWS Nitro	AWS Nitro
	Instance Cost	0.768 USD/hour	0.8065 USD/hour
	Dataset	MIT Indoor Scenes (kaggle.com)	MIT Indoor Scenes (kaggle.com)

Memory
	Memory	64GB	64GB
	DIMM Config
	Memory Capacity / Instance
Network Info
	Network BW / Instance	12.5 Gbps	12.5 Gbps
	NIC Summary
Storage Info
	Storage: NW or Direct Att / Instance	SSD GP2	SSD GP2
	Drive Summary	1 volume 100GB	1 volume 100GB

Table 1: Instance and Benchmark Details for PyTorch

Configuration Details: (Gen to Gen)

The summarized configuration details of the instances used for testing are shown below:

BASELINE: Amazon EC2 M6i, Intel AWS ICX Customized SKU, 16 cores, Memory 64 GB, 12.5 Gbps Network, 100 GB SSD GP2, Canonical, Ubuntu, 22.04 LTS, amd64 jammy image build on 2023-05-16

NEW: 1-Amazon EC2 M7i, Intel AWS SPR Customized SKU, 16 cores, Memory 64 GB, 12.5 Gbps Network, 100 GB SSD GP2, Canonical, Ubuntu, 22.04 LTS, amd64 jammy image build on 2023-05-16

Testing:

The tests were performed in August 2023 on M7i and M6i instances in AWS region us-east-1. The same configuration was used for both training and inference testing. The goal was to compare raw PyTorch training and inference performance on a real-world indoor image dataset, the MIT Indoor Scenes dataset. Details of the software and workload are shown in Table 2.

Category	Attribute	m6i	m7i
Run Info
	Benchmark	PyTorch 2.0.1 Training and Inference	PyTorch 2.0.1 Training and Inference
	Dates	Jun 22-Aug 15, 2023	Aug 3-15, 2023
	Test by	Intel	Intel
Software
	Workload
Workload Specific Details	Dataset	MIT Indoor Scenes	MIT Indoor Scenes
	Command Line	*# Training:* python3 /mnt/data/pyimagesearch-sample/training-all-combos.py --out_file_name="m6i-csv" --bfloat16=True --channels_last=True --extension=True --epochs=10 # ran the same command with all possible combinations (8 in total) of the 3 parameters, namely bfloat16, channels_last and extension. Thus 8 models were created in total. *# Inference:* python /mnt/data/pyimagesearch-sample/inference-all-combos.py --model_file="<m6i-model-name>" --bfloat16=True --channels_last=True --extension=True --out_file_name="m6i-out.csv" # ran batch inference with the 8 models produced above, with the same values for the parameters bfloat16, channels_last and extension as were used while building the corresponding model during training.	*# Training:* python3 /mnt/data/pyimagesearch-sample/training-all-combos.py --out_file_name="m7i-csv" --bfloat16=True --channels_last=True --extension=True --epochs=10 # ran the same command with all possible combinations (8 in total) of the 3 parameters, namely bfloat16, channels_last and extension. Thus 8 models were created in total. *# Inference:* python /mnt/data/pyimagesearch-sample/inference-all-combos.py --model_file="<m7i-model-name>" --bfloat16=True --channels_last=True --extension=True --out_file_name="m7i-out.csv" # ran batch inference with the 8 models produced above, with the same values for the parameters bfloat16, channels_last and extension as were used while building the corresponding model during training.

Table 2: PyTorch Test run configuration
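Table 2 states that the training command was run with all 8 combinations of the three boolean parameters. A minimal sketch of how such a sweep could be enumerated is shown below. The flag names and the `--out_file_name` and `--epochs` arguments come from Table 2. The helper functions and the shortened script path are illustrative assumptions, and how the runs were actually driven is not stated in this article.

```python
from itertools import product

# Flag names as they appear in the Table 2 command lines.
FLAGS = ("bfloat16", "channels_last", "extension")


def all_flag_combinations():
    """Return every True/False assignment of the three flags (2**3 = 8)."""
    return [
        dict(zip(FLAGS, values))
        for values in product((False, True), repeat=len(FLAGS))
    ]


def training_command(script, out_file, flags, epochs=10):
    """Build one command line in the style shown in Table 2."""
    flag_args = [f"--{name}={value}" for name, value in flags.items()]
    parts = ["python3", script, f'--out_file_name="{out_file}"']
    parts += flag_args
    parts.append(f"--epochs={epochs}")
    return " ".join(parts)


combos = all_flag_combinations()
commands = [
    training_command("training-all-combos.py", "m7i-csv", combo)
    for combo in combos
]
```

The first command in the list has all three flags set to False (plain FP32 with the default memory layout and no extension), and the last has all three set to True, matching the command shown in Table 2.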

PyTorch Inference:

Inference is the process of running data points into a trained model to calculate an output such as a single numerical score. This inferencing process is referred to as operationalizing a machine learning model and putting the model into production. The PyTorch platform can be effectively used for deploying trained models in production.

We tested PyTorch inference performance using images processed per second as the metric.
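A throughput figure of this kind is usually obtained by timing a fixed number of batches after a short warm-up. The sketch below shows one way to do that with the standard library. The function name `images_per_second`, the warm-up policy, and the `fake_inference` stand-in are assumptions for illustration, not the harness used for the published numbers.

```python
import time


def images_per_second(run_batch, batches, warmup=1):
    """Time run_batch over the batches and return images processed per second.

    run_batch: callable that processes one batch.
    batches:   list of batches; len(batch) must give the number of images.
    warmup:    number of leading batches to run untimed.
    """
    timed = batches[warmup:]
    if not timed:
        raise ValueError("need at least one batch beyond the warm-up")
    for batch in batches[:warmup]:
        run_batch(batch)  # warm-up runs are excluded from the measurement
    images = 0
    start = time.perf_counter()
    for batch in timed:
        run_batch(batch)
        images += len(batch)
    elapsed = time.perf_counter() - start
    return images / elapsed


def fake_inference(batch):
    """Hypothetical stand-in for a model call; sleeps 10 ms per batch."""
    time.sleep(0.01)


# Four batches of four dummy images; the first batch is warm-up only.
rate = images_per_second(fake_inference, [[0, 1, 2, 3]] * 4, warmup=1)
```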

Data Type/Instance	Relative Performance	Avg images/sec
FP32--M6i	1	48
FP32--M7i	1.09	53
BF16--M7i	1.32	64

Table 3: Raw Data from PyTorch Inference testing
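The relative figures can be recomputed from the rounded averages in Table 3, and combining them with the hourly instance costs in Table 1 gives a rough throughput-per-dollar view. The sketch below does both. The per-dollar numbers are derived here for illustration and are not published results. Ratios computed from the rounded averages (about 1.10 and 1.33) differ slightly from the 1.09 and 1.32 in Table 3, presumably because the table was computed from unrounded measurements.

```python
# Hourly instance cost in USD, from Table 1.
HOURLY_COST_USD = {"M6i": 0.768, "M7i": 0.8065}

# Average images per second, from Table 3.
AVG_IMAGES_PER_SEC = {
    ("FP32", "M6i"): 48,
    ("FP32", "M7i"): 53,
    ("BF16", "M7i"): 64,
}


def relative_performance(results, baseline):
    """Throughput of each configuration divided by the baseline throughput."""
    base = results[baseline]
    return {config: value / base for config, value in results.items()}


def images_per_dollar(results, hourly_cost):
    """Images processed per USD of instance time, assuming steady throughput."""
    return {
        (dtype, instance): rate * 3600 / hourly_cost[instance]
        for (dtype, instance), rate in results.items()
    }


relative = relative_performance(AVG_IMAGES_PER_SEC, ("FP32", "M6i"))
per_dollar = images_per_dollar(AVG_IMAGES_PER_SEC, HOURLY_COST_USD)
```

Under these assumptions, the higher hourly price of the M7i instance is more than offset by its higher throughput in both the FP32 and BF16 configurations.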

The chart comparing the results is shown in Figure 1 below. It plots images processed per second for the three configurations; larger is better. BF16 and FP32 performance on Amazon EC2 M7i are compared against FP32 on Amazon EC2 M6i instances.

Figure 1: Inference performance with average images processed/second across instance types.

The relative difference in processing speed between the different instances and datatypes was also charted as a percentage difference in performance as shown in Figure 2.

Figure 2: Relative image processing speedup

PyTorch Inference Results:

The results show that PyTorch inference on Amazon EC2 M7i is 9% faster for the same data type and 32% faster when BF16 is used. AWS users can benefit by moving their PyTorch inference workloads to M7i and using the Intel Extension for PyTorch for optimized performance.

PyTorch Training:

PyTorch provides a deep learning tensor library rooted in Python and Torch, designed primarily for harnessing the power of GPUs and CPUs. What sets PyTorch apart from other deep learning frameworks like TensorFlow and Keras is its utilization of dynamic computation graphs and its strong adherence to Pythonic principles. PyTorch empowers data scientists, developers, and AI professionals to execute and evaluate specific segments of code in real-time.

We tested PyTorch training performance using time taken to train the model to a fixed accuracy as the metric.
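A time-to-accuracy measurement can be sketched as a loop that trains one epoch at a time and stops the clock once validation accuracy passes the target. The 70% target and the 10-epoch limit below mirror values mentioned elsewhere in this article. The function `train_until_accuracy` and its two callables are hypothetical placeholders for real training and evaluation code.

```python
import time


def train_until_accuracy(train_one_epoch, evaluate, target_accuracy=0.70, max_epochs=10):
    """Train until evaluate() exceeds target_accuracy and report wall-clock time.

    train_one_epoch: callable that runs one pass over the training data.
    evaluate:        callable that returns validation accuracy in [0, 1].
    """
    accuracy = None
    epochs_run = 0
    reached = False
    start = time.perf_counter()
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        accuracy = evaluate()
        epochs_run = epoch
        if accuracy > target_accuracy:
            reached = True
            break
    seconds = time.perf_counter() - start
    return {
        "reached": reached,
        "epochs": epochs_run,
        "accuracy": accuracy,
        "seconds": seconds,
    }


# Hypothetical stand-ins: accuracy improves a little with every epoch.
fake_accuracies = iter([0.50, 0.65, 0.72, 0.80])
result = train_until_accuracy(lambda: None, lambda: next(fake_accuracies))
```

Stopping at a common accuracy threshold keeps the comparison fair across data types, since a faster run that ends at lower accuracy would not be equivalent work.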

Data Type/Instance	Relative Performance	Training Time (seconds)
FP32--M6i	1	2199
FP32--M7i	1.25	1760
BF16--M7i	2.53	868

Table 4: Raw Data from PyTorch Training
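Because training is measured as time to a fixed accuracy, the speedup is the baseline time divided by the new time, and multiplying the time by the hourly instance price from Table 1 gives an approximate cost per training run. The sketch below works through that arithmetic. The cost figures are derived here for illustration only and assume per-second billing at the listed hourly rates.

```python
# Hourly instance cost in USD, from Table 1.
HOURLY_COST_USD = {"M6i": 0.768, "M7i": 0.8065}

# Training time in seconds to the fixed accuracy target, from Table 4.
TRAINING_SECONDS = {
    ("FP32", "M6i"): 2199,
    ("FP32", "M7i"): 1760,
    ("BF16", "M7i"): 868,
}


def speedup(times, baseline):
    """Baseline time divided by each configuration's time (higher is faster)."""
    base = times[baseline]
    return {config: base / seconds for config, seconds in times.items()}


def cost_per_run(times, hourly_cost):
    """Approximate USD cost of one training run for each configuration."""
    return {
        (dtype, instance): seconds / 3600 * hourly_cost[instance]
        for (dtype, instance), seconds in times.items()
    }


speedups = speedup(TRAINING_SECONDS, ("FP32", "M6i"))
costs = cost_per_run(TRAINING_SECONDS, HOURLY_COST_USD)
```

The computed speedups round to 1.25 and 2.53, matching the Relative Performance column of Table 4. Under the stated billing assumption, the BF16 run on M7i costs well under half of the FP32 run on M6i.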

Figure 3: PyTorch Training performance across instance types

Figure 4: Relative Training Image Processing Speed

PyTorch training Results:

The results show that PyTorch training on Amazon EC2 M7i is 1.25x faster for the same data type and 2.5x faster when BF16 is used. AWS users can benefit significantly by moving their small to medium-sized real-life training workloads to M7i with PyTorch and using the Intel Extension for PyTorch for optimized performance.

Conclusion:

4th Gen Intel Xeon Scalable processors with AMX offer improved AI/ML performance. As the results show, Amazon EC2 M7i instances running PyTorch with the Intel Extension for PyTorch offer significant performance advantages. The goal of this project was to compare Intel ICX-based M6i and SPR-based M7i instances on AWS for AI/ML training and inference.

The results show that Amazon EC2 M7i instances outperform the previous-generation M6i instances, with 32% higher throughput for AI inference and 2.5x better training performance with BF16 on M7i versus FP32 on M6i, at identical accuracy.

In the final part 3 of this series, we will look at the use of Amazon EC2 M7i instances for distributed AI/ML training.

Disclosure text:

Tests were performed in August 2023 on AWS in region us-east-1. Full configuration details are shown in Tables 1 and 2. Individual Amazon EC2 m6i.4xlarge and m7i.4xlarge instances were used with the following configuration:

The m6i instance features the 3rd generation Xeon Scalable processors, while the m7i features the 4th generation Xeon Scalable processors.

Instance Size	Physical Cores	Memory (GiB)
m6i.4xlarge	16	64
m7i.4xlarge	16	64

Notices & Disclaimers:

Performance varies by use, configuration, and other factors. Learn more on the Performance Index site.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.

Your costs and results may vary. For further information please refer to Legal Notices and Disclaimers.

Intel technologies may require enabled hardware, software, or service activation.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

References

[i] https://aws.amazon.com/ec2/instance-types/m7i/ Amazon EC2 m7i instances are next-generation general purpose instances powered by custom 4th Generation Intel Xeon Scalable processors (code named Sapphire Rapids) and feature a 4:1 ratio of memory to vCPU. EC2 instances powered by these custom processors, available only on AWS, offer the best performance among comparable Intel processors in the cloud – up to 15% better performance than Intel processors utilized by other cloud providers.

[ii] https://aws.amazon.com/ec2/instance-types/m6i/ Amazon Elastic Compute Cloud (EC2) M6i instances, powered by 3rd Generation Intel Xeon Scalable processors, deliver up to 15% better price performance compared to M5 instances. M6i instances feature a 4:1 ratio of memory to vCPU similar to M5 instances, and support up to 128 vCPUs per instance, which is 33% more than M5 instances. The M6i instances are SAP Certified and ideal for workloads such as backend servers supporting enterprise applications (such as Microsoft Exchange and SharePoint, SAP Business Suite, MySQL, Microsoft SQL Server, and PostgreSQL databases), gaming servers, caching fleets, and application development environments.

[iii] https://pytorch.org/get-started/pytorch-2.0/ PyTorch 2.0 offers the same eager-mode development and user experience, while fundamentally changing and supercharging how PyTorch operates at compiler level under the hood. We are able to provide faster performance and support for Dynamic Shapes and Distributed.

[iv] https://github.com/intel/intel-extension-for-pytorch A Python package for extending the official PyTorch that can easily obtain performance on Intel platform