In part 2 of this blog series, we compared the performance of Amazon EC2 M7i instances against the previous-generation Amazon EC2 M6i instances for PyTorch-based training and inference in a general-purpose AI/ML use case. In this part 3, we look at leveraging Amazon EC2 M7i instances with PyTorch to scale training through distributed AI/ML.
AI training is a fundamental process in the development of artificial intelligence systems, such as neural networks and machine learning models. It involves teaching an AI system to recognize patterns, make predictions, or perform tasks by exposing it to large amounts of data and adjusting its internal parameters through iterative optimization techniques. AI training is important because it underpins the capabilities of AI systems, allowing them to process and understand data, make intelligent decisions, and perform tasks across a wide range of applications, ultimately transforming industries and improving our daily lives.
AI training requires significant computational resources, making it imperative to address the ever-growing demand for compute power. As AI models become larger and more complex, their training processes demand more processing power and memory. This escalating compute requirement can strain individual hardware setups, leading to prohibitive costs and lengthy training times. To overcome these challenges, distributed training has emerged as a solution.
GPUs are often employed to consolidate significant computing power within a single server, but their cost and availability can be limiting factors. Distributed AI training using Intel Xeon CPUs on the AWS cloud delivers a cost-efficient alternative to the resource-intensive demands of conventional AI training. In a recently conducted study, we explored an effective way to achieve scalability by harnessing multiple nodes in a distributed framework designed for training large models. Included in this post is a deep dive into the study and the results achieved, including a significant reduction in training time.
Introduction to Distributed AI Training:
Artificial Intelligence (AI) is a transformative force in multiple industries, changing problem-solving, prediction, and automation. Machine learning, a subset of AI, has progressed significantly due to deep learning and vast datasets. Effective AI models depend on resource-intensive training, leading to the emergence of Distributed AI Training. This approach addresses scalability and efficiency challenges by distributing tasks across multiple devices, enhancing training speed and model sophistication. Distributed AI Training plays a vital role in improving various AI applications, making AI more powerful and accessible in today's data-rich and complex model landscape.
Distributed AI training accelerates the training of large and complex AI models by distributing the workload across multiple machines, enabling parallel processing. Two primary approaches are data parallelism, where training data is divided into batches for individual machines to train their model copies, and model parallelism, where the model is split into sections assigned to different machines for parallel training. After training, the machines communicate and update the global model parameters. Implementing distributed AI training can be challenging, but it provides substantial performance improvements when training intricate AI models.
Figure 1: Types of distributed AI training[i]
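To make the data-parallel approach concrete, the following minimal sketch shows the standard PyTorch DistributedDataParallel (DDP) pattern: every process keeps a full copy of the model, trains on its own shard of the data, and gradients are averaged across processes after each backward pass. The model and data here are stand-ins for illustration, not the workload used in this study.

```python
# Minimal data-parallel training sketch with PyTorch DistributedDataParallel.
# Assumes the script is started by a distributed launcher (e.g. torchrun) that
# sets RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT in the environment.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dist.init_process_group(backend="gloo")   # gloo is the usual backend for CPU clusters

model = torch.nn.Linear(128, 10)          # stand-in for a real model
ddp_model = DDP(model)                    # wraps the model; gradients are all-reduced

dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
sampler = DistributedSampler(dataset)     # each rank sees a different shard of the data
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(2):
    sampler.set_epoch(epoch)              # reshuffle the shards each epoch
    for features, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(features), labels)
        loss.backward()                   # triggers gradient synchronization across ranks
        optimizer.step()

dist.destroy_process_group()
```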
Benefits of Distributed AI Training
There are several benefits to using distributed AI training:
- Faster training: Distributed AI training can significantly reduce the time it takes to train large and complex AI models.
- Scalability: Distributed AI training can be scaled to train models on very large datasets.
- Cost-effectiveness: Distributed AI training can be more cost-effective than training models on a single machine, especially for large models.
4th Gen Intel® Xeon® processors for Distributed AI Training:
The 4th Gen Intel® Xeon® processors (previously codenamed Sapphire Rapids) are well-suited for distributed AI training because they offer many advantages, including:
- High performance: The latest generation processors offer significant performance improvements over previous generations, thanks to the new architecture and advanced features, making them an ideal choice for training large and complex AI models.
- Scalability: The 4th generation Intel Xeon Scalable processors can be scaled to meet the needs of any training workload, from small research projects to large production deployments. They can be used to build clusters of hundreds or even thousands of machines, which can be used to train the largest and most complex AI models.
- Cost-effectiveness: The 4th generation Intel Xeon Scalable processors are a cost-effective solution for distributed AI training. They offer a good balance of performance and price, and they are supported by a wide range of software and hardware vendors.
- Intel Optimizations: Intel provides a suite of software optimization tools, such as the Intel® oneAPI Toolkit and Intel® Distribution for Python, that further enhance the performance of distributed AI training on Intel Xeon processors.
- Memory Capacity: Intel Xeon processors support large memory capacities, enabling efficient handling of massive datasets used in distributed AI training.
While the advantages above are significant, the 4th Gen Intel Xeon processors also offer advanced features ideal for distributed AI training including:
- Intel Advanced Matrix Extensions (Intel® AMX): Intel AMX is a new instruction set that accelerates matrix multiplication and other operations that are commonly used in AI training. This can lead to significant performance improvements for AI training workloads.
- Intel® In-Memory Analytics Accelerator (Intel® IAA): Intel IAA is a new hardware accelerator that can improve the performance of memory-intensive workloads, such as AI training workloads.
- Intel® Deep Learning Boost (Intel® DL Boost): Intel DL Boost is a suite of technologies that accelerate deep learning workloads on Intel Xeon Scalable processors. This includes support for popular deep learning frameworks, such as TensorFlow, PyTorch, and MXNet.
Overall, the 4th generation Intel Xeon Scalable processors are a great choice for distributed AI training because they offer high performance, scalability, cost-effectiveness, and several features that can be specifically beneficial for distributed AI training.
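As an illustration of how software can take advantage of Intel AMX, the sketch below runs a model under bfloat16 autocast on the CPU, which lets PyTorch's oneDNN-backed kernels dispatch to AMX instructions on 4th Gen Xeon processors where available. This is a minimal, hedged example; the exact speedup depends on the PyTorch build and the model, and it is not the precise configuration used in the study.

```python
# Running a forward pass in bfloat16 on the CPU so that matrix multiplications
# become eligible for Intel AMX acceleration on 4th Gen Xeon processors.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 10),
)
inputs = torch.randn(256, 1024)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    outputs = model(inputs)   # linear layers run in bfloat16 under autocast

print(outputs.dtype)          # torch.bfloat16
```

Intel Extension for PyTorch can further tune a model for Xeon hardware, but the plain autocast path above is enough to exercise bfloat16 and AMX-capable kernels.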
Intel-powered M7i instances on Amazon EC2:
Amazon Elastic Compute Cloud (Amazon EC2) M7i-flex and M7i instances[ii] represent the forefront of general-purpose computing in the cloud. They are equipped with cutting-edge 4th Generation Intel Xeon Scalable processors and boast an impressive 4:1 ratio of memory to virtual CPU (vCPU).
M7i instances offer exceptional flexibility, making them a compelling choice for workloads requiring substantial instance capacities, with capabilities reaching up to 192 vCPUs and 768 GiB of memory. These instances excel in scenarios marked by sustained high CPU utilization and are ideally suited for a range of demanding workloads, such as CPU-intensive machine learning tasks. In comparison to M6i[iii] instances, M7i instances deliver up to 15% better price performance.
In this blog, we’ll look at the scalability of distributed AI training with Amazon EC2 M7i instances used as the building blocks.
PyTorch 2.x:
Over the past several years, the PyTorch team has consistently pushed the boundaries of innovation, progressing from PyTorch 1.0 through PyTorch 1.13. In addition to these advancements, they have transitioned to the newly established PyTorch Foundation, which is now part of the Linux Foundation.
The latest iteration, PyTorch 2[iv], is a potential game-changer in the realm of machine learning (ML) training and development. It not only maintains backward compatibility but also offers a remarkable boost in performance: a simple alteration in your code can yield noticeably faster execution.
The key objectives for PyTorch 2.0 were:
- Achieving a 30% or greater enhancement in training speed while reducing memory usage without necessitating any alterations to existing code or workflows.
- Streamlining the backend for PyTorch to make it easier to write and manage, reducing the number of operators from over 2000 to around 250.
- Delivering state-of-the-art distributed computing capabilities.
- Shifting a significant portion of PyTorch's codebase from C++ to Python.
This release is designed not only to enhance performance but also to introduce support for Dynamic Shapes, enabling the use of tensors of varying sizes without triggering recompilation. These improvements make PyTorch 2 more adaptable, more accessible for customization, and lower the entry barrier for developers and vendors.
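The "simple alteration" referred to above is typically a single call to torch.compile(). The toy example below shows the pattern; actual speedups vary by model and hardware.

```python
# PyTorch 2.x: wrapping an eager-mode model with torch.compile() is the one-line
# change that enables the new compiler stack while keeping the same API.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.GELU(),
    torch.nn.Linear(512, 10),
)
compiled_model = torch.compile(model)   # the one-line alteration

x = torch.randn(64, 512)
y = compiled_model(x)                   # first call triggers compilation; later calls run faster
```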
Hugging Face Accelerate:
Hugging Face Accelerate is a library that enables the same PyTorch code to run across any distributed configuration by adding just four lines of code. It makes training and inference at scale simple, efficient, and adaptable, taking care of the heavy lifting without the need to adapt code for each platform. It also helps convert existing codebases to use DeepSpeed and fully sharded data parallelism, with automatic support for mixed-precision training.[v]
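The handful of added lines look roughly like the sketch below (marked with "# (+)"); the same script can then run on a single machine or be launched across a cluster with accelerate launch. The toy model and data are placeholders for illustration.

```python
# A plain PyTorch training loop adapted to Hugging Face Accelerate.
# The lines marked "# (+)" are essentially the additions Accelerate requires.
import torch
from accelerate import Accelerator                        # (+)

model = torch.nn.Linear(32, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = torch.utils.data.TensorDataset(
    torch.randn(256, 32), torch.randint(0, 4, (256,))
)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=16)
loss_fn = torch.nn.CrossEntropyLoss()

accelerator = Accelerator()                                # (+)
model, optimizer, dataloader = accelerator.prepare(        # (+)
    model, optimizer, dataloader
)

for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    accelerator.backward(loss)                             # (+) replaces loss.backward()
    optimizer.step()
```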
Testing Infrastructure:
The details of the infrastructure and the components used in the testing are shown below. All aspects of the infrastructure were identical across test runs; the instance type was Amazon EC2 M7i, representing the 4th Generation Intel Xeon Scalable processor category.
| Category | Attribute | M7i |
| --- | --- | --- |
|  | Cumulus Run ID | N/A |
|  | Benchmark | Distributed training using Hugging Face Accelerate[vi] and PyTorch 2.0.1 |
|  | Date | October 2023 |
|  | Tested by | Intel |
|  | Cloud | AWS |
|  | Region | us-east-1 |
|  | Instance Type | m7i.4xlarge |
| CSP Config | CPU(s) | 8 |
|  | Microarchitecture | AWS Nitro |
|  | Instance Cost | 0.714 USD/hour |
|  | Number of Instances or VMs (if cluster) | 1-8 |
| Memory | RAM | 32 GB |
| Network Info | Network BW / Instance | 12.5 Gbps |
| Storage Info | Storage: NW or Direct Att / Instance | SSD GP2, 1 volume, 70 GB |
|  | Command Line | # Distributed Training – following example is to run over 8 nodes: mpirun -f hostfile -n 8 -ppn 1 accelerate launch --config_file /home/ubuntu/default_config.yaml --num_cpu_threads_per_process 16 /mnt/data/dist_cpu/trg_img_clsf.py --train_dir /mnt/data/pyimagesearch-sample/indoor-scenes/dataset//train --validation_dir /mnt/data/pyimagesearch-sample/indoor-scenes/dataset//val --num_train_epochs 8 --per_device_train_batch_size 256 --per_device_eval_batch_size 256 --output_dir model_output --cache_dir /tmp/cachedir4accelerate --channels_last --with_tracking --ignore_mismatched_sizes --model_name_or_path google/vit-base-patch16-224-in21k |
Table 1: Instance and Benchmark Details for Distributed training
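For context, here is a hedged sketch of what an image-classification training script along the lines of trg_img_clsf.py might look like, using the model name, dataset path, batch size, and epoch count from the command line in Table 1. It is an illustrative reconstruction under those assumptions, not the actual script used in the testing; evaluation, checkpointing, and tracking are omitted for brevity.

```python
# Illustrative reconstruction of a ViT fine-tuning script driven by Accelerate.
# Paths, model name, and hyperparameters follow the command line in Table 1;
# everything else is an assumption made for the sake of the example.
import torch
from accelerate import Accelerator
from torchvision import datasets, transforms
from transformers import AutoModelForImageClassification

accelerator = Accelerator()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                     # ViT-Base/16 expects 224x224 inputs
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),
])
train_ds = datasets.ImageFolder(
    "/mnt/data/pyimagesearch-sample/indoor-scenes/dataset/train", transform=preprocess
)
train_dl = torch.utils.data.DataLoader(train_ds, batch_size=256, shuffle=True)

model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=len(train_ds.classes),
    ignore_mismatched_sizes=True,                      # mirrors --ignore_mismatched_sizes
)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model, optimizer, train_dl = accelerator.prepare(model, optimizer, train_dl)

for epoch in range(8):                                 # mirrors --num_train_epochs 8
    for pixel_values, labels in train_dl:
        optimizer.zero_grad()
        outputs = model(pixel_values=pixel_values, labels=labels)
        accelerator.backward(outputs.loss)
        optimizer.step()
```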
Configuration Details: (M7i)
The summarized configuration details of the instances used for testing are shown below:
Config: 1-node Amazon EC2 M7i, Intel AWS SPR Customized SKU, 16 cores, Memory 64 GB, 12.5 Gbps Network, 100 GB SSD GP2, Canonical, Ubuntu, 22.04 LTS, amd64 jammy image built on 2023-05-16.
Testing:
The tests were performed in October 2023 on M7i instances in the AWS us-east-1 Region. The goal was to compare epoch times for distributed configurations of 1, 2, 4, and 8 nodes, with distributed training performed using Hugging Face Accelerate and PyTorch 2.0.1. The hardware, software, and workload configuration are shown in Table 1.
The number of nodes in the cluster was varied and the same AI training job was run each time. An epoch is one complete pass through the training dataset. The epoch times were measured for the different node configurations and the data tabulated as shown in Table 2.
| Number of training instance nodes | Time taken to do 8 epochs of training in minutes (lower is better) |
| --- | --- |
| 1 | 110 |
| 2 | 57 |
| 4 | 30 |
| 8 | 15 |
Table 2: Time taken for 8 epochs of training on clusters of different sizes
Results:
The epoch-time data by cluster size was then plotted to visualize the scalability of the distributed training experiment. Figure 2 clearly shows that the distributed solution scales well with the number of nodes, with a small amount of degradation, as expected.
Figure 2: Epoch time graph for different cluster sizes
In a perfect world, four nodes would be twice as fast as two nodes, but there is always some overhead associated with distributed processing. The graph above shows that the solution scales almost linearly, with little loss as more nodes are added. The epoch time decreases as the number of nodes increases, leading to faster training times for the model. Distributed training can therefore be leveraged to meet training SLAs when single-node training does not suffice, and large models that need more compute power than a single node or virtual machine can provide can be trained by adding nodes to increase the compute capacity.
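A quick back-of-the-envelope check using the Table 2 numbers makes the scaling behavior explicit: speedup is the single-node time divided by the N-node time, and scaling efficiency is that speedup divided by N.

```python
# Speedup and scaling efficiency computed from the measured times in Table 2.
times_minutes = {1: 110, 2: 57, 4: 30, 8: 15}   # minutes to complete 8 epochs

for nodes, minutes in times_minutes.items():
    speedup = times_minutes[1] / minutes
    efficiency = speedup / nodes
    print(f"{nodes} node(s): speedup {speedup:.2f}x, scaling efficiency {efficiency:.0%}")

# Roughly 1.93x on 2 nodes, 3.67x on 4 nodes, and 7.33x on 8 nodes,
# i.e. better than 90% scaling efficiency at every tested cluster size.
```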
Conclusion:
Distributed AI training stands as a pivotal advancement in the realm of artificial intelligence, offering scalability and versatility that can revolutionize industries. Its capacity to harness the collective power of multiple hardware resources accelerates AI model development and enables the tackling of ever more complex challenges. This approach has found utility across diverse industries, from healthcare and finance to autonomous vehicles and natural language processing, enhancing decision-making, automation, and innovation. As the demand for AI capabilities continues to grow, distributed training not only meets the computational requirements but also represents a beacon of progress, propelling us into a future where AI systems play an increasingly central role in shaping the world we live in.
Distributed AI training with 4th Gen Intel Xeon Scalable processors featured in Amazon EC2 M7i instances offers a powerful, scalable, and cost-effective solution for training large and complex AI models. In a previous blog, we showed the efficacy of using AMX with Amazon EC2 M7i for training. Here we have shown that AWS customers can effectively leverage the latest Intel Xeon hardware and accelerators such as AMX in a distributed manner to meet their training needs.
References:
[i] https://www.anyscale.com/blog/what-is-distributed-training: What is distributed training? The goal is to use low-cost infrastructure in a clustered environment to parallelize training models.
[ii] https://aws.amazon.com/ec2/instance-types/m7i/ Amazon EC2 m7i instances are next-generation general purpose instances powered by custom 4th Generation Intel Xeon Scalable processors (code named Sapphire Rapids) and feature a 4:1 ratio of memory to vCPU. EC2 instances powered by these custom processors, available only on AWS, offer the best performance among comparable Intel processors in the cloud – up to 15% better performance than Intel processors utilized by other cloud providers.
[iii] https://aws.amazon.com/ec2/instance-types/m6i/ Amazon Elastic Compute Cloud (EC2) M6i instances, powered by 3rd Generation Intel Xeon Scalable processors, deliver up to 15% better price performance compared to M5 instances. M6i instances feature a 4:1 ratio of memory to vCPU similar to M5 instances, and support up to 128 vCPUs per instance, which is 33% more than M5 instances. The M6i instances are SAP Certified and ideal for workloads such as backend servers supporting enterprise applications (such as Microsoft Exchange and SharePoint, SAP Business Suite, MySQL, Microsoft SQL Server, and PostgreSQL databases), gaming servers, caching fleets, and application development environments.
[iv] https://pytorch.org/get-started/pytorch-2.0/ PyTorch 2.0 offers the same eager-mode development and user experience, while fundamentally changing and supercharging how PyTorch operates at compiler level under the hood. We are able to provide faster performance and support for Dynamic Shapes and Distributed.
[v] https://huggingface.co/docs/transformers/accelerate As models get bigger, parallelism has emerged as a strategy for training larger models on limited hardware and accelerating training speed by several orders of magnitude. At Hugging Face, we created the Accelerate library to help users easily train a Transformers model on any type of distributed setup.
[vi] https://huggingface.co/docs/transformers/accelerate As models get bigger, parallelism has emerged as a strategy for training larger models on limited hardware and accelerating training speed by several orders of magnitude. At Hugging Face, we created the Accelerate library to help users easily train a Transformers model on any type of distributed setup.