
Introduction to Getting Faster PyTorch* Programs with TorchDynamo

Adam_Wolf
Employee

In the webinar Introduction to Getting Faster PyTorch* Programs with TorchDynamo, presenters Yuning Qiu and Zaili Wang introduce the new computational graph capture features in PyTorch* 2.0, with a focus on TorchDynamo and its associated technologies. TorchDynamo is designed to make PyTorch* scripts faster with minimal code changes while maintaining flexibility and ease of use. Note that the feature is now exposed under the API name “torch.compile” in the latest PyTorch* documentation; however, TorchDynamo was the name used for the feature as a whole when it was first introduced, and that is the terminology adopted in this tutorial.


Motivation and Design Principles

PyTorch*, widely adopted by data scientists and researchers for its ease of use and Pythonic philosophy, operates primarily in an “imperative mode” (also known as eager mode). This mode executes user code in a step-by-step manner, allowing for flexibility and easy debugging. However, imperative execution can be suboptimal for large-scale model deployment, where performance gains are often achieved by compiling the model into an optimized computational graph. Traditional approaches in PyTorch*, like TorchScript (JIT) and FX, provide graph compilation but have several limitations, particularly in handling control flow and backward graph optimization. TorchDynamo was developed to address these shortcomings by providing a more seamless graph capture process without compromising PyTorch’s inherent flexibility.

TorchDynamo: Overview and Key Components

TorchDynamo operates by hooking into Python’s frame evaluation process (enabled by PEP 523) and analyzing Python bytecode at runtime. This allows it to dynamically capture computational graphs during eager mode execution. TorchDynamo is responsible for converting PyTorch* code into an intermediate representation (IR) that can be optimized by a backend compiler, such as TorchInductor. It works alongside several key technologies:

 


 

  1. AOTAutograd: Used to simultaneously trace forward and backward computational graphs in an ahead-of-time manner, improving performance for both training and inference. AOTAutograd partitions these graphs into smaller segments, which can then be compiled into efficient machine code.
  2. PrimTorch: Simplifies and reduces the number of operators that backend compilers need to implement by lowering PyTorch’s original operations to a set of around 250 primitive operators. PrimTorch thus enhances the portability and extensibility of compiled PyTorch* models across different hardware platforms.
  3. TorchInductor: The backend compiler responsible for translating the captured computational graphs into optimized machine code. TorchInductor supports optimizations for both CPUs and GPUs, including Intel’s contributions to the CPU backend and to the Triton-based GPU backend.
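To make this pipeline concrete, here is a minimal, hedged sketch of how TorchDynamo hands a captured graph to a backend. The toy model and the inspect_backend function are purely illustrative; any callable with this signature can serve as a torch.compile backend, and "inductor" is the default backend name.

```python
import torch
import torch.nn as nn

# Toy model used only for illustration.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))

def inspect_backend(gm: torch.fx.GraphModule, example_inputs):
    # TorchDynamo passes the captured graph to the backend as an FX GraphModule.
    gm.graph.print_tabular()
    return gm.forward  # run the captured graph as-is, with no further optimization

# Route the captured graph through the illustrative backend above...
debug_model = torch.compile(model, backend=inspect_backend)
debug_model(torch.randn(4, 16))

# ...or let the default TorchInductor backend generate optimized code.
fast_model = torch.compile(model, backend="inductor")
fast_model(torch.randn(4, 16))
```

The first call triggers capture and compilation; later calls with compatible inputs reuse the cached artifact, as described in the guard mechanism section below.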

Intel's Contributions to TorchInductor

Intel® has played a pivotal role in enhancing the performance of PyTorch* models on both CPUs and GPUs:

  • CPU Optimizations: Intel has implemented vectorization using the AVX2 and AVX-512 instruction sets for over 94% of inference and training kernels in PyTorch* models. This has led to significant performance improvements, with speedups ranging from 1.21x to 3.25x depending on the precision used (e.g., FP32, BF16, or INT8); see the sketch after this list.


 

  • GPU Support with Triton: Triton, developed by OpenAI, is a Python-based domain-specific language (DSL) for writing machine learning kernels that run on GPUs. Intel has extended Triton to support its GPU architectures, bridging the gap between Triton’s GPU dialect and Intel’s SYCL* implementation through the use of SPIR-V IR. This extensibility ensures that Triton can be used to optimize PyTorch* models on Intel GPUs.
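The CPU-side optimizations described in the first bullet can be exercised with a short sketch like the one below, which compiles a small model and runs it under BF16 autocast. The model, shapes, and any speedup are illustrative only; actual gains depend on the processor’s AVX-512/AMX support.

```python
import torch
import torch.nn as nn

# Illustrative model and shapes; real workloads and speedups will differ.
model = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024)).eval()
compiled = torch.compile(model)  # TorchInductor emits vectorized C++/OpenMP kernels on CPU

x = torch.randn(64, 1024)
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = compiled(x)  # BF16 execution; benefits most on CPUs with AVX-512/AMX support
```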

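To give a feel for the Triton DSL itself, below is the canonical vector-add kernel from Triton’s tutorials, lightly commented. Running it requires a GPU with a Triton backend (for Intel GPUs, the Intel Triton extension), and the block size chosen here is arbitrary.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                      # each program instance handles one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard out-of-bounds lanes
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)                   # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

In practice, TorchInductor generates kernels like this automatically from the captured graph, so users typically never write Triton by hand.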

 

Guard Mechanisms and Caching

TorchDynamo introduces a guard mechanism to handle dynamic control flow and minimize the need for recompilation. Guards track the objects referenced in each frame and ensure that the cached graphs are only reused when no changes have occurred in the computation. If a guard detects a change, it triggers a recompilation, breaking the graph into subgraphs if necessary. This minimizes the performance overhead while ensuring the correctness of the compiled graph.
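A small sketch of this behavior, using an illustrative function: the Python-level branch becomes part of the guard set, so changing the flag invalidates the cached graph and triggers recompilation.

```python
import torch

def f(x, use_sin):
    # Python-level control flow: TorchDynamo specializes the captured graph
    # on the value of `use_sin` and installs a guard on it.
    if use_sin:
        return x.sin()
    return x.cos()

compiled = torch.compile(f)
x = torch.randn(8)
compiled(x, True)   # first call: capture, guard, and compile
compiled(x, True)   # guards still hold -> cached graph is reused
compiled(x, False)  # guard fails -> recompilation for the other branch
```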

Dynamic Shapes and Scalability

One of the key features of TorchDynamo is support for dynamic shapes. Unlike previous graph-compiling methods, which often struggled with input-dependent control flow or shape variations, TorchDynamo can handle varying input shapes without recompiling for every new shape. This significantly improves the flexibility and scalability of PyTorch* models, making them more adaptable to varying workloads.
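As a minimal sketch (the function and sizes are illustrative), passing dynamic=True asks TorchDynamo to trace with symbolic shapes so that varying batch sizes can share one compiled graph:

```python
import torch

@torch.compile(dynamic=True)  # trace with symbolic shapes where possible
def normalize(x):
    return (x - x.mean()) / (x.std() + 1e-6)

# Different batch sizes reuse the same compiled artifact instead of
# recompiling once per shape.
for batch in (4, 16, 64):
    normalize(torch.randn(batch, 128))
```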

Use Cases and Examples

Several practical use cases were highlighted during the webinar to demonstrate the effectiveness of TorchDynamo and TorchInductor. For instance, ResNet50 models trained on Intel CPUs with the Intel® Extension for PyTorch* (IPEX) showed considerable performance improvements when optimized through this stack. Additionally, Intel’s ongoing work on extending Triton for Intel GPUs promises similar performance gains for models deployed on Intel GPU architectures.
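A hedged sketch of that kind of setup is shown below, assuming torchvision is available (weights are left uninitialized to avoid a download); the commented lines show how an IPEX-provided backend could be selected if the extension is installed, which is an assumption about that environment rather than something demonstrated in the webinar recording itself.

```python
import torch
import torchvision.models as models

# ResNet50 as an illustrative workload; weights=None skips the pretrained download.
model = models.resnet50(weights=None).eval()

# Default path: TorchDynamo capture + TorchInductor code generation.
compiled = torch.compile(model)
with torch.no_grad():
    compiled(torch.randn(8, 3, 224, 224))

# With the Intel Extension for PyTorch* (IPEX) installed, its torch.compile
# backend can be selected instead (assumes IPEX registers the "ipex" backend):
#   import intel_extension_for_pytorch as ipex
#   compiled = torch.compile(model, backend="ipex")
```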

Conclusion

TorchDynamo, along with its associated technologies, represents a significant advancement in PyTorch’s ability to efficiently compile and optimize machine learning models. By integrating seamlessly with Python’s runtime and supporting dynamic shapes, TorchDynamo offers a more flexible and scalable solution than earlier approaches like TorchScript and FX. Intel’s contributions, particularly in optimizing performance for both CPUs and GPUs, further enhance the capabilities of this compilation stack. With ongoing development, TorchDynamo and TorchInductor are poised to become critical tools for researchers and engineers looking to deploy high-performance PyTorch* models in production environments.

We also encourage you to check out Intel’s other AI Tools and framework optimizations and learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio.

About the Speakers

Zaili Wang

AI Software Solutions Engineer

Zaili is an AI Software Solutions Engineer on the AI technical customer engineering team in AIA, DCAI, Intel Corp. He joined Intel in 2022, and his work focuses on enabling and optimizing customer AI workloads with Intel-optimized deep learning frameworks, as well as evangelizing Intel AI toolkit products. Zaili holds a PhD in communication and information engineering from Beijing University of Posts and Telecommunications.

Yuning Qiu

AI Software Solutions Engineer

Yuning enables customers to build AI applications by providing technical consulting, creating code samples and proofs of concept, and presenting at workshops and webinars focused on using the Intel® AI Analytics Toolkit on Intel® Xeon® Scalable processors and Data Center GPUs to optimize accelerated computing. His background is in machine learning and deep learning. Yuning holds a Bachelor’s degree in Electronic and Information Engineering from Harbin Institute of Technology, China; a Master’s degree in Electrical and Computer Engineering from Boston University; and a Ph.D. in Electrical Engineering from the University of Texas at Dallas.

 

About the Author
An AI Software Marketing Engineer creating insightful content about the cutting-edge AI and ML technologies and software tools coming out of Intel.