Introduction
The rapid rise of artificial intelligence (AI) applications, particularly large language models (LLMs), has revolutionized numerous industries. LLMs, which use deep learning techniques to process and generate human-like text, have become integral to applications such as chatbots, sentiment analysis, and automated content creation. This surge in AI capability has created an unprecedented demand for computational power, as training and deploying these sophisticated models requires significant resources. The exponential growth in compute demand is outpacing the available supply, raising concerns about a potential bottleneck in AI development. Nvidia's CUDA software platform has established a stronghold in the AI acceleration market, commanding a dominant share of the AI chip market. Competitors face significant hurdles due to Nvidia's established ecosystem and the inertia created by the widespread adoption of CUDA-based workflows. The dominance of CUDA in open-source AI frameworks, particularly in PyTorch, stems from its early establishment as a robust platform for GPU computing.
Support for diverse hardware backends has been added to PyTorch over the years; however, native support is restricted to CPU, CUDA, and META devices. Support for other accelerators is achieved through out-of-tree extensions, such as Google's TPU/XLA, Intel's Gaudi/HPU, and Huawei's NPU, as well as in-tree additions, such as Apple devices with Metal framework support (MPS) and, most recently, Intel GPU (XPU).
There is, however, no generic device framework that abstracts hardware references away from application code; the accelerator device (such as "cuda" or "mps") or the scale-up/scale-out backend (such as "nccl") often must be specified explicitly. This is evident in the coverage of the PyTorch Framework Unit Tests (UT), in example code, and in the READMEs for new features.
For the PyTorch UTs, adaptations for other accelerators are time consuming. There are two approaches:
- Fork out the unit tests completely and make adaptations locally.
- Create device-specific versions of these tests/examples and upstream them.
Both these approaches are inefficient: both require constant adaptation and duplication. In this article, the attempt is to highlight the areas that lead to this lack of generalization and to propose solutions that help remove these inefficiencies.
Intel AI Accelerators
As of this writing, Intel has two distinct families of AI accelerators for training and inferencing of AI workloads.
- Intel Gaudi – Specialized ASIC (device code: hpu)
- Intel GPU – GP GPU (device code: xpu)
Intel Gaudi is a family of high-performance AI accelerators providing rich software support for PyTorch. It is an out-of-tree PyTorch device, which requires the user to install the software library separately.
Details can be found on the product page at intel.com.
Intel GPU is a family of general-purpose (GP) GPU devices from Intel. It is an in-tree PyTorch device, with support available from PyTorch version 2.4.0.
More information on Intel GPU can be found on the official product pages.
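As a quick illustration (assuming a PyTorch 2.4.0 or newer build that includes XPU support and an Intel GPU in the system), the in-tree device is addressed like any other backend:

import torch

# Assumes a PyTorch 2.4+ build with XPU support and an Intel GPU present.
if torch.xpu.is_available():
    x = torch.randn(4, device="xpu")   # allocate a tensor on the Intel GPU
    print(x.device)                    # -> xpu:0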
PyTorch Device Model
This section provides a high-level overview of the PyTorch device model, which will serve as a foundation to understanding the device abstraction.
PyTorch allows two ways to integrate an accelerator device in its framework:
In-Tree: Devices include CPU, CUDA, META, MPS, XPU
Out-Of-Tree: These devices can have their own device name and dispatch key, or can use the PrivateUse1 device. Gaudi devices have their own device key "hpu" and hence do not rely on PrivateUse1. A sketch of the PrivateUse1 registration hooks follows this list.
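For illustration, a minimal sketch of the Python-side hooks an out-of-tree backend can use to claim the PrivateUse1 slot; the "my_device" name is purely hypothetical, and the C++ kernel registration it relies on is not shown:

import torch

# Rename the reserved PrivateUse1 dispatch key to a friendlier device name
torch.utils.rename_privateuse1_backend("my_device")
# Generate the corresponding torch.Tensor helper methods for the backend
torch.utils.generate_methods_for_privateuse1_backend()
# Tensors can then be created with device="my_device:0", provided the
# backend's C++ kernels are registered against the PrivateUse1 dispatch key.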
In-tree accelerators must override core components of the PyTorch device model. These are explained in Table 1, and a short usage sketch follows the table.
Component | Description |
Device | The device component in PyTorch is represented by the torch.device class, which allows users to specify where tensors are allocated and computations are performed. Example: hpu = torch.device('hpu:0') |
Stream | A stream is a sequence of operations that are executed in order on a specific device. By default, operations are executed in the default stream, but users can create additional streams to overlap computation and data transfer, thus improving performance. Example: stm = torch.xpu.Stream() # create a new XPU stream |
Event | Events track the status of an operation that is being executed, for example work submitted to a stream. |
Guard | Guards manage the device context that is usually required during tensor operations and op dispatching; devices need to override the c10::impl::DeviceGuardImplInterface interface. |
Generator | The generator provides the interface and infrastructure for random number generation, manual seeding, etc., to ensure consistency of random numbers across devices. Example: torch.xpu.manual_seed() |
Allocator | The allocator provides the PyTorch interface for memory allocation/deallocation and hooks for device-specific optimizations. Example: torch.xpu.empty_cache() |
Table 1: PyTorch Device Model Components
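For illustration, a minimal sketch of how these components surface in the Python frontend, shown here through the CUDA namespace (the same structure is mirrored by torch.xpu.* / torch.hpu.* where those backends are available):

import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")  # Device
if device.type == "cuda":
    torch.cuda.manual_seed(42)                     # Generator: device-side RNG seeding
    stream = torch.cuda.Stream()                   # Stream: ordered queue of device work
    event = torch.cuda.Event(enable_timing=True)   # Event: tracks progress on a stream
    with torch.cuda.stream(stream):
        x = torch.ones(1024, device=device)
        event.record(stream)
    event.synchronize()
    torch.cuda.empty_cache()                       # Allocator: release cached device memory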
These overrides ensure that frontend Python APIs such as torch.cuda.device_count() or torch.cuda.stream() can also be extended for a new accelerator, e.g., torch.hpu.device_count(). The PyTorch Python frontend (e.g., torch.hpu.*) binds to the corresponding C++ libraries and API calls through libtorch_python.so, the PyTorch C++ frontend. The actual implementation is housed in the torch/c10 library (Figure 1).
Figure 1 PyTorch Binding Layers
These adaptations necessitate writing a lot of boilerplate code in addition to writing equivalent frontend APIs such as torch.my_device.is_available() (CUDA equivalent: torch.cuda.is_available()). Hence, even though an in-tree implementation has its APIs available in-tree, users still need to do some monkey patching for existing model code or for high-level features such as FSDP.
torch.cuda.is_available()  # replaced by torch.<my_device>.is_available()
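A hypothetical sketch of such monkey patching, redirecting a few torch.cuda.* entry points to another in-tree backend so that unmodified model code keeps working (the choice of XPU here is only illustrative):

import torch

if hasattr(torch, "xpu") and torch.xpu.is_available():
    # Redirect commonly used CUDA queries to the XPU equivalents
    torch.cuda.is_available = torch.xpu.is_available
    torch.cuda.device_count = torch.xpu.device_count
    torch.cuda.current_device = torch.xpu.current_device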
An in-tree implementation gives us the facility to integrate the device into the PyTorch unit test framework. However, due to the lack of device abstraction in the current code, the in-tree devices other than the native ones (CPU/CUDA) seem to be using their own versions of the UTs rather than extending the existing ones (e.g., MPS).
For an out-of-tree device implementation with its own device key, such as Intel Gaudi (HPU), the changes needed to the actual PyTorch code base are minimal and involve, among other things, adding an entry in the DeviceTypes and TensorOptions as well as registering handlers for operator dispatch under its own dispatch key. All the device-specific libraries can be housed in a private code base, and binding the C++ libraries to Python happens at runtime.
However, out-of-tree devices present their own challenges when it comes to code reusability and refactoring from CUDA. We first need to install the plugin for the out-of-tree device, import the module in our code, and then replace the CUDA APIs with APIs that cannot be accessed unless the library is loaded.
import habana_frameworks.torch as ht_torch
ht_torch.hpu.is_available()
Beginning with PyTorch 2.5.0 and the introduction of the autoloading of out-of-tree extensions, these imports are no longer needed. The feature uses Python's entry-points mechanism to discover and load all registered entry points in torch/__init__.py.
# import habana_frameworks.torch as ht_torch -- no longer needed
torch.hpu.is_available()
This feature slightly improves the user experience for out-of-tree devices by eliminating the need to explicitly add the import, e.g., import habana_frameworks.torch as ht_torch.
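For completeness, a sketch of how an out-of-tree extension can register itself for autoloading; the entry-point group name "torch.backends" follows the autoloading mechanism described in the PyTorch documentation, while the package and callable names below are purely illustrative:

from setuptools import setup

setup(
    name="torch_my_device",
    entry_points={
        # torch/__init__.py discovers and calls this entry point at import time
        "torch.backends": [
            "my_device = torch_my_device:_autoload",
        ],
    },
)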
Without the device-related code available in-tree, we cannot directly add such devices to the PyTorch feature code (such as FSDP/DTensor/Profiler) or to the Framework Unit Tests without ensuring a proper check for library availability. These checks and wrappers can quickly make the code very messy.
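An illustrative sketch of the kind of availability guard that out-of-tree devices force into shared feature and test code (the module path follows the Gaudi software stack; the helper itself is hypothetical):

import importlib.util
import torch

HAS_HPU = importlib.util.find_spec("habana_frameworks") is not None
if HAS_HPU:
    import habana_frameworks.torch  # noqa: F401  # registers the "hpu" backend

def pick_device() -> torch.device:
    # Prefer the out-of-tree accelerator when its library is present
    if HAS_HPU and hasattr(torch, "hpu") and torch.hpu.is_available():
        return torch.device("hpu")
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")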
Abstracting Device Access
From the previous sections it is clear that, although some streamlining is available for hooking new devices into PyTorch, having device references in the frontend Python code prevents the framework from being truly platform independent. A partial list of frontend APIs and decorators is given in Table 2, followed by a sketch of how the test decorators are used.
API | Usage in PyTorch feature validation |
torch.cuda.device_count() | For retrieving the number of devices/GPUs, needed for determining world_size for distributed training/inference. Used extensively in distributed use cases. |
torch.cuda.set_device() | For setting the device/GPU id for subsequent operations. Used extensively in distributed use cases. |
torch.cuda.get_device_name() | For retrieving the device name from a device ID. Used extensively in distributed use cases. |
torch.cuda.is_available() | For checking whether the CUDA runtime, driver, and hardware are accessible. The de facto API for checking the presence of accelerators in the system. |
torch.cuda.current_device() | For retrieving the index of the currently active device. Used extensively in distributed use cases. |
torch.cuda.synchronize() | For blocking the host until device work completes. Used extensively in UTs, which end up being skipped on other devices. |
skipIfCuda | For skipping a specific use case that does not work with CUDA. |
dtypesIfCuda | For selectively picking dtypes when running on CUDA. |
onlyCuda | For restricting a test to CUDA only, even if support is available on another accelerator. |
torch.cuda._sleep() | For introducing delays. Multiple uses seen in the UTs. |
torch.compile | For JIT compilation of code fragments, leveraging the Dynamo infrastructure to optimize code and kernels. Currently defaults to the "inductor" backend, which some devices such as Intel Gaudi do not support. The de facto API for optimizing PyTorch eager-mode code. |
Table 2 Device Specific APIs
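As a sketch of how such decorators appear in the framework UTs (the module paths are PyTorch-internal test utilities, and the exact spellings onlyCUDA/dtypesIfCUDA are assumed):

import torch
from torch.testing._internal.common_utils import TestCase, run_tests
from torch.testing._internal.common_device_type import (
    instantiate_device_type_tests, onlyCUDA, dtypes, dtypesIfCUDA,
)

class TestMyOp(TestCase):
    @onlyCUDA                                    # skipped unless the device is CUDA
    @dtypes(torch.float32)
    @dtypesIfCUDA(torch.float16, torch.float32)  # wider dtype set when on CUDA
    def test_add(self, device, dtype):
        x = torch.ones(4, device=device, dtype=dtype)
        self.assertEqual(x + x, torch.full((4,), 2, device=device, dtype=dtype))

instantiate_device_type_tests(TestMyOp, globals())

if __name__ == "__main__":
    run_tests()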
Technical Debt of Using Device Specific APIs
Since the basic APIs (as seen in the partial list from the earlier section) are not device agnostic, most of the new functionality that has been added very rapidly in recent years also becomes device dependent. This forces a non-CUDA device to add device dependencies using conditional statements or to write device-agnostic wrappers. Both approaches are time consuming. As stated in previous sections, to verify the functionality and support of non-native devices, the UTs also need to be similarly adapted, as in the following excerpt.
# Excerpt from a distributed UT: CUDA-specific device checks and setup
if device == "cuda":
    if torch.cuda.device_count() < self.world_size:
        self.skipTest("Not enough CUDA devices")
    torch.cuda.set_device(dist.get_rank())
tensor = torch.ones([4], device=device)
mesh = dt.DeviceMesh(device, torch.arange(4))
res = ft_c.all_reduce(tensor, "sum", mesh)
self.assertEqual(res, torch.tensor([4, 4, 4, 4], dtype=torch.float))
mesh = dt.DeviceMesh(device, torch.arange(4).view(2, 2))
res2 = ft_c.all_reduce(tensor, "sum", (mesh, 1))
self.assertEqual(res2, torch.tensor([2, 2, 2, 2], dtype=torch.float))
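A sketch of the device-specific part of the same fragment written device-agnostically, assuming a PyTorch build that already provides the abstracted helpers discussed later in this article:

# Works for cuda, xpu, hpu, ... without naming any backend explicitly
device_mod = torch.get_device_module(device)       # e.g. torch.cuda or torch.xpu
if device_mod.device_count() < self.world_size:
    self.skipTest(f"Not enough {device} devices")
device_mod.set_device(dist.get_rank())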
Device Abstraction Initiatives
Some effort has been made to introduce device-agnostic APIs, as seen in this PR for Stream and Event: https://github.com/PyTorch/PyTorch/pull/123611. Some generalization is also being done as part of the introduction of the MTIA device: https://github.com/pytorch/pytorch/pull/123612
We need to expedite such efforts to reap the benefits of community-wide adoption. In addition to device-specific abstraction, there are domain- or feature-specific APIs that need to be abstracted. Some of these issues are highlighted in Table 3; a compile-backend sketch for the Dynamo row follows the table.
Domains | Areas |
Operators/Kernels | Mechanisms to run out-of-tree devices, skip unsupported dtypes, and skip unsupported ops. |
Distributed | Abstraction of process group creation and deletion, abstraction of common APIs such as device_count, and modification of new classes that take CUDA as the default device. The basic init_process_group() API warrants specifying a device-specific backend name such as "nccl". |
Dynamo | Device abstraction, abstraction of backend compiler addition, out-of-tree compiler addition, and working around devices that do not have Inductor support. For example, the default backend could be chosen based on device capabilities or the preferred backend for the device. APIs such as torch.compile() currently default to Inductor. |
Profiler | Abstraction for adding a custom or out-of-tree profiler without explicitly specifying device names. |
General Infrastructure improvements | Removing device references from code, harmonizing APIs to be device agnostic, and adding infrastructure to facilitate adding new device types with minimal effort. |
Table 3 Scope of Abstraction
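As one example for the Dynamo row above, a sketch of per-device backend selection for torch.compile instead of hard-defaulting to "inductor"; the "hpu_backend" string follows Intel Gaudi documentation, and the mapping itself is an assumption:

import torch

def compile_for(model, device_type: str):
    # Pick a compile backend based on the device instead of assuming Inductor
    backend = {"cuda": "inductor", "xpu": "inductor", "hpu": "hpu_backend"}.get(
        device_type, "inductor")
    return torch.compile(model, backend=backend)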
Initiatives and Contribution by Intel
Intel has submitted an RFC (Request for Comments) highlighting the device-name dependency and inconsistency in the PyTorch frontend code, particularly the use of device-specific APIs in the PyTorch Framework Unit Tests, Dynamo, and distributed use cases: https://github.com/pytorch/rfcs/pull/66
Table 4 highlights a partial list of changes introduced.
Area | Changes |
Distributed | Harmonizing APIs to be device agnostic for the creation and deletion of process groups within the UTs. Introducing several PyTorch frontend APIs that abstract out the device details. Added the capability to run the UTs on non-CUDA devices (see the sketch after Table 4). E.g. torch.distributed.get_default_backend_for_device(), torch.get_device_module(device).device_count() |
Dynamo | Added the capability to run the UTs on non-CUDA devices. |
Profiler | Added the capability to run PyTorch out-of-tree profilers. |
Common Infrastructure | Added a mechanism for non-native devices to run the UTs. |
Operators | Facility to add non-native devices to run operator UTs. |
Table 4 Changes introduced for Device Abstraction
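A sketch of device-agnostic process-group setup using the distributed APIs listed in Table 4 (assuming a PyTorch build recent enough to provide them, and a launcher that sets MASTER_ADDR/MASTER_PORT):

import torch
import torch.distributed as dist

def init_distributed(device_type: str, rank: int, world_size: int) -> None:
    backend = dist.get_default_backend_for_device(device_type)  # e.g. "nccl" for cuda
    torch.get_device_module(device_type).set_device(rank)       # no torch.cuda.* reference
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)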
The RFC at https://github.com/pytorch/pytorch/issues/128403 extends the abstraction for in-tree devices by introducing the torch.accelerator interface, which abstracts out direct device references. This change, however, limits the API to in-tree devices. Table 5 highlights a partial list of APIs that will be abstracted out, and a short usage sketch follows the table.
CUDA API | Abstracted API |
torch.cuda.set_device() | |
torch.cuda.is_available() | |
torch.cuda.set_stream() | |
torch.cuda.current_device() | |
Table 5 torch.accelerator Interface
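A minimal usage sketch of the torch.accelerator interface (available only on recent PyTorch builds and, per the RFC, limited to in-tree devices):

import torch

if torch.accelerator.is_available():
    acc = torch.accelerator.current_accelerator()  # e.g. device(type='cuda') or device(type='xpu')
    n = torch.accelerator.device_count()
    torch.accelerator.synchronize()                # device-agnostic analogue of torch.cuda.synchronize()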
Effort needs to be made to publicize these changes so that the abstracted APIs see wider use.
Conclusion
PyTorch is the most popular AI framework; as such, addressing device dependency in the PyTorch frontend code is crucial for enhancing the framework's versatility and usability across various hardware platforms. By transitioning from device-specific APIs to generalized, device-agnostic APIs, we can significantly streamline the development process. The implementation of generic code with proper abstraction allows for seamless integration of new devices with minimal effort. This shift not only simplifies the user experience but also promotes code reusability and maintainability. Care should be taken that such abstraction facilitates both in-tree and out-of-tree devices. Such changes empower developers to write cleaner, more adaptable code that can automatically accommodate different devices, thereby fostering innovation and efficiency in machine learning applications. Ultimately, harmonizing the PyTorch frontend Python APIs to be device agnostic represents a significant step forward in making the framework more accessible and powerful for users across diverse computing environments. This approach not only anticipates future advancements in hardware but also aligns with the growing demand for flexible and scalable machine learning solutions.