Introduction
The rapid rise of artificial intelligence (AI) applications, particularly large language models (LLMs), has revolutionized numerous industries. LLMs, which use deep learning techniques to process and generate human-like text, have become integral to applications such as chatbots, sentiment analysis, and automated content creation. This surge in AI capability has created an unprecedented demand for computational power, as training and deploying these sophisticated models requires significant resources. The exponential growth in compute demand is outpacing the available supply, raising concerns about a potential bottleneck in AI development. Nvidia's CUDA software platform has established a stronghold in the AI acceleration market, commanding a dominant share of the AI chip market. Competitors face significant hurdles due to Nvidia's established ecosystem and the inertia created by the widespread adoption of CUDA-based workflows. The dominance of CUDA in open-source AI frameworks, particularly in PyTorch, stems from its early establishment as a robust platform for GPU computing.
Support for diverse hardware backends has been added to PyTorch over the years; however, native support is restricted to CPU, CUDA, and META devices. Support for other accelerators is achieved through out-of-tree extensions, such as Google's TPU/XLA, Intel's Gaudi/HPU, and Huawei's NPU, as well as in-tree additions, such as Apple devices with Metal framework support (MPS) and, most recently, Intel GPU (XPU).
There is, however, no generic device framework that abstracts hardware references away from application code; the accelerator device (such as "cuda" or "mps") or the scale-up/scale-out backend (such as "nccl") often must be specified explicitly. This is evident in the coverage of the PyTorch Framework Unit Tests (UT), in example code, and in the READMEs for new features.
For the PyTorch UTs, adaptations for other accelerators are time consuming. There are two approaches:
- Fork out the unit tests completely and make adaptations locally.
- Create device-specific versions of these tests/examples and upstream them.
Both these approaches are inefficient: both require constant adaptation and duplication. In this article, the attempt is to highlight the areas that lead to this lack of generalization and to propose solutions that help remove these inefficiencies.
Intel AI Accelerators
As of this writing, Intel has two distinct families of AI accelerators for training and inferencing of AI workloads.
- Intel Gaudi – Specialized ASIC (device code: hpu)
- Intel GPU – GP GPU (device code: xpu)
Intel Gaudi is a family of high-performance AI accelerators providing rich software support for PyTorch. It is an out-of-tree PyTorch device, which requires the user to install the software library separately.
Details can be found on the product page at intel.com.
Intel GPU is a family of general-purpose (GP) GPU devices from Intel. It is an in-tree PyTorch device, with support available from PyTorch version 2.4.0.
More information on Intel GPU can be found on the official product pages.
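As a quick illustration (assuming a PyTorch 2.4.0 or newer build that includes XPU support and an Intel GPU in the system), the in-tree device is addressed like any other backend:

import torch

# Assumes a PyTorch 2.4+ build with XPU support and an Intel GPU present.
if torch.xpu.is_available():
    x = torch.randn(4, device="xpu")   # allocate a tensor on the Intel GPU
    print(x.device)                    # -> xpu:0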
PyTorch Device Model
This section provides a high-level overview of the PyTorch device model, which will serve as a foundation to understanding the device abstraction.
PyTorch allows two ways to integrate an accelerator device in its framework:
In-Tree: Devices include CPU, CUDA, META, MPS, XPU
Out-Of-Tree: These devices can have their own device name and dispatch key, or can use the PrivateUse1 device. Gaudi devices have their own device key "hpu" and hence do not rely on PrivateUse1. A sketch of the PrivateUse1 registration hooks follows this list.
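For illustration, a minimal sketch of the Python-side hooks an out-of-tree backend can use to claim the PrivateUse1 slot; the "my_device" name is purely hypothetical, and the C++ kernel registration it relies on is not shown:

import torch

# Rename the reserved PrivateUse1 dispatch key to a friendlier device name
torch.utils.rename_privateuse1_backend("my_device")
# Generate the corresponding torch.Tensor helper methods for the backend
torch.utils.generate_methods_for_privateuse1_backend()
# Tensors can then be created with device="my_device:0", provided the
# backend's C++ kernels are registered against the PrivateUse1 dispatch key.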
In-tree accelerators must override core components of the PyTorch device model. These are explained in Table 1, and a short usage sketch follows the table.
Component | Description |
Device | The device component in PyTorch is represented by the torch.device class, which allows users to specify where tensors are allocated and computations are performed. Example: hpu = torch.device('hpu:0') |
Stream | A stream is a sequence of operations that are executed in order on a specific device. By default, operations are executed in the default stream, but users can create additional streams to overlap computation and data transfer, thus improving performance. Example: stm = torch.xpu.Stream() # create a new XPU stream |
Event | Events track the status of an operation that is being executed, for example work submitted to a stream. |
Guard | Guards manage the device context that is usually required during tensor operations and op dispatching; devices need to override the c10::impl::DeviceGuardImplInterface interface. |
Generator | The generator provides the interface and infrastructure for random number generation, manual seeding, etc., to ensure consistency of random numbers across devices. Example: torch.xpu.manual_seed() |
Allocator | The allocator provides the PyTorch interface for memory allocation/deallocation and hooks for device-specific optimizations. Example: torch.xpu.empty_cache() |
Table 1: PyTorch Device Model Components
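For illustration, a minimal sketch of how these components surface in the Python frontend, shown here through the CUDA namespace (the same structure is mirrored by torch.xpu.* / torch.hpu.* where those backends are available):

import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")  # Device
if device.type == "cuda":
    torch.cuda.manual_seed(42)                     # Generator: device-side RNG seeding
    stream = torch.cuda.Stream()                   # Stream: ordered queue of device work
    event = torch.cuda.Event(enable_timing=True)   # Event: tracks progress on a stream
    with torch.cuda.stream(stream):
        x = torch.ones(1024, device=device)
        event.record(stream)
    event.synchronize()
    torch.cuda.empty_cache()                       # Allocator: release cached device memory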
These overrides ensure that frontend Python APIs such as torch.cuda.device_count() or torch.cuda.stream() can also be extended for a new accelerator, e.g., torch.hpu.device_count(). The PyTorch Python frontend (e.g., torch.hpu.*) binds to the corresponding C++ libraries and API calls through libtorch_python.so, the PyTorch C++ frontend. The actual implementation is housed in the torch/c10 library (Figure 1).
Figure 1 PyTorch Binding Layers
These adaptations necessitate writing a lot of boilerplate code in addition to writing equivalent frontend APIs such as torch.my_device.is_available() (CUDA equivalent: torch.cuda.is_available()). Hence, even though an in-tree implementation has its APIs available in-tree, users still need to do some monkey patching for existing model code or for high-level features such as FSDP.
torch.cuda.is_available()  # replaced by torch.<my_device>.is_available()
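A hypothetical sketch of such monkey patching, redirecting a few torch.cuda.* entry points to another in-tree backend so that unmodified model code keeps working (the choice of XPU here is only illustrative):

import torch

if hasattr(torch, "xpu") and torch.xpu.is_available():
    # Redirect commonly used CUDA queries to the XPU equivalents
    torch.cuda.is_available = torch.xpu.is_available
    torch.cuda.device_count = torch.xpu.device_count
    torch.cuda.current_device = torch.xpu.current_device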
An in-tree implementation gives us the facility to integrate the device into the PyTorch unit test framework. However, due to the lack of device abstraction in the current code, the in-tree devices other than the native ones (CPU/CUDA) seem to be using their own versions of the UTs rather than extending the existing ones (e.g., MPS).
For an out-of-tree device implementation with its own device key, such as Intel Gaudi (HPU), the changes needed to the actual PyTorch code base are minimal and involve, among other things, adding an entry in the DeviceTypes and TensorOptions as well as registering handlers for operator dispatch under its own dispatch key. All the device-specific libraries can be housed in a private code base, and binding the C++ libraries to Python happens at runtime.
However, out-of-tree devices present their own challenges when it comes to code reusability and refactoring from CUDA. We first need to install the plugin for the out-of-tree device, import the module in our code, and then replace the CUDA APIs with APIs that cannot be accessed unless the library is loaded.
import habana_frameworks.torch as ht_torch
ht_torch.hpu.is_available()
Beginning with PyTorch 2.5.0 and the introduction of the autoloading of out-of-tree extensions, these imports are no longer needed. The feature uses Python's entry-points mechanism to discover and load all registered entry points in torch/__init__.py.
# import habana_frameworks.torch as ht_torch -- no longer needed
torch.hpu.is_available()
This feature slightly improves the user experience for out-of-tree devices by eliminating the need to explicitly add the import, e.g., import habana_frameworks.torch as ht_torch.
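For completeness, a sketch of how an out-of-tree extension can register itself for autoloading; the entry-point group name "torch.backends" follows the autoloading mechanism described in the PyTorch documentation, while the package and callable names below are purely illustrative:

from setuptools import setup

setup(
    name="torch_my_device",
    entry_points={
        # torch/__init__.py discovers and calls this entry point at import time
        "torch.backends": [
            "my_device = torch_my_device:_autoload",
        ],
    },
)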
Without the device-related code available in-tree, we cannot directly add such devices to the PyTorch feature code (such as FSDP/DTensor/Profiler) or to the Framework Unit Tests without ensuring a proper check for library availability. These checks and wrappers can quickly make the code very messy.
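An illustrative sketch of the kind of availability guard that out-of-tree devices force into shared feature and test code (the module path follows the Gaudi software stack; the helper itself is hypothetical):

import importlib.util
import torch

HAS_HPU = importlib.util.find_spec("habana_frameworks") is not None
if HAS_HPU:
    import habana_frameworks.torch  # noqa: F401  # registers the "hpu" backend

def pick_device() -> torch.device:
    # Prefer the out-of-tree accelerator when its library is present
    if HAS_HPU and hasattr(torch, "hpu") and torch.hpu.is_available():
        return torch.device("hpu")
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")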
Abstracting Device Access
From the previous sections it is clear that, although some streamlining is available for hooking new devices into PyTorch, having device references in the frontend Python code prevents the framework from being truly platform independent. A partial list of frontend APIs and decorators is given in Table 2, followed by a sketch of how the test decorators are used.
API | Usage in PyTorch feature validation |
torch.cuda.device_count() | For retrieving the number of devices/GPUs, needed for determining world_size for distributed training/inference. Used extensively in distributed use cases. |
torch.cuda.set_device() | For setting the device/GPU id for subsequent operations. Used extensively in distributed use cases. |
torch.cuda.get_device_name() | For retrieving the device name from a device ID. Used extensively in distributed use cases. |
torch.cuda.is_available() | For checking whether the CUDA runtime, driver, and hardware are accessible. The de facto API for checking the presence of accelerators in the system. |
torch.cuda.current_device() | For retrieving the index of the currently active device. Used extensively in distributed use cases. |
torch.cuda.synchronize() | For blocking the host until device work completes. Used extensively in UTs, which end up being skipped on other devices. |
skipIfCuda | For skipping a specific use case that does not work with CUDA. |
dtypesIfCuda | For selectively picking dtypes when running on CUDA. |
onlyCuda | For restricting a test to CUDA only, even if support is available on another accelerator. |
torch.cuda._sleep() | For introducing delays. Multiple uses seen in the UTs. |
torch.compile | For JIT compilation of code fragments, leveraging the Dynamo infrastructure to optimize code and kernels. Currently defaults to the "inductor" backend, which some devices such as Intel Gaudi do not support. The de facto API for optimizing PyTorch eager-mode code. |
Table 2 Device Specific APIs
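As a sketch of how such decorators appear in the framework UTs (the module paths are PyTorch-internal test utilities, and the exact spellings onlyCUDA/dtypesIfCUDA are assumed):

import torch
from torch.testing._internal.common_utils import TestCase, run_tests
from torch.testing._internal.common_device_type import (
    instantiate_device_type_tests, onlyCUDA, dtypes, dtypesIfCUDA,
)

class TestMyOp(TestCase):
    @onlyCUDA                                    # skipped unless the device is CUDA
    @dtypes(torch.float32)
    @dtypesIfCUDA(torch.float16, torch.float32)  # wider dtype set when on CUDA
    def test_add(self, device, dtype):
        x = torch.ones(4, device=device, dtype=dtype)
        self.assertEqual(x + x, torch.full((4,), 2, device=device, dtype=dtype))

instantiate_device_type_tests(TestMyOp, globals())

if __name__ == "__main__":
    run_tests()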
Technical Debt of Using Device Specific APIs
Since the basic APIs (as seen in the partial list from the earlier section) are not device agnostic, most of the new functionality that has been added very rapidly in recent years also becomes device dependent. This forces a non-CUDA device to add device dependencies using conditional statements or to write device-agnostic wrappers. Both approaches are time consuming. As stated in previous sections, to verify the functionality and support of non-native devices, the UTs also need to be similarly adapted, as in the following excerpt.
# Excerpt from a distributed UT: CUDA-specific device checks and setup
if device == "cuda":
    if torch.cuda.device_count() < self.world_size:
        self.skipTest("Not enough CUDA devices")
    torch.cuda.set_device(dist.get_rank())
tensor = torch.ones([4], device=device)
mesh = dt.DeviceMesh(device, torch.arange(4))
res = ft_c.all_reduce(tensor, "sum", mesh)
self.assertEqual(res, torch.tensor([4, 4, 4, 4], dtype=torch.float))
mesh = dt.DeviceMesh(device, torch.arange(4).view(2, 2))
res2 = ft_c.all_reduce(tensor, "sum", (mesh, 1))
self.assertEqual(res2, torch.tensor([2, 2, 2, 2], dtype=torch.float))
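A sketch of the device-specific part of the same fragment written device-agnostically, assuming a PyTorch build that already provides the abstracted helpers discussed later in this article:

# Works for cuda, xpu, hpu, ... without naming any backend explicitly
device_mod = torch.get_device_module(device)       # e.g. torch.cuda or torch.xpu
if device_mod.device_count() < self.world_size:
    self.skipTest(f"Not enough {device} devices")
device_mod.set_device(dist.get_rank())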
Device Abstraction Initiatives
Some effort has been made to introduce device-agnostic APIs, as seen in this PR for Stream and Event: https://github.com/PyTorch/PyTorch/pull/123611. Some generalization is also being done as part of the introduction of the MTIA device: https://github.com/pytorch/pytorch/pull/123612
We need to expedite such efforts to reap the benefits of community-wide adoption. In addition to device-specific abstraction, there are domain- or feature-specific APIs that need to be abstracted. Some of these issues are highlighted in Table 3; a compile-backend sketch for the Dynamo row follows the table.
Domains | Areas |
Operators/Kernels | Mechanisms to run out-of-tree devices, skip unsupported dtypes, and skip unsupported ops. |
Distributed | Abstraction of process group creation and deletion, abstraction of common APIs such as device_count, and modification of new classes that take CUDA as the default device. The basic init_process_group() API warrants specifying a device-specific backend name such as "nccl". |
Dynamo | Device abstraction, abstraction of backend compiler addition, out-of-tree compiler addition, and working around devices that do not have Inductor support. For example, the default backend could be chosen based on device capabilities or the preferred backend for the device. APIs such as torch.compile() currently default to Inductor. |
Profiler | Abstraction for adding a custom or out-of-tree profiler without explicitly specifying device names. |
General Infrastructure improvements | Removing device references from code, harmonizing APIs to be device agnostic, and adding infrastructure to facilitate adding new device types with minimal effort. |
Table 3 Scope of Abstraction
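As one example for the Dynamo row above, a sketch of per-device backend selection for torch.compile instead of hard-defaulting to "inductor"; the "hpu_backend" string follows Intel Gaudi documentation, and the mapping itself is an assumption:

import torch

def compile_for(model, device_type: str):
    # Pick a compile backend based on the device instead of assuming Inductor
    backend = {"cuda": "inductor", "xpu": "inductor", "hpu": "hpu_backend"}.get(
        device_type, "inductor")
    return torch.compile(model, backend=backend)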
Initiatives and Contribution by Intel
Intel has submitted an RFC (Request for Comments) highlighting the device-name dependency and inconsistency in the PyTorch frontend code, particularly the use of device-specific APIs in the PyTorch Framework Unit Tests, Dynamo, and distributed use cases: https://github.com/pytorch/rfcs/pull/66
Table 4 highlights a partial list of changes introduced.
Area | Changes |
Distributed | Harmonizing APIs to be device agnostic for the creation and deletion of process groups within the UTs. Introducing several PyTorch frontend APIs that abstract out the device details. Added the capability to run the UTs on non-CUDA devices (see the sketch after Table 4). E.g. torch.distributed.get_default_backend_for_device(), torch.get_device_module(device).device_count() |
Dynamo | Added the capability to run the UTs on non-CUDA devices. |
Profiler | Added the capability to run PyTorch out-of-tree profilers. |
Common Infrastructure | Added a mechanism for non-native devices to run the UTs. |
Operators | Facility to add non-native devices to run operator UTs. |
Table 4 Changes introduced for Device Abstraction
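A sketch of device-agnostic process-group setup using the distributed APIs listed in Table 4 (assuming a PyTorch build recent enough to provide them, and a launcher that sets MASTER_ADDR/MASTER_PORT):

import torch
import torch.distributed as dist

def init_distributed(device_type: str, rank: int, world_size: int) -> None:
    backend = dist.get_default_backend_for_device(device_type)  # e.g. "nccl" for cuda
    torch.get_device_module(device_type).set_device(rank)       # no torch.cuda.* reference
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)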
The RFC at https://github.com/pytorch/pytorch/issues/128403 extends the abstraction for in-tree devices by introducing the torch.accelerator interface, which abstracts out direct device references. This change, however, limits the API to in-tree devices. Table 5 highlights a partial list of APIs that will be abstracted out, and a short usage sketch follows the table.
CUDA API | Abstracted API |
torch.cuda.set_device() | |
torch.cuda.is_available() | |
torch.cuda.set_stream() | |
torch.cuda.current_device() | |
Table 5 torch.accelerator Interface
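A minimal usage sketch of the torch.accelerator interface (available only on recent PyTorch builds and, per the RFC, limited to in-tree devices):

import torch

if torch.accelerator.is_available():
    acc = torch.accelerator.current_accelerator()  # e.g. device(type='cuda') or device(type='xpu')
    n = torch.accelerator.device_count()
    torch.accelerator.synchronize()                # device-agnostic analogue of torch.cuda.synchronize()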
Effort needs to be made to publicize these changes so that the abstracted APIs see wider use.
Conclusion
PyTorch is the most popular AI framework; as such, addressing device dependency in the PyTorch frontend code is crucial for enhancing the framework's versatility and usability across various hardware platforms. By transitioning from device-specific APIs to generalized, device-agnostic APIs, we can significantly streamline the development process. The implementation of generic code with proper abstraction allows for seamless integration of new devices with minimal effort. This shift not only simplifies the user experience but also promotes code reusability and maintainability. Care should be taken that such abstraction facilitates both in-tree and out-of-tree devices. Such changes empower developers to write cleaner, more adaptable code that can automatically accommodate different devices, thereby fostering innovation and efficiency in machine learning applications. Ultimately, harmonizing the PyTorch frontend Python APIs to be device agnostic represents a significant step forward in making the framework more accessible and powerful for users across diverse computing environments. This approach not only anticipates future advancements in hardware but also aligns with the growing demand for flexible and scalable machine learning solutions.