
PyTorch Enhancements for Accelerator Abstraction

AnkurNeog
Employee

Introduction

The rapid rise of artificial intelligence (AI) applications, particularly large language models (LLMs), has revolutionized numerous industries. LLMs, which use deep learning techniques to process and generate human-like text, have become integral to applications such as chatbots, sentiment analysis, and automated content creation. This surge in AI capabilities has created an unprecedented demand for computational power, as training and deploying these sophisticated models require significant resources. The exponential growth in compute demand is outpacing available supply, leading to concerns about a potential bottleneck in AI development. Nvidia's CUDA software platform has established a stronghold in the AI acceleration market, commanding a dominant share of the market for AI chips. Competitors face significant hurdles due to Nvidia's established ecosystem and the inertia created by widespread adoption of CUDA-based workflows. The dominance of CUDA in open-source AI frameworks, particularly in PyTorch, stems from its early establishment as a robust platform for GPU computing.

Support for diverse hardware backends has been added to PyTorch over the years; however, native support is restricted to CPU, CUDA, and META devices. Support for other accelerators is achieved through out-of-tree extensions such as Google's TPU/XLA, Intel's Gaudi/HPU, and Huawei's NPU, as well as in-tree additions such as Apple devices with Metal framework support (MPS) and, most recently, Intel GPU (XPU).

There is, however, no generic device framework that abstracts hardware references out of application code, and there is often a need to explicitly specify the accelerator device (such as "cuda" or "mps") or the backend framework for scale-up/scale-out, such as "nccl." This is evident in the coverage of the PyTorch framework unit tests (UTs), the example code, and the READMEs for new features.

For the PyTorch UTs, adaptations for other accelerators are time-consuming. There are two approaches:

  • Fork out the unit tests completely and make adaptations locally.
  • Create device-specific versions of these tests/examples and upstream them.

Both approaches are inefficient: both require constant adaptation and duplication. This article attempts to highlight the areas that lead to a lack of generalization and proposes solutions that help remove these inefficiencies.

Intel AI Accelerators

As of this writing, Intel has two distinct families of AI accelerators for training and inference of AI workloads.

  • Intel Gaudi – Specialized ASIC (device code: hpu)
  • Intel GPU – GP GPU (device code: xpu)

Intel Gaudi is a family of high-performance AI accelerators providing rich software support for PyTorch. It is an out-of-tree PyTorch device, which requires the user to install the software library separately.

Details can be found on the product page at intel.com.

Intel GPU is a family of general-purpose (GP) GPU devices from Intel. Intel GPU is an in-tree PyTorch device, with support available from PyTorch version 2.4.0.

More information on Intel GPU can be found on the official product pages.

PyTorch Device Model

This section provides a high-level overview of the PyTorch device model, which serves as a foundation for understanding device abstraction.

PyTorch allows two ways to integrate an accelerator device in its framework: 

In-Tree: Devices include CPU, CUDA, META, MPS, XPU

Out-Of-Tree: These devices can have their own device name and dispatch key or can use the PrivateUse1 device. Gaudi devices have their own device key, "hpu," and hence do not rely on PrivateUse1.
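
For out-of-tree backends that do take the PrivateUse1 route, the following is a hedged sketch of the registration steps; "foo" and foo_module are illustrative placeholders for a real backend name and its device module.

import torch

# Rename the reserved PrivateUse1 key so tensors can use device="foo",
# then register the backend's device module so torch.foo.* becomes available.
torch.utils.rename_privateuse1_backend("foo")
# torch._register_device_module("foo", foo_module)  # foo_module is provided by the backend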

In-tree accelerators must override core components of the PyTorch device model. These are explained in Table 1.

| Component | Description |
|---|---|
| Device | The device is represented by the torch.device class, which lets users specify where tensors are allocated and computations are performed. Example: hpu = torch.device('hpu:0') |
| Stream | A stream is a sequence of operations executed in order on a specific device. By default, operations run in the default stream, but users can create additional streams to overlap computation and data transfer, thus improving performance. Example: stm = torch.xpu.Stream() # create a new XPU stream |
| Event | Events track the status of an operation being executed, for example on a stream. |
| Guard | A guard manages the device context, usually required during tensor operations and op dispatching; devices need to override the c10::impl::DeviceGuardImplInterface interfaces. |
| Generator | The generator provides the interface and infrastructure for random number generation, manual seeding, etc., to ensure consistency of random numbers across devices. Example: torch.xpu.manual_seed() |
| Allocator | The allocator provides the PyTorch interface for memory allocation/deallocation and hooks for device-specific optimizations. Example: torch.xpu.empty_cache() |

Table 1: PyTorch Device Model Components
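
The following is a minimal sketch tying together the device, stream, and event components from Table 1, assuming an Intel GPU (XPU) build in which torch.xpu mirrors the familiar torch.cuda stream/event API.

import torch

device = torch.device("xpu:0")
stream = torch.xpu.Stream()                 # a new, non-default stream
start = torch.xpu.Event(enable_timing=True)
end = torch.xpu.Event(enable_timing=True)

with torch.xpu.stream(stream):              # queue work on the new stream
    start.record()
    x = torch.randn(1024, 1024, device=device)
    y = x @ x
    end.record()

torch.xpu.synchronize()                     # block until the stream finishes
print(f"matmul took {start.elapsed_time(end):.2f} ms")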

These overrides ensure that frontend Python APIs such as torch.cuda.device_count() or torch.cuda.stream() can also be extended to a new accelerator, e.g., torch.hpu.device_count(). The PyTorch Python frontend (e.g., torch.hpu.*) binds to the corresponding C++ libraries and API calls through libtorch_python.so, the PyTorch C++ frontend. The actual implementation is housed in the torch/c10 libraries (Figure 1).

Figure 1: PyTorch Binding Layers

These adaptations necessitate writing a lot of boilerplate code in addition to writing equivalent frontend APIs such as torch.my_device.is_available() (CUDA equivalent: torch.cuda.is_available()). Hence, even though an in-tree implementation has the APIs available in-tree, users still need to do some monkey patching for existing model code or for high-level features such as FSDP.

 

torch.cuda.is_available()  # needs to be replaced by torch.<my_device>.is_available()
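
The following is a hedged sketch of how such monkey patching might look in practice: existing model code that calls torch.cuda.* is redirected to another in-tree backend (XPU is used here purely as an illustration) without editing the model source.

import torch

# Redirect a few torch.cuda.* entry points to their torch.xpu.* equivalents.
if hasattr(torch, "xpu") and torch.xpu.is_available():
    torch.cuda.is_available = torch.xpu.is_available
    torch.cuda.device_count = torch.xpu.device_count
    torch.cuda.current_device = torch.xpu.current_device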

 

An in-tree implementation gives us the facility to integrate the device into the PyTorch unit test framework. However, due to the lack of device abstraction in the current code, the in-tree devices other than the native ones (CPU/CUDA) currently seem to use their own versions of the UTs rather than extending the existing ones (e.g., MPS).

For an out-of-tree device implementation with its own device key, such as Intel Gaudi (HPU), the changes needed in the actual PyTorch code base are minimal and involve, among other things, adding an entry in the DeviceTypes and TensorOptions and registering handlers for operator dispatch with its own dispatch key. All the device-specific libraries can be housed in a private code base, and binding of the C++ libraries to Python happens at runtime.

However, the out-of-tree approach presents its own challenges when it comes to code reusability and refactoring from CUDA. We need to first install the plugin for the out-of-tree device, import the module in our code, and then replace the APIs with device-specific APIs that cannot be accessed unless the library is loaded.

 

import habana_frameworks.torch as ht_torch
ht_torch.hpu.is_available()

 

Beginning with PyTorch 2.5.0 and the introduction of the out-of-tree extension autoloading feature, these imports are no longer needed. The feature uses Python's entry-points mechanism, with torch/__init__.py discovering and loading all registered entry points.

 

# import habana_frameworks.torch as ht_torch is no longer needed
torch.hpu.is_available()

 

The user experience is slightly improved for out-of-tree devices using this feature, eliminating the need to explicitly add the library import (e.g., import habana_frameworks.torch as ht_torch). A sketch of how such an extension might declare its entry point is shown below.
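
The following is a hedged sketch of how an out-of-tree extension might declare such an entry point in its setup.py; the package, module, and function names are illustrative, and the "torch.backends" entry-point group name is taken from the PyTorch autoloading documentation.

from setuptools import setup

setup(
    name="torch_foo",
    version="1.0",
    packages=["torch_foo"],
    entry_points={
        "torch.backends": [
            # torch/__init__.py discovers this entry point and calls torch_foo._autoload()
            "torch_foo = torch_foo:_autoload",
        ],
    },
)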

Without the device-related code available in-tree, we cannot directly add these devices to the PyTorch feature code (such as FSDP/DTensor/Profiler) or the framework unit test code without ensuring proper checks for library availability. These checks and wrappers can quickly make the code very messy, for example:
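
The following is a hedged sketch of the kind of availability checks described above; the module name follows the Gaudi example used earlier, and the fallback chain is purely illustrative.

import torch

# Guard the out-of-tree import so the code still runs where the library is absent.
try:
    import habana_frameworks.torch as ht_torch
    HPU_AVAILABLE = ht_torch.hpu.is_available()
except ImportError:
    HPU_AVAILABLE = False

if HPU_AVAILABLE:
    device = "hpu"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"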

Abstracting Device Access

From the previous sections, it is noted that although some streamlining is available for hooking new devices into PyTorch, device references in the frontend Python code prevent the framework from being truly platform independent. A partial list of such frontend APIs and decorators is given in Table 2.

| API | Access in PyTorch feature validation |
|---|---|
| torch.cuda.device_count() | For retrieving the number of devices/GPUs, needed for finding world_size for distributed training/inference. Used extensively in distributed use cases. |
| torch.cuda.set_device() | For setting the device/GPU ID for subsequent operations. Used extensively in distributed use cases. |
| torch.cuda.get_device_name() | For retrieving the device name from a device ID. Used extensively in distributed use cases. |
| torch.cuda.is_available() | For checking whether the CUDA runtime, driver, and hardware are accessible. This is the de facto API to check the presence of accelerators in the system. |
| torch.cuda.current_device() | For retrieving the index of the currently active device. Used extensively in distributed use cases. |
| torch.cuda.synchronize() | For blocking until queued work completes. Used extensively in UTs, which are currently skipped on other devices. |
| skipIfCuda | For skipping a specific use case that does not work with CUDA. |
| dtypesIfCuda | For selectively picking dtypes for CUDA. |
| onlyCuda | For restricting a test to CUDA only, even if support is available for another accelerator. |
| torch.cuda._sleep() | For introducing delays. Multiple uses seen in the UTs. |
| torch.compile | For JIT compilation of code fragments, leveraging the Dynamo infrastructure to optimize code and kernels. This currently defaults to the "inductor" backend, which some devices such as Intel Gaudi do not support. De facto API for PyTorch compiled-mode execution. |

Table 2: Device-Specific APIs

Technical Debt of Using Device Specific APIs

Since the basic APIs (as seen in the partial list in the previous section) are not device agnostic, most of the new functionality added rapidly in recent years also becomes device dependent. This means that a non-CUDA device must handle the device dependency by using conditional statements or by writing device-agnostic wrappers. Both approaches are time-consuming. As stated in previous sections, to verify the functionality and support of non-native devices, the UTs also need to be similarly adapted, as in the following example:

 

if device == "cuda":
    if torch.cuda.device_count() < self.world_size:
        self.skipTest("Not enough CUDA devices")
    torch.cuda.set_device(dist.get_rank())
    tensor = torch.ones([4], device=device)
    mesh = dt.DeviceMesh(device, torch.arange(4))
    res = ft_c.all_reduce(tensor, "sum", mesh)
    self.assertEqual(res, torch.tensor([4, 4, 4, 4], dtype=torch.float))
    mesh = dt.DeviceMesh(device, torch.arange(4).view(2, 2))
    res2 = ft_c.all_reduce(tensor, "sum", (mesh, 1))
    self.assertEqual(res2, torch.tensor([2, 2, 2, 2], dtype=torch.float))

 

Device Abstraction Initiatives

Some effort has been made to introduce device-agnostic APIs, as seen in this PR for Stream and Event: https://github.com/PyTorch/PyTorch/pull/123611. Some generalization is also being done as part of the introduction of the MTIA device: https://github.com/pytorch/pytorch/pull/123612

We need to expedite such efforts to reap the benefits of community-wide adoption. In addition to device-specific abstraction, there are domain- or feature-specific APIs that need to be abstracted. Some of these issues are highlighted in Table 3.

| Domains | Areas |
|---|---|
| Operators/Kernels | Mechanisms to run out-of-tree devices, skip unsupported dtypes, skip unsupported ops. |
| Distributed | Abstraction of process group creation and deletion, abstraction of common APIs such as device_count, modifying new classes that take CUDA as the default device. The basic API init_process_group() warrants adding a device-specific backend name such as "nccl". |
| Dynamo | Device abstraction, abstraction of backend compiler addition, out-of-tree compiler addition, working around devices that do not have Inductor support. For example, a default backend could be chosen by checking the device capabilities or the preferred backend for the device. APIs such as torch.compile() currently default to Inductor. |
| Profiler | Abstraction of adding a custom profiler and adding an out-of-tree profiler, without explicitly specifying the device names. |
| General infrastructure improvements | Removing device references from code, harmonizing APIs to be device agnostic, adding infrastructure to facilitate adding new device types with minimal effort. |

Table 3: Scope of Abstraction
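
As a concrete illustration of the Dynamo row above, here is a hedged sketch of selecting a torch.compile backend based on the device instead of always defaulting to "inductor"; the selection logic is illustrative, and a real implementation would query device capabilities or a registered preferred backend.

import torch

# Pick a compile backend suited to the available device; "aot_eager" is a
# built-in fallback backend that does not require Inductor support.
backend = "inductor" if torch.cuda.is_available() else "aot_eager"

@torch.compile(backend=backend)
def scaled_add(x, y):
    return 2 * x + y

print(scaled_add(torch.ones(4), torch.ones(4)))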

Initiatives and Contribution by Intel

Intel has submitted an RFC (Request for Comments) highlighting the device name dependency and inconsistency in the PyTorch frontend code, particularly the use of device-specific APIs in the PyTorch framework unit tests, Dynamo, and distributed use cases: https://github.com/pytorch/rfcs/pull/66

Table 4 highlights a partial list of changes introduced.

| Area | Changes |
|---|---|
| Distributed | Harmonizing APIs to be device agnostic for creation and deletion of process groups within the UTs. Introducing several PyTorch frontend APIs that abstract out the device details. Added capability to run the UTs on non-CUDA devices. E.g., torch.distributed.get_default_backend_for_device(), torch.get_device_module(device).device_count() |
| Dynamo | Added capability to run the UTs on non-CUDA devices. |
| Profiler | Added capability to run PyTorch out-of-tree profilers. |
| Common infrastructure | Added a mechanism for non-native devices to run the UTs. |
| Operators | Facility to add non-native devices to run operator UTs. |

Table 4: Changes Introduced for Device Abstraction
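
To make the distributed row in Table 4 concrete, here is a minimal, hedged sketch of the device-agnostic helpers it lists, assuming a PyTorch build that already contains these additions; no device name such as "cuda" or backend name such as "nccl" is hard-coded.

import torch
import torch.distributed as dist

# Device-agnostic lookup of backend and device count; in real test code the
# device type would come from the test framework rather than being a literal.
device_type = "cuda"  # illustrative; could equally be "xpu" or "hpu"
backend = dist.get_default_backend_for_device(device_type)        # e.g. "nccl" for CUDA
world_size = torch.get_device_module(device_type).device_count()  # e.g. torch.cuda.device_count()
print(f"device={device_type}, backend={backend}, world_size={world_size}")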

The RFC at https://github.com/pytorch/pytorch/issues/128403 extends the abstraction for in-tree devices by introducing the torch.accelerator interface, which abstracts out direct device references. This change, however, limits the use of the API to in-tree devices. Table 5 highlights a partial list of the APIs that will be abstracted.

| CUDA API | Abstracted API |
|---|---|
| torch.cuda.set_device() | torch.accelerator.set_device_index() |
| torch.cuda.is_available() | torch.accelerator.is_available() |
| torch.cuda.set_stream() | torch.accelerator.set_stream() |
| torch.cuda.current_device() | torch.accelerator.current_device_index() |

Table 5: torch.accelerator Interface
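
The following is a minimal, hedged sketch of what device-agnostic code could look like with the torch.accelerator interface from Table 5, assuming a PyTorch version that ships it; torch.accelerator.current_accelerator() is part of the same proposed interface but is not listed in the table above.

import torch

# Query and select the in-tree accelerator without naming "cuda", "xpu", etc.
if torch.accelerator.is_available():
    torch.accelerator.set_device_index(0)            # instead of torch.cuda.set_device(0)
    idx = torch.accelerator.current_device_index()   # instead of torch.cuda.current_device()
    acc = torch.accelerator.current_accelerator()    # e.g. device(type='cuda') or device(type='xpu')
    x = torch.ones(4, device=acc)
    print(f"running on {acc}, device index {idx}")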

Effort needs to be made to publicize these changes so that the abstracted APIs see wider use.

Conclusion

PyTorch is the most popular AI framework; as such, addressing device dependency in the PyTorch frontend code is crucial for enhancing the framework's versatility and usability across various hardware platforms. By transitioning from device-specific APIs to generalized, device-agnostic APIs, we can significantly streamline the development process. The implementation of generic code with proper abstraction allows for seamless integration of new devices with minimal effort. This shift not only simplifies the user experience but also promotes code reusability and maintainability. Care should be taken that such abstraction facilitates both in-tree and out-of-tree devices. Such changes empower developers to write cleaner, more adaptable code that can automatically accommodate different devices, thereby fostering innovation and efficiency in machine learning applications. Ultimately, harmonizing the PyTorch frontend Python APIs to be device-agnostic represents a significant step forward in making the framework more accessible and powerful for users across diverse computing environments. This approach not only anticipates future advancements in hardware but also aligns with the growing demand for flexible and scalable machine learning solutions.

1 Comment
Hicap
Beginner

 

Where, when, how PyTorch can go. I think...

Transitioning to device-agnostic APIs in PyTorch will improve flexibility and simplify development. Developers can integrate new hardware with a single line of code, streamlining the process. This approach ensures PyTorch works seamlessly across diverse platforms, from GPUs to TPUs. It reduces complexity, making the codebase cleaner, more reusable, and easier to maintain. Device-agnostic APIs promote scalability, allowing PyTorch to quickly adopt new hardware advancements. This method encourages faster integration of emerging technologies like quantum or custom accelerators. It fosters innovation by making it easier to experiment with different hardware without major code changes. With this shift, PyTorch will stay relevant in a fast-evolving hardware landscape. Ultimately, this change ensures PyTorch remains adaptable, scalable, and powerful in future machine learning applications.