Kai Yuan is an AI/ML Research Scientist at Intel Labs whose research spans Generative AI, Content Creation, Embodied AI, and Immersive Experiences.
Highlights:
- Intel is proud to present the first SYCL implementation of fully-fused Multi-Layer Perceptrons, targeting Intel GPUs that support Intel Xe Matrix Extensions (XMX) instructions, along with an open-sourced repository of the implementation.
- The implementation boasts numerous features, including high-performance computing, compatibility with PyTorch, versatile neural network structures, multi-resolution hash encoding, and cross-platform utilization.
- Results show that the implementation outperforms the off-the-shelf Intel Extension for PyTorch (IPEX) implementation on the same Intel GPU by up to a factor of 30 and the CUDA PyTorch version on Nvidia’s H100 GPU by up to a factor of 19.
Multi-Layer Perceptrons (MLPs) serve as the main neural network architecture for many of today’s Machine Learning (ML) applications, such as representing the solution operator of partial differential equations, modeling the density or color function in Neural Radiance Fields (NeRFs), and replacing classical ray tracing with Neural Ray Tracing. MLPs are characterized by their fully connected layers, in which every neuron in a layer is connected to every neuron in the previous and subsequent layers. A key property of MLPs is that each neuron’s output is independent of its neighbors in the same layer, which makes them well suited to fully-fused implementations.
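For readers less familiar with the architecture, the snippet below sketches such a narrow, fully connected MLP in plain PyTorch. It is purely illustrative: the widths (64 neurons, matching the networks benchmarked later) and layer count are arbitrary choices, and this unfused version is exactly what the fully-fused approach avoids, since every layer is a separate matrix multiply whose activations round-trip through global memory.

```python
import torch
import torch.nn as nn

# A plain (non-fused) MLP of the kind discussed here: a stack of fully
# connected layers in which every neuron is connected to all neurons of the
# neighboring layers. Widths are illustrative (64, as in the benchmarks
# below); each layer is a separate GEMM plus activation, so intermediate
# activations travel through global memory between layers.
class PlainMLP(nn.Module):
    def __init__(self, in_features=64, hidden=64, out_features=64, n_hidden=4):
        super().__init__()
        layers = [nn.Linear(in_features, hidden), nn.ReLU()]
        for _ in range(n_hidden - 1):
            layers += [nn.Linear(hidden, hidden), nn.ReLU()]
        layers += [nn.Linear(hidden, out_features)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = PlainMLP()
out = model(torch.randn(128, 64))  # batch of 128 input vectors
```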
Intel is proud to present the first SYCL implementation of fully-fused MLPs, targeting Intel GPUs that support Intel Xe Matrix Extensions (XMX) instructions, along with an open-sourced repository of the implementation. This implementation minimizes slow global memory accesses by maximizing data reuse within the general register file and the shared local memory, fusing the operations in each layer of the MLP. We show with a roofline model that this results in a significant increase in arithmetic intensity, leading to improved performance, especially for inference. The paper also showcases the efficiency of our SYCL implementation in three significant areas: Image Compression, Neural Radiance Fields, and Physics-Informed Machine Learning.
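To build intuition for why fusion raises arithmetic intensity, the following back-of-the-envelope sketch compares FLOPs per byte for an unfused layer (inputs, weights, and outputs all move through global memory) with a fused layer (activations stay in registers and SLM, so essentially only the weights are loaded). The formulas and numbers are illustrative simplifications for intuition only, not the roofline model from the paper.

```python
# Illustrative roofline-style estimate (half precision, 2 bytes per value).
# A simplified model for intuition only, not the paper's analysis.
def arithmetic_intensity(batch, width, fused, bytes_per_el=2):
    flops = 2 * batch * width * width               # one GEMM per layer
    weight_bytes = width * width * bytes_per_el
    act_bytes = 2 * batch * width * bytes_per_el    # inputs + outputs
    # Fused: activations stay in registers/SLM, only weights hit global memory.
    bytes_moved = weight_bytes if fused else weight_bytes + act_bytes
    return flops / bytes_moved

print(arithmetic_intensity(2**17, 64, fused=False))  # ~32 FLOPs/byte
print(arithmetic_intensity(2**17, 64, fused=True))   # orders of magnitude higher
```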
Features
Our implementation boasts numerous features:
- High-performance computing: the system is optimized to run efficiently on Intel Data Center GPUs, enabling high-throughput training and inference.
- PyTorch integration: Python bindings integrate seamlessly with the PyTorch ecosystem, letting users include GPU-accelerated MLPs in PyTorch applications (a usage sketch follows below).
- Versatile network structures: support for multiple hidden layers and a variety of neuron configurations to fit different use cases and performance requirements.
- Multi-Resolution Hash Encoding: allows the network to handle high-frequency features effectively.
- Cross-platform utilization: designed to run on various Intel GPUs, maximizing the portability and usability of the framework across different systems.
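As a hypothetical illustration of how the Python bindings might look from PyTorch, the sketch below assumes a package name (tiny_dpcpp_nn), a Network class, and configuration keys modelled on similar fully-fused MLP libraries; the real names are defined by the open-sourced repository, so treat this only as a rough sketch of the intended workflow.

```python
import torch
import intel_extension_for_pytorch  # registers the "xpu" device with PyTorch
# Hypothetical import: the actual Python package name exposed by tiny-dpcpp-nn may differ.
import tiny_dpcpp_nn as tnn

# Hypothetical configuration, loosely modelled on similar fully-fused MLP libraries;
# consult the open-sourced repository for the real classes and configuration keys.
network = tnn.Network(
    n_input_dims=64,
    n_output_dims=64,
    network_config={
        "otype": "FullyFusedMLP",
        "activation": "ReLU",
        "output_activation": "None",
        "n_neurons": 64,
        "n_hidden_layers": 4,
    },
)

x = torch.randn(128, 64, device="xpu", dtype=torch.float16, requires_grad=True)
y = network(x)            # behaves like a regular torch.nn.Module
loss = y.square().mean()
loss.backward()           # gradients flow through the fused SYCL kernels
```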
Performance
Our fully-fused MLP implementation increases the performance of several commonly used AI tasks. To demonstrate these performance benefits, we compared our SYCL implementation on an Intel Data Center GPU Max 1550 with the CUDA implementation on an Nvidia H100 GPU, and with PyTorch using both the Intel Extension for PyTorch (IPEX) backend and the CUDA backend.
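For context, the IPEX baseline in such a comparison can be timed along the lines of the sketch below. This is a generic inference timing loop for an off-the-shelf PyTorch MLP on an Intel GPU, assuming the IPEX "xpu" backend; it is not the benchmark harness behind the reported numbers.

```python
import time
import torch
import torch.nn as nn
import intel_extension_for_pytorch  # enables the "xpu" device and torch.xpu utilities

# Off-the-shelf baseline: a plain PyTorch MLP of width 64 run on an Intel GPU
# via IPEX. Generic timing loop for illustration only.
model = nn.Sequential(
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64),
).to("xpu").half()

x = torch.randn(2**17, 64, device="xpu", dtype=torch.float16)

with torch.no_grad():
    for _ in range(10):                  # warm-up iterations
        model(x)
    torch.xpu.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        model(x)
    torch.xpu.synchronize()
    print(f"inference: {(time.perf_counter() - start) / 100 * 1e3:.3f} ms/iter")
```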
In our tests, the implementation outperforms an equivalent CUDA implementation for MLPs of width 64 by a factor of up to 2.84 in inference and 1.75 in training, demonstrating the effectiveness of our approach. It also outperforms the PyTorch implementation by up to a factor of 30.
We further showcased the efficiency of our implementation in three significant areas: Image Compression, Neural Radiance Fields (NeRF), and Physics-Informed Machine Learning. Across all these domains, our approach delivered substantial improvements, with speed-ups of up to 30x over conventional PyTorch implementations and up to 2.84x over highly optimized CUDA implementations.
Looking to the Future
In the future, we aim to further optimize our implementation, with a strong focus on more efficient usage of registers to reduce stalls. In addition, we may be able to reduce the utilization of shared local memory (SLM) and enable loading multiple weight matrices into SLM, which would reduce the number of necessary barriers. Other areas of focus will be increasing occupancy for small batch sizes and optimizing the fusion of the final matrix products into the backward pass.
In addition to further performance optimization, we also plan to explore the use of Intel’s ESIMD SYCL extension for our implementation and to generalize our library to various data types and larger network widths.
To allow for wider usage and contributions from the community, our implementation is open-sourced and available at https://github.com/intel/tiny-dpcpp-nn.