
Learn LLM Optimization Using Transformers and PyTorch* on Intel Hardware

Adam_Wolf

In the fast-evolving landscape of artificial intelligence, optimizing the performance of deep learning models is critical for both efficiency and scalability. Intel has been at the forefront of developing tools and frameworks that enhance the execution speed and memory efficiency of AI models, notably Intel® Extension for Transformers* and Intel® Extension for PyTorch*. Be sure to watch the full webinar discussing these optimizations here: Learn LLM Optimization Using Transformers and PyTorch* on CPUs & GPUs.

 


Understanding the AI Stack


 

The AI stack comprises multiple layers, each playing a crucial role in optimizing LLMs. At the foundational level is the hardware layer, which includes Intel® Gaudi® AI accelerators, Intel® Data Center GPUs, Intel® Arc™ GPUs, and Intel® Xeon® CPUs. Above this layer sit acceleration libraries such as Intel® oneAPI Deep Neural Network Library (oneDNN) and Intel® oneAPI Collective Communications Library (oneCCL), which provide kernels optimized for Intel instruction sets to ensure efficient computation. The topmost layer consists of optimized frameworks like PyTorch*, which integrate with the underlying hardware and libraries to streamline model performance and ensure efficient utilization of resources.

Key Optimization Techniques

 


 

Operator Optimizations are fundamental to enhancing the performance of LLMs. Intel replaces default operation kernels with highly optimized Intel oneDNN kernels that leverage advanced instruction sets like Intel® Advanced Vector Extensions (Intel® AVX), Intel® Advanced Matrix Extensions (Intel® AMX), and Intel® Xe Matrix Extensions (Intel® XMX). This optimization is designed to be precision-flexible, supporting a range of data types from FP32 down to INT4, ensuring that applications can run at the best combination of speed and accuracy.
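As a rough illustration, the sketch below shows how these operator optimizations are typically enabled through Intel Extension for PyTorch on a CPU; the model name is a placeholder, and exact behavior depends on your installed versions.

```python
# A minimal sketch of applying operator optimizations with Intel Extension for
# PyTorch on a CPU; "gpt2" is an illustrative placeholder model.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder; substitute your own model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# Swap default kernels for oneDNN-backed kernels that use Intel AVX/AMX paths.
model = ipex.optimize(model, dtype=torch.bfloat16)

inputs = tokenizer("Operator optimizations speed up inference", return_tensors="pt")
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```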

Graph Optimizations further improve performance by reducing the number of memory accesses required during computation. For instance, fusing bandwidth-limited operations such as activation functions (e.g., ReLU or Tanh) with adjacent layers (e.g., Conv+ReLU+Sum) minimizes memory access times. This approach is particularly beneficial for models like ResNet-50, where a significant portion of time is spent on bandwidth-limited operations. In the context of LLMs, specific fusion techniques such as multi-head attention fusion and linear post-ops fusion are applied through Intel® Extension for PyTorch* in JIT/TorchScript mode.
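One common way to enable these graph-level fusions is to combine ipex.optimize() with TorchScript tracing and freezing, as in the sketch below; the small Conv+ReLU module is purely illustrative.

```python
# A minimal sketch of graph-mode fusion: ipex.optimize() plus TorchScript
# tracing and freezing lets bandwidth-limited ops (e.g., Conv + ReLU) be fused.
import torch
import torch.nn as nn
import intel_extension_for_pytorch as ipex

class ConvBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.conv(x))  # Conv+ReLU: a typical fusion candidate

model = ConvBlock().eval()
model = ipex.optimize(model)

example = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    traced = torch.jit.trace(model, example)
    traced = torch.jit.freeze(traced)  # freezing enables oneDNN graph fusions
    result = traced(example)
```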

Memory Management is crucial for optimizing the performance of LLMs, which are often memory-intensive. The Segment KV Cache technique optimizes memory usage by pre-filling key/value pairs before the autoregressive decoding begins and using pre-allocated buffers during the decoding phase. This method reduces the need for real-time memory adjustments, thereby improving efficiency. Similarly, the Indirect Access KV Cache uses pre-allocated buffers and beam index history to manage memory effectively, reducing the overhead associated with memory access during inference.
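As a hedged sketch, recent releases of Intel Extension for PyTorch expose an LLM-specific entry point, ipex.llm.optimize(), which enables these KV cache optimizations automatically for supported model architectures; the model name below is a placeholder.

```python
# A minimal sketch, assuming a recent Intel Extension for PyTorch release that
# provides ipex.llm.optimize(), which applies LLM-focused fusions and optimized
# KV-cache handling (e.g., indirect access KV cache) for generation workloads.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.bfloat16)  # placeholder model
model.eval()

# Builds on the generic ipex.optimize() path with LLM-specific optimizations.
model = ipex.llm.optimize(model, dtype=torch.bfloat16)
```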

Model Compression involves techniques such as quantization, which reduces the precision of weights and activations from FP32 to lower-precision formats such as INT8 or INT4. This reduction decreases memory bandwidth requirements, improves inference speed, and shrinks model size. SmoothQuant is a post-training quantization method that migrates quantization difficulty from activations to weights, smoothing out activation outliers and ensuring efficient hardware utilization while preserving model accuracy.
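The idea behind SmoothQuant can be sketched in a few lines of NumPy (this is conceptual, not the library API): a per-channel scale moves outliers from the activations into the weights while leaving the matrix product unchanged, which makes both tensors easier to quantize.

```python
# A conceptual sketch of the SmoothQuant idea: a per-channel scale s migrates
# quantization difficulty from activations to weights, since
# X @ W == (X / s) @ (s[:, None] * W). All values here are synthetic.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
X[:, -1] *= 50.0                       # simulate an activation outlier channel
W = rng.normal(size=(8, 16))

alpha = 0.5                            # migration strength
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_smooth = X / s                       # activation outliers are flattened
W_smooth = s[:, None] * W              # scales folded into the weights offline

assert np.allclose(X @ W, X_smooth @ W_smooth)  # the product is unchanged
```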

Custom Operators also play a significant role in optimization. Weight-only quantization quantizes the model's weights while keeping input and output activations at higher precision. This method relies on customized GEMM (General Matrix Multiply) kernels tuned for weight-only quantization, improving computational efficiency without significantly impacting accuracy. Explicit SIMD (ESIMD) extensions allow fine-grained control over hardware features, further optimizing performance.

Optimizations for Intel Hardware

Intel Extension for PyTorch provides APIs for applying these optimizations to both CPU- and GPU-based training and inference. By utilizing these APIs, you can ensure that your models are optimized to run efficiently on Intel hardware. The extension includes scripts and environment setups designed to maximize hardware utilization, making it easier for developers to implement these optimizations.
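For Intel GPUs, the same optimize() call is used with the model and data placed on the "xpu" device. The sketch below assumes a GPU (XPU) build of Intel Extension for PyTorch is installed and uses a torchvision ResNet-50 purely as an example workload.

```python
# A minimal sketch of ipex.optimize() on an Intel GPU, assuming a GPU (XPU)
# build of Intel Extension for PyTorch; ResNet-50 is an illustrative workload.
import torch
import intel_extension_for_pytorch as ipex
import torchvision.models as models

model = models.resnet50(weights=None).eval().to("xpu")
data = torch.randn(1, 3, 224, 224).to("xpu")

model = ipex.optimize(model, dtype=torch.float16)
with torch.no_grad(), torch.xpu.amp.autocast(dtype=torch.float16):
    output = model(data)
```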


 

Intel® Gaudi® AI accelerators are another key component of Intel's optimization strategy. They integrate with PyTorch through the Intel® Gaudi® software suite, which efficiently maps neural network topologies onto Gaudi hardware. This integration supports key optimizations and kernel libraries, enhancing the performance of deep learning applications.
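A minimal, hedged sketch of dispatching PyTorch work to a Gaudi device is shown below; it assumes the Intel Gaudi software suite (the habana_frameworks package) is installed and uses a trivial linear layer as the workload.

```python
# A minimal sketch of running a PyTorch workload on a Gaudi accelerator ("hpu"),
# assuming the Intel Gaudi software suite (habana_frameworks) is installed.
import torch
import habana_frameworks.torch.core as htcore

device = torch.device("hpu")
model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(8, 1024).to(device)

y = model(x)
htcore.mark_step()  # flushes the accumulated graph for execution on the device
```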

Intel® Extension for Transformers


 

Another key component is Intel Extension for Transformers, which enhances the Hugging Face* Transformers library by integrating hardware-specific optimizations and adding new functionalities. This extension supports model compression techniques such as SmoothQuant, weight-only quantization, and QLoRA (Quantized Low-Rank Adaptation) fine-tuning. It also introduces Neural Chat, a framework for developing and deploying customizable chatbots with minimal code changes.

Neural Chat enables the integration of various plugins for commonly used pipelines, such as retrieval-augmented generation (RAG) and audio processing. It simplifies the deployment of optimized chatbots by incorporating the necessary optimizations directly into the pipeline configuration.
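A minimal sketch of Neural Chat's high-level API is shown below, assuming Intel Extension for Transformers is installed with the Neural Chat dependencies; the default configuration downloads a reference chat model.

```python
# A minimal sketch of building a chatbot with Neural Chat; plugins such as
# retrieval (RAG) or audio processing can be enabled through its configuration.
from intel_extension_for_transformers.neural_chat import build_chatbot

chatbot = build_chatbot()  # uses the library's default chatbot configuration
response = chatbot.predict("Tell me about Intel Xeon Scalable processors.")
print(response)
```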

Neural Speed and Distributed Inference

Neural Speed, a dedicated library introduced by Intel, streamlines inference of LLMs on Intel platforms. Inspired by projects such as llama.cpp, Neural Speed enables efficient inference through state-of-the-art quantization algorithms. By loading models in 4-bit or 8-bit precision by default, it improves both speed and memory efficiency, making it suitable for diverse AI applications.
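As a sketch, loading a model through the extension's drop-in AutoModel class with load_in_4bit=True routes inference through Neural Speed's quantized kernels; the model repository named below is only an example.

```python
# A minimal sketch of 4-bit inference via Intel Extension for Transformers with
# Neural Speed as the backend; the model name is an illustrative example.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_id = "Intel/neural-chat-7b-v3-1"  # example repo; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)

input_ids = tokenizer("Once upon a time, a language model", return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output_ids[0]))
```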


 

Moreover, Intel's support for distributed inference using DeepSpeed extends these optimizations across multiple nodes or GPUs. Intel® Extension for DeepSpeed* brings Intel GPU support to DeepSpeed*. It comes with the following components:

  1. DeepSpeed Accelerator Interface implementation
  2. DeepSpeed op builder implementation for XPU 
  3. DeepSpeed op builder kernel code

This Intel-optimized extension leverages oneCCL to distribute computation tasks efficiently, reducing memory footprint and improving overall throughput. This capability is crucial for scaling AI applications across heterogeneous computing environments.
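The sketch below illustrates tensor-parallel inference with DeepSpeed, under the assumption that Intel Extension for DeepSpeed* and the oneCCL bindings are installed for Intel GPUs; argument names can differ between DeepSpeed releases, and the script would normally be launched with the deepspeed launcher (e.g., `deepspeed --num_gpus 2 script.py`).

```python
# A minimal sketch of tensor-parallel LLM inference with DeepSpeed; the model
# name is a placeholder and exact arguments may vary across releases.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Shard the model across the participating devices (tensor parallelism).
engine = deepspeed.init_inference(model, mp_size=2, dtype=torch.bfloat16)
model = engine.module

inputs = tokenizer("Distributed inference lets large models", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```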

Applying Optimizations in Practice

Implementing these optimizations with Intel's tools is straightforward, since they are exposed through the extensions for both the PyTorch and Transformers frameworks. For instance, Intel Extension for Transformers provides model compression techniques such as SmoothQuant and weight-only quantization directly within the familiar Transformers API. You can optimize models simply by configuring the quantization parameters and leveraging the integrated APIs.

Similarly, Intel Extension for PyTorch provides a versatile framework for optimizing both LLMs and other deep learning models. With CPU optimizations such as NUMA control and graph optimizations, along with GPU-centric features like tensor parallelism, this extension enables fine-tuning and deployment across a range of hardware configurations.

Conclusion

By leveraging Intel’s comprehensive hardware stack, acceleration libraries, and optimized frameworks, you can substantially improve the performance and efficiency of your AI models. These optimizations not only enhance computational speed and reduce latency, but also lower operational costs and energy consumption associated with running large-scale AI applications.

You can explore these optimizations on Intel® Tiber™ Developer Cloud, using getting started samples from Intel Extension for Transformers and Intel Extension for PyTorch. By integrating these techniques, you can ensure your LLMs are running at peak performance on Intel hardware.

We encourage you to check out Intel’s other AI tools and framework optimizations and learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI software portfolio.

About the Speakers

Pramod Pai
AI Software Solutions Engineer

Pramod Pai is an AI Software Solutions Engineer at Intel who enables customers to optimize their machine learning workflows using solutions from Intel. His areas of focus include the Intel® oneAPI AI Analytics Toolkit and Intel® Extension for PyTorch*. He holds a Master's degree in Information Systems from Northeastern University.

Kevin Ta
AI Software Solutions Engineer, Intel

Kevin is an AI Software Solutions Engineer at Intel. He enables customers to develop and accelerate their AI workloads using performance optimizations provided by Intel’s artificial intelligence and data analytics product line. He holds a Ph.D. in Engineering & Applied Science with a focus on medical image analysis from Yale University and a B.S. in Biomedical Engineering with a specialization in bioinstrumentation and electronic systems from the University of Connecticut.

Alex Sin
AI Software Solutions Engineer

Alex is an AI Software Solutions Engineer who enables customers to build their AI applications using Intel's hardware architectures and software stacks. He provides technical consulting on using the Intel AI Analytics Toolkit on Intel® Xeon® Scalable processors to optimize accelerated computing in AI and machine learning. Previously, Alex worked at Viasat, developing and testing microcontroller- and FPGA-based embedded security systems used by the government in radios, by warfighters, and in the Navy. Alex holds bachelor's and master's degrees in electrical engineering.

About the Author
AI Software Marketing Engineer creating insightful content on the cutting-edge AI and ML technologies and software tools coming out of Intel.