Compressing the Transformer: Optimization of DistilBERT with the Intel® Neural Compressor

Adam_Wolf · 06-14-2023

A neural network, the aptly named, biologically inspired programming paradigm, processes data through a series of interconnected nodes arranged in layers. While the technique is powerful, it has become apparent that as neural networks have grown larger, they have become increasingly hard to use and manage. Transformer natural language processing (NLP) models in particular, such as Bidirectional Encoder Representations from Transformers (BERT), are well-known and commonly used architectures, but they are prone to becoming unwieldy.

Understanding Transformer Architecture for Deep Learning

The transformer was developed in the context of machine translation with an encoder/decoder architecture and was later adapted to other deep learning tasks and domains. The architecture’s core component is the multi-headed (self-) attention mechanism, which allows the neural network to control how information is mixed between parts of an input sequence. This produces more robust representations, which in turn improves performance on machine learning tasks. Another key advantage of the transformer architecture is its support for parallelization. As mentioned, one of the most popular transformers is BERT, which has been around for over four years. More recent AI development has seen model sizes and parameter counts grow rapidly, because scaling up has started to produce unprecedented and sometimes unexpected capabilities, driving major improvements in natural language understanding, creative text generation, and multilingual translation, among others. But herein lies the tradeoff: more powerful and capable models are also significantly larger, which makes them more difficult to use.

See The Rise and Rise of A.I. – Large Language Models (LLMs) - Interactive chart at InformationisBeautiful.net

GPT-3 is one of the most popular, albeit extreme, examples of this over-parameterization and size issue. While official figures for GPT-3 are not known (its training is believed to have cost somewhere between 10 and 20 million dollars), we do have estimates for GPT-NeoX, a smaller, albeit still very large, open-source alternative. This Large Language Model (LLM) by EleutherAI contains over 20 billion parameters, and training it required approximately 96 Nvidia A100 graphics cards running over a three-month period. But even smaller models, such as BERT, still demand substantial resources, especially for fine-tuning. Latency and cost constraints effectively hinder the deployment of applications built on these models, on both server and client devices. These issues create the need for optimization strategies: developers want to make the process faster, more computationally efficient, and of course more sustainable at scale.

Neural Network Optimization Techniques

There are numerous neural network optimization and compression techniques, including quantization, pruning, knowledge distillation, graph optimization, and mixed precision. For compression techniques such as quantization, the concept of precision is integral. Precision designates the number of bits used to store a numerical value in computer memory. Lowering precision reduces memory bandwidth and storage requirements and increases performance, ideally with minimal accuracy loss.
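To make the storage effect of precision concrete, here is a minimal sketch that estimates the weight-only footprint of a model at different precisions. The 20-billion parameter count is a round, illustrative figure in the range cited earlier for GPT-NeoX, and the function and variable names are hypothetical. Activations, optimizer state, and runtime overhead are ignored.

```python
# Bytes needed to store one value at each precision.
BYTES_PER_VALUE = {"FP32": 4, "BF16": 2, "INT8": 1}


def model_size_gb(num_parameters: int, precision: str) -> float:
    """Return the storage needed for the weights alone, in GB (10**9 bytes)."""
    return num_parameters * BYTES_PER_VALUE[precision] / 1e9


# Illustrative assumption: a model with roughly 20 billion parameters.
NUM_PARAMETERS = 20_000_000_000

sizes = {p: model_size_gb(NUM_PARAMETERS, p) for p in BYTES_PER_VALUE}
for precision, size in sizes.items():
    print(f"{precision}: {size:.0f} GB")
```

Going from FP32 to INT8 cuts the weight storage to a quarter, which is where the bandwidth and storage savings described above come from.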

Here is a quick rundown of the aforementioned techniques:

Quantization refers to a systematic reduction of a model’s precision. A common way to do this in deep learning is to go from the standard FP32 datatype to INT8, either through post-training quantization (both static and dynamic) or through quantization-aware training (QAT). Overall, quantization allows for faster inference in the deep learning pipeline, but it may result in a small loss in accuracy.
Pruning aims to remove the more superfluous parts of a deep learning model, and it is typically divided into unstructured and structured pruning. Overall, the goal is again to allow for faster inference, but pruning may result in a small loss in accuracy.
Knowledge Distillation is a model compression method in which a smaller model is trained to mimic a larger, pre-trained model. The two are often referred to as the “student” and the “teacher”: by mimicking the larger teacher network, the student effectively becomes a shallower version of the teacher. The knowledge is transferred from the teacher through a combined loss function. The newly trained student model is then used for downstream tasks, resulting in performance gains in fine-tuning and inference.
Graph Optimization relies on the idea of treating a neural network as a directed acyclic graph. From this perspective, the graph itself can be optimized, such as by operator fusion, which in turn also brings about memory optimizations. Ultimately, graph optimizations lead to faster inference.
Mixed Precision, as the name suggests, is the combination of different numerical formats in one computational workload. It helps reduce memory usage and thus results in faster training and inference. A common combination used is FP32 with BF16. A more convenient technique within mixed precision is Auto-Mixed Precision (AMP), which provides for even faster training and inference by automatically detecting which parts in a neural network can be put into lower precision.
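As an illustration of the first technique in the list, the following is a minimal pure-Python sketch of affine FP32-to-INT8 quantization, using a scale and a zero-point derived from the observed value range. The function names and sample weights are hypothetical, and production toolkits use more elaborate schemes (per-channel scales, calibration data, fused kernels). This only shows the basic mapping and the rounding error it introduces.

```python
def quantize_int8(values):
    """Map floats to INT8 codes with an affine (scale and zero-point) scheme."""
    lo = min(min(values), 0.0)  # widen the range so that 0.0 is representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255
    if scale == 0.0:            # all-zero input: any positive scale works
        scale = 1.0
    zero_point = round(-128 - lo / scale)
    # Round to the nearest code and clamp to the signed 8-bit range.
    codes = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_int8(codes, scale, zero_point):
    """Recover approximate float values from INT8 codes."""
    return [(c - zero_point) * scale for c in codes]


weights = [-1.0, -0.25, 0.0, 0.4, 1.55]
codes, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(codes, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("INT8 codes:", codes)
print("largest round-trip error:", max_error)
```

Each weight now occupies one byte instead of four, and the round-trip error is bounded by the step size `scale`, which is the source of the small accuracy loss mentioned above.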

With BERT, one solution has been to use knowledge distillation to create a smaller, “distilled” model, hence the name DistilBERT. Knowledge distillation in the pre-training phase reduced the size of the BERT model by 40% while still retaining over 95% of BERT’s performance, as measured by the GLUE language understanding benchmark. As a result of the reduced parameter count and fine-tuned optimizations, a DistilBERT model can run 60% faster than a BERT model.
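The combined loss function used in knowledge distillation can be sketched in a few lines. The version below is the generic two-term form (a temperature-softened teacher/student term plus a standard cross-entropy term), not the exact DistilBERT training objective. The function names, logits, temperature, and weighting factor `alpha` are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-target term."""
    # Soft-target term: KL divergence between softened teacher and student.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s)
               for t, s in zip(p_teacher, p_student) if t > 0)
    # Hard-target term: cross-entropy against the ground-truth label.
    hard = -math.log(softmax(student_logits)[true_label])
    # The temperature**2 factor keeps the two terms on a comparable scale.
    return alpha * temperature ** 2 * soft + (1 - alpha) * hard


teacher = [4.0, 1.0, 0.5]
close_student = distillation_loss([3.8, 1.1, 0.4], teacher, true_label=0)
far_student = distillation_loss([0.5, 4.0, 1.0], teacher, true_label=0)
print(close_student, far_student)
```

A student whose outputs track the teacher's receives a lower loss than one that disagrees with it, which is what drives the student toward the teacher's behavior during training.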

Intel® Neural Compressor (INC)

Looking now at one of the key Intel-optimized software toolkits, the Intel® Neural Compressor (INC) is an open-source Python* library that delivers a unified interface across multiple deep learning frameworks for popular network optimization and compression techniques, including quantization, pruning, knowledge distillation, graph optimizations, and mixed precision. It takes advantage of acceleration by Intel® Deep Learning Boost (Intel® DL Boost) and Intel® Advanced Matrix Extensions (Intel® AMX). With the Intel® Neural Compressor, a developer can take a model from a supported deep learning framework or model representation (e.g., PyTorch* or TensorFlow*) and apply any of these optimization techniques. While these optimization techniques normally tend to produce some degree of accuracy loss on a model, the Intel® Neural Compressor counters this weakness by using automatic accuracy-driven tuning strategies to help easily determine the best optimization methods. Overall, this is a super-fueled, powerful toolbox that runs on both the CPU and GPU. Take a closer look at the codebase for the Intel® Neural Compressor here.
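The idea behind accuracy-driven tuning can be illustrated with a small, library-independent sketch: try candidate optimization configurations from most to least aggressive and keep the first one whose accuracy stays within a tolerated loss. This is not the Intel® Neural Compressor API. The function name, configuration names, accuracy figures, and the 1% tolerance are all hypothetical.

```python
def tune(candidates, evaluate, baseline_accuracy, max_relative_loss=0.01):
    """Return the first candidate whose accuracy stays within the allowed loss.

    `candidates` is ordered from most to least aggressive, so the first
    acceptable candidate is also the most compressed one.
    """
    threshold = baseline_accuracy * (1 - max_relative_loss)
    for name in candidates:
        accuracy = evaluate(name)
        if accuracy >= threshold:
            return name, accuracy
    # No candidate met the accuracy goal; keep the original model.
    return None, baseline_accuracy


# Hypothetical accuracy measurements for three candidate configurations.
MEASURED = {
    "int8_all_layers": 0.893,
    "int8_skip_embeddings": 0.908,
    "bf16": 0.912,
}

best, accuracy = tune(list(MEASURED), MEASURED.get, baseline_accuracy=0.913)
print(best, accuracy)
```

In this made-up example the most aggressive configuration loses too much accuracy, so the search falls back to the next candidate that stays within the tolerance.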

An example of one optimization workflow for the Intel® Neural Compressor, specifically for how static post-training quantization works, can be viewed in the diagram below.

Intel® Extension for Transformers

One particularly notable success with Intel® Neural Compressor is its adoption by the well-known transformer repository Hugging Face. An NLP-focused startup, Hugging Face has developed the go-to library for transformers that exposes state-of-the-art transformer architectures to end-users. It hosts not only many models but also a large selection of datasets. Furthermore, as the transformer architecture has also been applied successfully to computer vision, developers can find a variety of computer vision models and even multi-modal models there. In 2021, Hugging Face released Optimum, a toolkit for transformer optimization techniques that essentially includes a wrapper for the Intel® Neural Compressor.

In order to stay on top of the increasing importance of the transformer architecture in the AI and machine learning space, Intel is actively developing a toolkit that specifically addresses the optimization of this architecture: the Intel® Extension for Transformers, which builds upon the functionality of the Intel® Neural Compressor and Hugging Face to make the optimization of transformers more accessible. The toolkit functions as an extension to Hugging Face’s transformers and Optimum, and it serves as a staging area for Intel’s latest transformer feature enhancements, such as neural architecture search. The Intel® Extension for Transformers also targets many backends for deployment, including Intel-optimized versions such as the Intel® Extension for PyTorch* and the Intel® Extension for TensorFlow*, as well as OpenVINO™ and Neural Engine. A public version of the Intel® Extension for Transformers is available and actively updated on GitHub* here.

Overall, transformers are a key area of neural network programming in need of strong optimization strategies as AI workflows continue to scale in this space, hence the development of tools such as the Intel® Neural Compressor and Intel® Extension for Transformers. But it is worth noting that the Intel® Neural Compressor addresses other domains and architectures as well, including computer vision, recommender systems, and more. We also encourage you to check out Intel’s other AI Tools and Framework optimizations and learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio.

See the video: Compress the Transformer: Optimize Your DistilBERT Models

About our experts

Dr. Nikolai Solmsdorf
AI Software Solutions Engineer
Intel

Nikolai is responsible for software engineering specific to AI workloads, including helping Intel customers optimize them. With extensive experience in AI/DL/ML, he has additional expertise in linguistics and NLP and currently focuses on optimizing DL models for training and inference on CPU and GPU, with a particular interest in state-of-the-art transformer/NLP models. Prior to joining Intel, Nikolai contributed to various research projects, such as at the Bavarian Academy of Sciences and Humanities and SOAS University of London. Nikolai holds a PhD in Asian Studies and an MSc in Computational Linguistics, both from the Ludwig Maximilian University in Munich.

Dr. Séverine Habert
AI Software Engineering Manager
Intel

Dr. Séverine Habert leads a team of AI software solutions engineers that helps customers use Intel AI Software tools. Séverine holds a PhD in Medical Imaging from the Technical University of Munich.