Jonathan Mamou is a senior Natural Language Processing and Deep Learning researcher at Intel Labs, currently focusing on generative AI efficiency. This blog is co-authored by Oren Pereg, a principal NLP research leader at Intel Labs; Xinyu Ye, an AI Frameworks Engineer at Intel; and Daniel Rotem, Michael Hassid, and Prof. Roy Schwartz from the Hebrew University of Jerusalem.
Highlights:
- Intel Labs and the Hebrew University of Jerusalem present SWEET, an adaptive inference method for text classification, at this year’s ACL conference.
- The SWEET method enables high speedups for Early-Exit models and can be applied to various exit strategies, architectures, fine-tuning methods, etc.
- SWEET outperforms both the Multi-Model and Early-Exit baselines in the fast region of the speed-accuracy tradeoff (high inference speed) while maintaining comparable results at slower speeds.
- The SWEET code has been integrated into Intel Extension for Transformers (ITREX).
This week at The 61st Annual Meeting of the Association for Computational Linguistics (ACL’23), Intel Labs and the Hebrew University of Jerusalem presented Separating Weights for Early-Exit Transformers (SWEET), an adaptive inference method for text classification. Along the way, the work surfaces new insights about the Multi-Model and Early-Exit approaches to adaptive inference. This work was done as part of an academic collaboration between Dr. Roy Schwartz’s lab at the Hebrew University of Jerusalem and Intel Labs.
Adaptive Inference
As NLP models get better and bigger, we look for ways to reduce their inference costs. One way is adaptive inference, where we run expensive models on “complex” instances and cheap models on “easy” ones, reducing the average inference cost.
There are two common approaches to doing so:
- Multi-Model: leveraging multiple independent models of different sizes and running them sequentially until a confident prediction is made.
- Early-Exit: the number of layers used for inference changes dynamically based on the input example. Each exit layer includes a classifier that produces a confidence score at inference time. If the score exceeds a threshold, the final prediction is generated from the output of that layer; all remaining layers are skipped, speeding up inference. (A minimal sketch of this shared decision rule appears after Figure 1 below.)
Figure 1. Illustration of the adaptive inference approaches compared in this work. In both methods, multiple classifiers of increasing sizes are run serially, until a confident prediction is made. In Early-Exit (left), a single model with multiple classifiers is used, such that early computations are reused by later classifiers. In Multi-Model (right), a sequence of independent models is used, allowing each classifier to decouple its parameters from other classifiers.
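To make the shared decision rule concrete, here is a minimal Python sketch. This is our own illustration, not the paper’s code; the names `stages`, `cascade_predict`, and the default `threshold` value are assumptions:

```python
import numpy as np

def softmax(logits):
    """Turn raw logits into a probability distribution."""
    z = np.exp(logits - logits.max())
    return z / z.sum()

def cascade_predict(stages, x, threshold=0.9):
    """Shared decision rule behind Multi-Model and Early-Exit inference.

    `stages` is an ordered list of callables mapping an input to class
    logits: independent models of increasing size (Multi-Model), or the
    successive exit classifiers of one model (Early-Exit). We stop at
    the first stage whose top-class probability clears the threshold.
    """
    for i, stage in enumerate(stages):
        probs = softmax(stage(x))
        if probs.max() >= threshold:       # confident enough: stop here
            return int(probs.argmax()), i  # prediction + stage index
    return int(probs.argmax()), len(stages) - 1  # fall back to last stage
```

The only difference between the two approaches is what a “stage” costs: in Early-Exit the stages share one backbone, so each exit reuses the computation of earlier layers, while in Multi-Model every stage is an independent model run from scratch.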
Which Model is Better to Use?
Investigating the fine-tuning process of these models, we find that the weights of an Early-Exit model are updated by its multiple classifiers in different, often orthogonal directions. We name this phenomenon conflicting gradients.
Figure 2. Average cosine similarity between the classifiers’ gradient updates to model layers. C-i stands for Classifier i. Layer 1 (preceding C-1) is updated by 3 classifiers, while layers 4 (preceding C-2) and 6 (preceding C-3) are updated by 2 and 1 classifiers, respectively. For each layer, the gradient update of its following classifier is roughly orthogonal to those of later classifiers, whereas the gradient updates of later classifiers tend to better align with one another.
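As a rough illustration of how such a measurement could be made, here is a hedged PyTorch sketch with toy modules. The names `gradient_cosine`, `layers`, and `classifiers`, and the one-exit-per-layer layout, are our simplifications, not the paper’s setup. It computes each exit’s loss gradient with respect to a shared layer and compares directions via cosine similarity:

```python
import torch
import torch.nn.functional as F

def gradient_cosine(layers, classifiers, x, y):
    """Pairwise cosine similarity between the gradient updates that each
    exit classifier sends to the first shared layer's parameters."""
    shared = list(layers[0].parameters())
    grads = []
    h = x
    for layer, clf in zip(layers, classifiers):  # one exit per layer (toy)
        h = layer(h)                             # shared backbone layer
        loss = F.cross_entropy(clf(h), y)        # this exit's loss
        g = torch.autograd.grad(loss, shared, retain_graph=True)
        grads.append(torch.cat([t.reshape(-1) for t in g]))
    n = len(grads)
    sims = torch.zeros(n, n)
    for i in range(n):
        for j in range(n):
            sims[i, j] = F.cosine_similarity(grads[i], grads[j], dim=0)
    return sims  # near-zero off-diagonals indicate conflicting gradients
```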
Conflicting gradients prove detrimental to Early-Exit classifiers' performance. In fact, our study shows that Multi-Model’s individual classifiers (unaffected by conflicting gradients) outperform those of Early-Exit by an average of 2.3% across BERT and DeBERTa base and large models.
Figure 3. Results of individual classification layers averaged across all tasks using BERT as a backbone model. Multi-Model (MM) classifiers outperform their Early-Exit (EE) counterparts, with the gap being largest for early classifiers. SWEET closes much of this gap, especially for early classifiers. Standard deviation (across random seeds) is reported in subscript.
Regarding the speed-accuracy tradeoffs, Multi-Model dominates the faster parts of the curve, leveraging the superiority of its individual classifiers. Early-Exit excels at lower speeds by avoiding the overhead of running independent models sequentially.
The question that arises is: can we combine the advantages of both methods?
Figure 4. Speed-accuracy trade-off comparison of Multi-Model and Early-Exit. Multi-Model performs better only at fast inference times (up to 1/4 of original run time), while Early-Exit dominates the remainder of the range. The graph shows the average task scores (y-axis) as a function of the speedup ratio (x-axis).
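For orientation, here is one plausible way to compute the speedup ratio plotted on the x-axis, under the simplifying assumption (ours, not necessarily the paper’s exact metric) that per-example cost is proportional to the number of Transformer layers actually executed:

```python
def speedup_ratio(exit_layers, total_layers):
    """Speedup over running the full model on every example, assuming
    cost is proportional to the number of layers executed.

    exit_layers[i] is the layer at which example i exited.
    """
    avg_layers_used = sum(exit_layers) / len(exit_layers)
    return total_layers / avg_layers_used

# Example: a 12-layer model where three examples exit at layers 2, 4 and 12
print(speedup_ratio([2, 4, 12], total_layers=12))  # -> 2.0 (2x speedup)
```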
SWEET
Based on these findings, we introduce SWEET (Separating Weights in Early-Exit Transformers), a novel fine-tuning method that combines the strengths of Early-Exit and Multi-Model while bypassing their limitations. We train an Early-Exit architecture (so there is no Multi-Model overhead) such that each layer is updated only by the classifier that immediately follows it (so there are no conflicting gradients).
Figure 5. Left: standard Early-Exit fine-tuning, where lower layers get gradient updates from multiple classifiers. Right: our SWEET method, in which each layer’s parameters are updated only by the next classifier.
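In PyTorch terms, the core trick can be sketched as detaching the hidden states between blocks during fine-tuning, so gradients from later exits never reach earlier layers. This is a minimal sketch under our reading of Figure 5; the module and function names are illustrative:

```python
import torch.nn.functional as F

def sweet_forward_loss(blocks, classifiers, x, y):
    """SWEET-style fine-tuning step: each block of Transformer layers is
    updated only by the exit classifier that immediately follows it.

    blocks[i] maps hidden states up to exit i; classifiers[i] is that
    exit's classification head. Both are illustrative toy modules.
    """
    total_loss = 0.0
    h = x
    for block, clf in zip(blocks, classifiers):
        h = block(h)                               # forward pass as usual
        total_loss = total_loss + F.cross_entropy(clf(h), y)
        h = h.detach()  # cut the gradient path: later exits can no longer
                        # push gradients into this or any earlier block
    return total_loss
```

Calling `total_loss.backward()` then updates each block through its own classifier only. The `detach` has no effect at inference time, so the standard Early-Exit forward pass, and its freedom from Multi-Model overhead, is unchanged.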
Results
SWEET outperforms both baselines in the early section of the speed-accuracy tradeoff (high inference speeds) while maintaining comparable results at slower speeds. For BERT Large, SWEET outperforms both baselines across the entire span.
We are excited to see our results motivate further research into fine-tuning algorithms tailored to the unique Early-Exit architecture.
Figure 6. Speed-accuracy tradeoff averaged across tasks. SWEET matches the performance of Multi-Model at fast speeds, while maintaining results comparable to Early-Exit at slow speeds.
Takeaways:
- Conflicting gradients exist during training of Early-Exit models
- Comparing the Early-Exit and Multi-Model methods, we find that individual Multi-Model classifiers are better; over the entire speed-accuracy curve, however, Early-Exit is still preferable to Multi-Model
- The SWEET method enables high speedups for Early-Exit models and can be applied to various exit strategies, architectures, fine-tuning methods, etc.
For more details and additional results, check out the full paper on arXiv. The SWEET code has been integrated into Intel Extension for Transformers (ITREX) version 1.01.
All the figures are taken from the ACL’23 published paper.