
Best Practices for Text Classification with Distillation (Part 2/4) – Challenging Use Cases


Published May 26th, 2021

Moshe Wasserblat is currently the research manager for Natural Language Processing (NLP) at Intel Labs.


In the first post, I showed how to achieve surprisingly good performance by distilling large (teacher) models into tiny (student) models whose text-classification performance is on par with the mammoth Transformer models. In this blog, I intend to explore this method further and investigate other text classification datasets and sub-tasks in an effort to replicate these results.

To that end, I chose SST-2 and CoLA, which are popular single-sentence classification datasets and are part of the widely used General Language Understanding Evaluation (GLUE) benchmark.

SST-2 Dataset

SST-2, the Stanford Sentiment Treebank 2, comprises single sentences extracted from movie reviews and binary (positive/negative) sentiment classification labels.

Here are the results of several Transformer models, a multilayer perceptron (MLP), and the RoBERTa model distilled into the MLP (DistilMLP).

[Table: SST-2 accuracy for the Transformer models, MLP, and DistilMLP]
Alas, we see an 8% drop in accuracy for DistilMLP compared to BERT-Base.

This time, the tiny MLP model lacked the capacity to absorb the knowledge that would enable it to decode the classification task as well as the teacher model.
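As a reminder, the logit-distillation objective from the first post combines temperature-scaled soft targets from the teacher with the ordinary hard-label loss. Here is a minimal sketch in plain Python; the temperature T and mixing weight alpha are illustrative assumptions, not the values used in our experiments:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(x / T) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label, T=2.0, alpha=0.5):
    """Weighted sum of soft-target cross-entropy (against the teacher's
    temperature-softened distribution) and hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # Cross-entropy between the teacher's and student's soft distributions
    soft_loss = -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))
    # Standard cross-entropy against the gold label (T = 1)
    hard_loss = -math.log(softmax(student_logits)[hard_label])
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

In practice this is computed over mini-batches with a framework such as PyTorch; the sketch only shows the shape of the objective.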

Let’s try replacing our student model with one that has a much deeper architecture but is still significantly smaller than BERT: Bi-LSTM with 0.66M parameters (167x smaller).
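For a sense of where a parameter count of this scale comes from, it can be estimated from the standard LSTM formulation: four gates, each with input weights, recurrent weights, and a bias vector, doubled for a bidirectional encoder. The hyperparameters below are illustrative assumptions, not the exact configuration we deployed:

```python
def lstm_params(input_size, hidden_size, bidirectional=True):
    """Parameter count of one LSTM layer: 4 gates, each with
    input weights, recurrent weights, and a bias vector."""
    per_direction = 4 * (input_size * hidden_size
                         + hidden_size * hidden_size
                         + hidden_size)
    return per_direction * (2 if bidirectional else 1)

# Illustrative student configuration (an assumption, not the exact one used)
embedding_dim, hidden = 300, 150
encoder = lstm_params(embedding_dim, hidden)
classifier = (2 * hidden) * 2 + 2  # concatenated Bi-LSTM output -> 2 classes
print(encoder + classifier)  # on the order of half a million parameters
```

Either way, the encoder lands well below 1M parameters, orders of magnitude smaller than BERT-Base's 110M.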

Here are the results of the Bi-LSTM and distilled model.

[Table: SST-2 accuracy for the Bi-LSTM and the distilled Bi-LSTM]
Not bad! We only have a 2% drop in accuracy compared to the BERT-Base model, and it’s almost on par with DistilBERT.

So, for SST-2, a simple Bi-LSTM student model would be considered sufficient for production purposes.

The following figure summarizes the results that we achieved so far for SST-2 in terms of (model acc./BERT acc.)% vs. model size.

[Figure: SST-2 (model acc./BERT acc.)% vs. model size]
The CoLA Dataset and Task

CoLA, the Corpus of Linguistic Acceptability, consists of English acceptability judgments drawn from books and journal articles on linguistic theory. Each sentence is associated with a label that indicates whether it is a grammatical English sentence or not.

Here are the results for the teacher model (BERT) and the two student distilled models:

[Table: CoLA results for the teacher model (BERT) and the two distilled student models]
Not good! Unlike its performance in classifying the SST-2 dataset, even the Bi-LSTM cannot match BERT on the CoLA task. Both the Bi-LSTM and the MLP models are far from performing on par with BERT. In the case of the SST-2 task, Bi-LSTM, which is larger and deeper than MLP, closed the gap when distilled from the teacher model (RoBERTa).

Let’s try replacing our student model with one that has a much deeper architecture compared to Bi-LSTM but is still smaller than BERT: TinyBERT6, a 6-layer distilled version of BERT (67M parameters), trained by distilling the teacher’s logits, attention maps, and embeddings.

We get:

MCC = 51.1%

This MCC is significantly better than Bi-LSTM’s, with a drop of less than 2% compared to BERT, and would be considered sufficient for production purposes.
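As a side note, CoLA is scored with the Matthews correlation coefficient (MCC) rather than plain accuracy because the dataset is class-imbalanced. A minimal sketch of computing MCC from binary confusion-matrix counts:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0
```

A perfect classifier scores 1.0, random guessing about 0.0, and a classifier that simply predicts the majority class also scores 0.0, which is exactly why MCC is preferred for imbalanced data.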

The following figure summarizes the results that we achieved so far for CoLA in terms of (model MCC/BERT MCC)% vs. model size.

[Figure: CoLA (model MCC/BERT MCC)% vs. model size]
What’s the intuition behind these results?

In general, we showed that it is feasible to distill BERT into very efficient models while preserving comparable results. However, the success of the distillation (student model size vs. accuracy) depends on the dataset and task at hand.

What is the reason for such variance in performance?

To answer this question, we need to take a deeper look into the different datasets:

[Table: characteristics of the Emotion, SST-2, and CoLA datasets]
Successful classification of the Emotion task seems to depend heavily on verbal cues, i.e., salient emotional words representing the emotional category, regardless of structure and syntax. The CoLA task, on the other hand, is inherently dependent on syntactic structure and less on lexical cues. And the SST-2 task seems to depend on a mix of lexical and syntactic cues.

So, to successfully distill the full teacher knowledge required for a given task, the student model architecture must have enough capacity to absorb the teacher model’s task-relevant knowledge.

The MLP model does not have the capacity to learn the full syntactic information stored in the teacher model because its architecture is based on a bag-of-words (BOW) embedding implementation. It therefore performs very poorly on CoLA and only moderately well on SST-2. On the other hand, classification of the Emotion task requires learning mostly semantic lexical knowledge. Therefore, MLP is sufficient, and Bi-LSTM and DistilBERT are over-qualified.
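To see why a BOW-based student struggles with acceptability judgments, note that bag-of-words representations are order-invariant: two sentences containing identical words, one grammatical and one not, map to exactly the same input. A toy illustration (the example sentences are my own, not from CoLA):

```python
from collections import Counter

def bow(sentence):
    """Order-free bag-of-words representation of a sentence."""
    return Counter(sentence.lower().split())

acceptable = "the dog chased the cat"
unacceptable = "dog the the cat chased"  # same words, ungrammatical order

# The BOW features are identical, so no model built on top of them
# can distinguish the grammatical sentence from the ungrammatical one.
print(bow(acceptable) == bow(unacceptable))  # True
```

Real MLP students average word embeddings rather than count words, but the averaging is equally order-invariant, so the same argument applies.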

The Bi-LSTM model does have an inherent structure for capturing syntax, so why does it fail on the CoLA task? I don’t have a clear answer, but here are my intuitions:

1. The deployed Bi-LSTM structure is relatively simple (only 0.66M parameters), so increasing its internal embedding size and adding more layers would potentially improve its capacity.

2. In the CoLA case, the transformer has to utilize its full contextual capacity (learned from a massive amount of data during pre-training), while the Bi-LSTM structure, regardless of its size, is limited in its ability to hold such vast knowledge. See OpenAI’s “Scaling Laws for Neural Language Models.”

3. Another point is that the efficacy of knowledge distillation also depends on the availability of a large amount of (unlabeled) data representing the task. In the case of SST-2 and CoLA, we had to generate training data using different data augmentation techniques. Augmenting CoLA’s grammatical acceptability task is very challenging, and it seems we were not able to generate high-quality data representing the complete task’s data distribution, which would have enabled more effective distillation.
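One common way to generate such unlabeled augmentation data (a generic technique for illustration, not necessarily the exact method we used) is to perturb existing sentences, for example by randomly masking tokens, and let the teacher label the variants:

```python
import random

def mask_augment(sentence, mask_token="[MASK]", p=0.15, rng=None):
    """Replace each whitespace token with a mask token with probability p."""
    rng = rng or random.Random()
    tokens = sentence.split()
    return " ".join(mask_token if rng.random() < p else t for t in tokens)

rng = random.Random(0)
for _ in range(3):
    print(mask_augment("the movie was surprisingly good", rng=rng))
```

Note that for CoLA such perturbations can themselves flip a sentence’s grammatical acceptability, which is precisely what makes augmentation for this task so hard.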

All the above are very promising research directions.

To summarize: Very simple and efficient models can successfully distill classification tasks that require capturing general lexical semantics cues. However, classification tasks that require the detection of linguistic structure and contextual relations are more challenging for distillation using simple student models.

In the next blog, I will suggest a simple metric for estimating the success of distillation, and then offer practical tips for training an efficient model for text classification.




Special thanks to Jonathan Mamou for the great contribution in exploring data augmentation for distillation.

About the Author
Mr. Moshe Wasserblat is currently the Natural Language Processing (NLP) and Deep Learning (DL) research group manager at Intel’s AI Product Group. In his former role, he was with NICE Systems for more than 17 years, where he founded NICE’s Speech Analytics Research Team. His interests are in the fields of Speech Processing and Natural Language Processing (NLP). He was the co-founder and coordinator of the EXCITEMENT FP7 ICT program and served as organizer and manager of several initiatives, including many Israeli Chief Scientist programs. He has filed more than 60 patents in the field of Language Technology and has several publications in international conferences and journals. His areas of expertise include Speech Recognition, Conversational Natural Language Processing, Emotion Detection, Speaker Separation, Speaker Recognition, Information Extraction, Data Mining, and Machine Learning.