Best Practices for Text-Classification with Distillation Part (3/4) - Word Order Sensitivity (WOS)


Published June 8th, 2021

Moshe Wasserblat is currently the research manager for Natural Language Processing (NLP) at Intel Labs.


In the last two posts, we saw useful examples of text classification distillation methods. I provided some intuition as to why and when distillation works and how to choose the smallest student model that can match the capacity of its teacher model.

In this post, I introduce a metric for estimating the complexity level of your dataset and task, and I describe how to utilize it to optimize distillation performance. I will finish this blog with a few practical tips for training an efficient model for text classification.

Word Order Sensitivity

We noted that in “easy” instances, prediction is mostly based on semantic cues and seems to be rather agnostic to syntax or word order. Thang et al., 2020 went a step further and reported a surprising phenomenon: between 75% and 90% of the correct predictions of Transformer-based classifiers trained on General Language Understanding Evaluation (GLUE) tasks remained unchanged when the input words were randomly shuffled! The authors further suggested a simple metric to measure a dataset’s sensitivity to word order:

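WOS (Word Order Sensitivity) = (100-p)/50, where p is the accuracy (in %) of a task-trained model evaluated on a dev set whose input words have been randomly shuffled (see Thang’s Sec 3.1 and 2.3.2). A WOS near 0 means that shuffling barely changes the predictions; a WOS near 1 means that accuracy on shuffled input drops to about 50%.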
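To make the formula concrete, here is a minimal Python sketch of how a WOS score could be computed. It assumes that p is measured after shuffling the words of each dev example (1-gram shuffling) and that the dev set is class-balanced, so that chance accuracy is about 50%. The helper names (shuffle_words, shuffled_accuracy, wos_score) and the toy keyword classifier are hypothetical and only for illustration; they are not taken from Thang et al. or from any library.

```python
import random


def shuffle_words(sentence, rng):
    """Return the sentence with its whitespace-separated words in random order."""
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)


def shuffled_accuracy(predict, sentences, labels, seed=0):
    """Accuracy (in %) of `predict` on word-shuffled versions of the sentences."""
    rng = random.Random(seed)  # fixed seed so the shuffle is reproducible
    correct = sum(
        predict(shuffle_words(sentence, rng)) == label
        for sentence, label in zip(sentences, labels)
    )
    return 100.0 * correct / len(sentences)


def wos_score(shuffled_accuracy_pct):
    """WOS = (100 - p) / 50, where p is the accuracy (in %) on shuffled input."""
    if not 0.0 <= shuffled_accuracy_pct <= 100.0:
        raise ValueError("accuracy must be a percentage between 0 and 100")
    return (100.0 - shuffled_accuracy_pct) / 50.0


# Toy stand-in for a trained model (an assumption for this sketch):
# a bag-of-words rule that ignores word order entirely.
def toy_predict(sentence):
    return "pos" if "good" in sentence.split() else "neg"


dev_sentences = ["a really good movie", "a dull and slow movie"]
dev_labels = ["pos", "neg"]

p = shuffled_accuracy(toy_predict, dev_sentences, dev_labels)
print(p, wos_score(p))
```

Because the toy classifier looks only at which words are present, shuffling cannot change its predictions, so its WOS is 0. A model whose shuffled accuracy were 83% would get a WOS of (100-83)/50 = 0.34.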

Here is a figure taken from Thang et al., showing the WOS scores plotted for various GLUE tasks, followed by a table that presents our measurements of RoBERTa’s WOS scores for the Emotion, SST-2, and CoLA tasks (1-gram shuffling):





Our intuition, discussed in the previous blog, was correct! The CoLA dataset with its average WOS score of 0.99 indeed consists of a vast majority of “hard” samples. In contrast, the Emotion dataset with the lowest WOS score consists of a vast majority of “easy” samples.

See the following table for some “easy” and “hard” instance examples taken from the SST-2 dataset:




The SST-2 WOS score of 0.34 means that it tends to have more “easy” instances than “hard” ones. These results are consistent with the distillation results (student model size vs. accuracy): the tiny distilled MLP model successfully classified the Emotion dataset, the deeper distilled Bi-LSTM model handled SST-2, and CoLA required the even deeper TinyBERT6 model.

IMDB Example

Let’s apply our new metric to the popular Internet Movie Database (IMDB) dataset and try to predict the distillation results. The IMDB dataset comprises single sentences extracted from informal movie reviews for binary (positive/negative) sentiment classification.

For this example, the training data is a subset of 1K samples selected at random from the dataset’s 25K training samples; the dataset also provides 25K test samples.

The WOS score of IMDB is 0.28. Since this WOS is relatively low (<0.3), we anticipate that a distilled MLP or Bi-LSTM model should be sufficient for absorbing the capacity of its teacher model (RoBERTa in this case).

Here are the results of the IMDB dataset/task classification, followed by a figure summarizing the results in terms of (model acc./BERT acc.)% vs. model size.





So yes, as we predicted, a small MLP model is capable of absorbing RoBERTa’s knowledge for the IMDB dataset/task and even outperforms DistilBERT.

In Summary

I would like to suggest a few steps for text classification with distillation deployed in production:

  • Set baseline results with a simple classifier (e.g., logistic regression, FastText)
  • Compare Transformers performance with baseline results
  • Estimate your data complexity based on WOS
  • If WOS is low (<0.35), consider distillation for tiny models that shine on your hardware (e.g., MLP, Bi-LSTM, TCN, CNN).
  • In other cases (WOS>0.35), you may consider several options:
  1. Quantization, sparsity, or pruning techniques for Transformers - see FastFormers, Hugging Face’s nn_pruning, and Intel’s LPOT
  2. Deploy TinyBERT6/DistilBERT for mid-level compression (x2), or TinyBERT4/MobileBERT when memory is constrained (>x7)
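The WOS-based part of the steps above can be sketched as a small decision helper. This is only an illustration of the rule of thumb in this post, not a definitive recipe: the function name recommend_students and the memory_constrained flag are hypothetical, and treating a WOS of exactly 0.35 as “high” is an assumption, since the post gives only the <0.35 and >0.35 cases.

```python
def recommend_students(wos, memory_constrained=False, threshold=0.35):
    """Map a WOS score to candidate compression options from the list above."""
    if wos < threshold:
        # Low word-order sensitivity: tiny distilled students are likely enough.
        return ["MLP", "Bi-LSTM", "TCN", "CNN"]
    if memory_constrained:
        # High WOS and a tight memory budget: stronger compression (>x7).
        return ["TinyBERT4", "MobileBERT"]
    # High WOS otherwise: mid-level compression (x2) or a compressed Transformer.
    return ["TinyBERT6", "DistilBERT", "quantized/sparse/pruned Transformer"]


# WOS scores quoted in this post.
for task, wos in [("IMDB", 0.28), ("SST-2", 0.34), ("CoLA", 0.99)]:
    print(task, wos, recommend_students(wos))
```

With the scores quoted in this post, IMDB (0.28) and SST-2 (0.34) fall in the tiny-model branch, while CoLA (0.99) falls in the Transformer branch, which matches the distillation results described earlier.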

In my next post, I will further exploit the WOS metric and propose a new conceptual Transformer architecture that benefits from distillation. Stay tuned!

About the Author
Mr. Moshe Wasserblat is currently the Natural Language Processing (NLP) and Deep Learning (DL) research group manager at Intel’s AI Product group. In his former role, he was with NICE Systems for more than 17 years, where he founded NICE’s Speech Analytics Research Team. His interests are in the fields of Speech Processing and Natural Language Processing (NLP). He was the co-founder and coordinator of the EXCITEMENT FP7 ICT program and served as organizer and manager of several initiatives, including many Israeli Chief Scientist programs. He has filed more than 60 patents in the field of Language Technology and has several publications in international conferences and journals. His areas of expertise include Speech Recognition, Conversational Natural Language Processing, Emotion Detection, Speaker Separation, Speaker Recognition, Information Extraction, Data Mining, and Machine Learning.