
Enhance Your AI Pipeline through Efficient Knowledge Transfer with Model Adapter


Authors:

Jian Zhang, Intel Corporation

Xinyao Wang, Intel Corporation

Yu Zhou, Intel Corporation

Introduction

Intel® End-to-End AI Optimization Kit [1] is a composable toolkit developed and open-sourced by Intel to make the end-to-end AI pipeline faster, simpler, and more accessible, broadening AI access to everyone and everywhere. One of its key components is Model Adapter (MA), which aims to construct neural network models with knowledge transferred from publicly available models/datasets while reducing end-users’ training and deployment costs.

Training large models from scratch can be extremely computationally intensive, and training and deploying them comes with several challenges: (1) massive numbers of parameters to train; (2) huge data labeling efforts; (3) high hardware requirements for deployment.

With MA, we implement a unified framework to support efficient knowledge transfer and reduce training, labeling, and deployment costs. MA incorporates three technologies: fine-tuning, knowledge distillation, and domain adaptation. Additionally, we compared the performance of MA-optimized models with naïve models on pure CPU devices. Our test results show that MA delivers solid training speedups and reduced labeling costs.

Motivations

Challenges of Training and Deploying Large Models

With the development of deep learning techniques, the size of advanced models is getting larger and larger. Fig.1 shows that the parameter size and data size of large-scale models for Natural Language Processing (NLP) tasks have increased by roughly 10 times per year [2]. For example, the GPT-3 model has 175B parameters and was trained on a dataset of roughly 500B tokens [3].

These models, while achieving state-of-the-art results, are in practice only accessible to big companies.

Applying these models is a big challenge for most users:

  • The cost of training large models from scratch can be extremely high, and training these advanced models requires a large amount of labeled data. For example, the widely used ImageNet-1k dataset has 1.28M labeled images [11].
  • Hardware with limited resources, e.g., mobile devices, cannot enjoy the benefits of large models.


Fig.1. The model size and data size applied by recent NLP models [2].
(A base-10 log scale is used for the figure)

Knowledge Transfer from Public Models/Datasets

Over time, more and more publicly available pre-trained models and labeled datasets have emerged on the Internet. For example, TensorFlow and PyTorch provide several pre-trained models for image classification. Hugging Face [4] provides many transformer-based models, pre-trained on large-scale datasets, along with many public datasets.

It would be ideal to utilize these resources and transfer knowledge from the available pre-trained models or labeled datasets. Fortunately, transfer learning technologies, such as fine-tuning, knowledge distillation, and domain adaptation, have been developed to take advantage of these pre-trained resources and optimize training and deployment.

Model Adapter: A Unified Framework for Knowledge Transfer

Currently, many toolkits focus on pre-training & fine-tuning, knowledge distillation, or domain adaptation separately. The Model Adapter toolkit combines all of these technologies while maximizing the capability of transfer learning. The main architecture is shown in Fig.2: a general framework that makes it easy to utilize publicly available pre-trained models and datasets.

There are three key modules in Model Adapter: Fine-tuner for pre-training & fine-tuning, Distiller for knowledge distillation, and Domain Adapter for domain adaptation. Each module shares a unified API and can be easily integrated with existing pipelines with few code modifications. Additionally, Model Adapter optimizes training and inference on CPUs, in both single-node and distributed modes.


 Fig.2. Model Adapter Overview
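
As a rough illustration of what "few code modifications" means here, the sketch below shows how a generic PyTorch training loop could drive any of the three modules through a shared interface. The names used (train_with_adapter, adapter.loss, adapter.trainable_parameters) are hypothetical placeholders for this sketch, not the actual e2eAIOK API.

import torch
from torch.utils.data import DataLoader

def train_with_adapter(adapter, train_loader: DataLoader, epochs: int = 3):
    # The existing pipeline stays the same; only the loss computation is
    # delegated to the adapter (fine-tuning, distillation, or domain adaptation).
    optimizer = torch.optim.SGD(adapter.trainable_parameters(), lr=0.01)
    for _ in range(epochs):
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            loss = adapter.loss(inputs, labels)
            loss.backward()
            optimizer.step()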

Fine-tuner with Fine-tuning Technology

There are many pre-trained, publicly-available models on the Internet that can be leveraged. Fine-tuner is focused on transferring knowledge from these pre-trained models to target models. It needs only a few iterations during fine-tuning to converge and can even achieve better accuracy than training from scratch. The module contains two stages: pre-training and fine-tuning, as Fig.3 shows.

  • Pre-training stage: a large model is trained on a large dataset.
  • Fine-tuning stage: the target model is initialized layer-wise from a pre-trained large model and is trained on a target dataset for a few iterations.


Fig.3. Fine-tuner Architecture
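
To make the two-stage flow concrete, here is a minimal PyTorch sketch of the fine-tuning stage (illustrative only, not the Fine-tuner API itself): a ResNet50 is initialized from publicly available pre-trained weights, its classification head is replaced to match the target label space, and only a short training run on the target dataset is needed. The experiments below use ImageNet-21K pre-training [8]; this sketch uses torchvision's ImageNet-1K weights for simplicity.

import torch
import torch.nn as nn
from torchvision import models

num_target_classes = 100  # e.g., CIFAR100

# Initialize the target model layer-wise from a pre-trained large model.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Replace the classification head to match the target label space.
model.fc = nn.Linear(model.fc.in_features, num_target_classes)

# Optionally freeze the backbone so only the new head is trained at first.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("fc")

optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=0.01, momentum=0.9
)
criterion = nn.CrossEntropyLoss()
# The usual training loop then runs for only a few epochs on the target dataset.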

By enabling Fine-tuner, we significantly reduce the training cost as well as the amount of labeled data required for advanced models. However, this mechanism has two limitations: (1) lack of flexibility: the architecture of the pre-trained model must remain unchanged during fine-tuning; (2) labeled target data requirement: labeled data must be provided during fine-tuning, which may be difficult to collect for some tasks.

To solve the first problem, we can turn to another transfer learning technology: knowledge distillation. The second problem can be addressed with domain adaptation technology.

Distiller with Knowledge Distillation Technology

To leverage the advantages of large-scale pre-trained models, knowledge distillation was proposed in [5]; it transfers knowledge from a heavy model (teacher model) to a light one (student model) to improve the light model’s performance without introducing extra cost. Based on this technology, we developed the “Distiller” module in Model Adapter, which includes an easy-to-use implementation of knowledge distillation algorithms.

The Distiller architecture is shown in Fig.4. For a target dataset, we first prepare a well-performing teacher model, which can easily be downloaded from an existing rich resource library such as Hugging Face or Timm [6]. We then freeze the weights of the teacher model and train only the student model. The output of the teacher model serves as a soft label, and the student model learns to fit the soft label to transfer knowledge. A total loss combines the soft label and the hard label (i.e., ground truth), which helps the student model quickly converge to a better state.


Fig.4. Distiller Architecture
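
For reference, the snippet below is a minimal sketch of the standard distillation loss from [5] (not the exact Distiller implementation): a KL-divergence term on temperature-softened teacher outputs (the soft label) is blended with ordinary cross-entropy on the ground truth (the hard label). The temperature T and weight alpha are illustrative hyperparameters.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    # Soft-label term: match the temperature-softened teacher distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-label term: standard cross-entropy against the ground truth.
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1.0 - alpha) * hard

# In training, the frozen teacher's logits are computed under torch.no_grad()
# and only the student's parameters are updated.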

With Distiller, users can gain the following benefits:

  • Enjoy the benefits from the rich resources of pre-trained models.
  • Achieve better performance with smaller structures.
  • Transfer knowledge to any other model with the same output space, regardless of whether the model architectures are the same.

Distiller also has limitations: since the teacher and student models must be trained on a common dataset, finding a larger model pre-trained on the same dataset can be challenging. If only a different dataset is available, domain adaptation technology can be leveraged.

Domain Adapter with Domain Adaptation Technology

There is a constant demand to transfer knowledge from a labeled source domain to an unlabeled target domain. However, models still suffer performance degradation due to the distribution shift between the source and target domains. Domain adaptation was proposed to solve this problem. Many domain adaptation methods have been developed, which aim to embed both source-domain and target-domain data into a common representation space so that their distributions become similar.

The architecture of the “Domain Adapter” module is shown in Fig.5; it is similar to that of Domain-Adversarial Neural Networks (DANN) [7]. In Domain Adapter, the classification loss guides the model to make accurate predictions on source-domain data; meanwhile, the discrimination loss forces the model to learn similar representations for the source and target domains, so that the model cannot distinguish which domain a sample belongs to.


Fig.5. Domain Adapter Architecture
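
To make the two losses concrete, here is a minimal DANN-style training step in PyTorch (illustrative only, not the exact Domain Adapter code): a gradient-reversal layer lets a single backward pass minimize the classification loss on labeled source samples while the discrimination loss pushes the feature extractor toward domain-invariant representations. The feature_extractor, classifier, and discriminator modules are assumed to be supplied by the user, with the discriminator producing 2-way domain logits.

import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; flips (and scales) gradients in the backward pass.
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def dann_step(feature_extractor, classifier, discriminator, xs, ys, xt, lambd=1.0):
    fs, ft = feature_extractor(xs), feature_extractor(xt)
    # Classification loss: accurate predictions on labeled source-domain data.
    cls_loss = F.cross_entropy(classifier(fs), ys)
    # Discrimination loss: the discriminator tries to tell domains apart, while the
    # reversed gradients train the feature extractor to make them indistinguishable.
    feats = torch.cat([fs, ft])
    domains = torch.cat([torch.zeros(len(fs)), torch.ones(len(ft))]).long().to(feats.device)
    disc_loss = F.cross_entropy(discriminator(GradReverse.apply(feats, lambd)), domains)
    return cls_loss + disc_loss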

With the help of domain adaptation, we can transfer knowledge from the source domain to the target-domain data while requiring fewer labels or even no labels. However, domain adaptation also has some limitations: (1) it can induce accuracy regression in some cases; (2) it highly depends on the source-domain dataset, which might not be available for some tasks.

Performance Evaluation

System Configurations

  • Test Date: Test by Intel as of 02/2023
  • Manufacturer: Inspur
  • CPU: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
  • # of Nodes: 1 for Fine-tuner, 4 for Distiller/Domain Adapter
  • CPU per node: 32 cores/socket, 2 sockets, 2 threads/core
  • Memory: 512GB (16x32GB DDR4 3200 MT/s [3200 MT/s])
  • Storage: 1x 240GB INTEL SSDSCKKB24, 1x 1TB INTEL SSDPE2KX010T8
  • Network: MT27700 Family [ConnectX-4]
  • PyTorch: 1.12.0

Testing Methodology

For Fine-tuner and Distiller, we took image classification as an example, showing how to establish a good classification model on the CIFAR100 dataset with the help of a ResNet50 pre-trained on the ImageNet21K dataset [8]. For Fine-tuner, we compared two ResNet50 models, one fine-tuned and one trained from scratch, on the CIFAR100 dataset, measuring the training time needed to reach the target classification accuracy (0.7841). For Distiller, we applied knowledge distillation from the pre-trained ResNet50 to a ResNet18 and compared it with a naïve ResNet18 trained from scratch on the CIFAR100 dataset, again measuring the training time needed to reach the target classification accuracy (0.763).

For Domain Adapter, we took semantic segmentation as an example, showing how to transfer knowledge from the source AMOS22 [9] dataset to the target KiTS19 [10] dataset. Specifically, the task was to explore semantic segmentation on the unlabeled KiTS19 dataset with the help of the labeled AMOS22 dataset. We compared against a 3D-UNet model trained from scratch on the KiTS19 dataset, measuring the training time needed to reach the target dice score (0.902).

Overall Performance of Model Adapter

All three modules delivered over 10x training time acceleration. As shown in Fig.6, Fine-tuner delivered 168x training time acceleration on ResNet50, Distiller delivered 11x training time acceleration on the CIFAR100 dataset, and Domain Adapter with the 3D-UNet model delivered 20x training time acceleration on the KiTS19 dataset, with a 0.4% dice score regression.


Fig.6. Overall Performance of Model Adapter

Training Convergence and Label Efficiency of Model Adapter


Fig.7a. Training Convergence of Fine-tuner and Distiller

The training convergence process is plotted in Fig.7: the x-axis is the training epoch, and the y-axis is the evaluation metric. We can see that the models optimized by Model Adapter converge much faster than the stock models. As shown in Fig.8, when increasing the label ratio of the target domain, the dice score of Domain Adapter increases correspondingly, and with only 20% of the labels we could reach a satisfactory dice score (less than 2% dice score regression).


Fig.7b. Training Convergence of Domain Adapter

Fig.8. Dice Score of Domain Adapter over Ratio of Labels

Call to Action

Model Adapter is one of the components of e2eAIOK, and there are blogs for other components of e2eAIOK, e.g., the e2eAIOK overview and DE-NAS. Please star the e2eAIOK repo and stay tuned for the next post.

You can also use Intel Developer Cloud, where developers can test their software examples and models from anywhere in the world, as well as test before moving into production. Access the latest Intel CPUs, GPUs, FPGAs, and software.

Go to Intel Developer Cloud to learn more and sign up.

References

[1] https://github.com/intel/e2eaiok

[2] Han Xu, Zhang Zhengyan, Ding Ning, Gu Yuxian, Liu Xiao, Huo Yuqi, Qiu Jiezhong, Zhang Liang, Han Wentao, Huang Minlie, et al. 2021. Pre-trained models: Past, present and future. arXiv preprint arXiv:2106.07139

[3] http://jalammar.github.io/how-gpt3-works-visualizations-animations/

[4] https://huggingface.co

[5] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

[6] Timm: PyTorch image models, scripts, pretrained weights

[7] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2015

[8] https://github.com/Alibaba-MIIL/ImageNet21K

[9] https://amos22.grand-challenge.org/

[10] https://github.com/neheller/kits19

[11] https://www.image-net.org/

Notices & Disclaimers

Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex.

 

Originally published June 28, 2023 - Updated August 17, 2023.

 

 

 

About the Author
Jian Zhang is an AI software engineering manager at Intel, where he and his team primarily focus on end-to-end data analytics and AI pipeline optimizations on Intel platforms, implementing and optimizing end-to-end AI solutions on distributed CPU clusters and democratizing AI models to improve scalability and usability on commodity hardware. He has over 10 years of experience in performance analysis and optimization for many open source projects such as Xen, KVM, Swift, Ceph, Spark, and Hadoop. He earned a master's degree in Computer Science and Engineering at Shanghai Jiaotong University. He has multiple publications and has presented at the OpenStack Summit, Vault, Strata Data Conference, Cephalocon, OFA Workshop, Flash Memory Summit, Data + AI Summit, etc.