Multi-Model, Hardware-Aware Train-Free Neural Architecture Search

Tianyi_Liu · ‎04-27-2023

Introduction

Neural Architecture Search (NAS)^[1] is becoming an increasingly important technique for automatic neural architecture engineering because automatically constructed networks are on par with or outperform manually designed architectures. Recently, NAS approaches have made great progress in various domains. NAS with Convolutional Neural Networks (CNNs) is widely used in the Computer Vision (CV) domain^[2], while NAS with Recurrent Neural Networks (RNNs) has reached state-of-the-art performance in the Natural Language Processing (NLP) domain^[3]. This demonstrates the superiority of automated neural architecture design.

In this blog, we introduce the challenges of traditional NAS solutions and propose a multi-model, hardware-aware, train-free NAS to resolve those challenges. It includes a unified transformer-based search space, a hardware-aware search strategy, and a train-free score method. We also present how this solution improves the multi-domain models' performance on commodity CPU clusters.

Motivations

Challenges of Neural Architecture Search

Conventional NAS faces several challenges. First, it mostly targets a single domain, leveraging domain-specific knowledge to construct a unique search space for the target task, which results in poor cross-domain generalization. Second, traditional NAS consumes huge amounts of computation power. Multiple innovative approaches have been proposed to address this problem, such as one-shot NAS approaches that use a lightweight performance predictor, or zero-cost proxies for performance estimation that reduce the evaluation cost of candidate networks^[4]^[5]. However, those approaches are limited because they exploit only a few network characteristics. Finally, traditional NAS solutions are hard to deploy on resource-constrained hardware.

To summarize, the disadvantages of conventional solutions are:

Previous NAS approaches adopt one unique search space and supernet for one domain-specific task, which is hard to adapt to other domains and thus generalizes poorly.
Conventional NAS approaches are hardware-unaware, which makes deployment difficult, especially on resource-constrained devices such as edge devices.
The existing performance predictors in traditional NAS frameworks are computationally intensive and not suitable for commodity hardware.
Conventional NAS performance predictors are training-based and require performance evaluation on data, which is particularly challenging for tasks with huge datasets.
Existing zero-cost proxies in train-free NAS^[4]^[5] have some limitations. They consider only one or two aspects of a neural network’s characteristics and do not cover a comprehensive list such as the network’s expressivity, complexity, saliency, diversity, and latency. As a result, they perform better on a specific task but worse on other tasks or domains.

Multi-Model, Hardware-Aware, Train-Free NAS

To resolve those challenges, we propose a multi-model, hardware-aware, train-free NAS named DE-NAS that constructs compact model architectures for the target platform directly. First, DE-NAS constructs a transformer-based supernet from the unified search spaces for multiple domains. Then, a hardware-aware train-free scoring method evaluates the performance of each candidate architecture without training, and DE-NAS searches for structures that maximize the score using different pluggable search strategies. Finally, the best generated network is trained on domain-specific data or metadata to produce the best model that satisfies the given hardware requirements, such as latency or FLOPs. Figure 1 shows the DE-NAS architecture.

DE-NAS Architecture.png

Figure 1. DE-NAS Architecture.

Unified Search Space

DE-NAS Search Space.png

Figure 2. DE-NAS Search Space and Supernet Configuration.

One key innovative component in DE-NAS is a unified search space that supports multiple models. It leverages a transformer-based search space to support different models and eliminates the need to develop a new search space for each new domain or task, thus significantly improving developer productivity. Currently, Computer Vision (CV), Natural Language Processing (NLP), and Automatic Speech Recognition (ASR) search spaces are supported. The details of the search space are shown in Figure 2.

DE-NAS Unified Transformer.png

Figure 3. DE-NAS Unified Transformer.

Based on the characteristics of different domains, we construct the corresponding supernet with a unified skeleton, as shown in Figure 3. The building blocks of the transformer are shared across the domains, with different configurations for different domains (e.g., the number of transformer layers). This unified architecture leverages the capabilities of transformer structures and eliminates the duplicated effort of developing different supernets for NAS.

Hardware-Aware Search Strategy

We implement a search strategy that generates candidate architectures maximizing the DE-Score. The search strategies are pluggable, and latency is integrated into the train-free DE-Score as an indicator. Hardware awareness is implemented from two perspectives: (1) Thresholds on model parameter size and model latency, which the underlying search engine uses to filter, in a coarse-grained manner, for models suitable for the hardware platform. (2) Latency metrics integrated into the train-free score, which yields latency-guaranteed architectures for a given platform.
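The coarse-grained filter in perspective (1) can be sketched in a few lines of Python. This is a minimal illustration, not the DE-NAS implementation: the helper names `measure_latency` and `satisfies_thresholds`, the warm-up and run counts, and the toy forward pass are all assumptions made for the example.

```python
import statistics
import time
from typing import Callable


def measure_latency(forward: Callable[[], object], warmup: int = 3, runs: int = 10) -> float:
    """Return the median wall-clock time in seconds of one call to `forward`."""
    for _ in range(warmup):  # warm-up calls are executed but not timed
        forward()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        forward()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def satisfies_thresholds(param_count: int, latency: float,
                         max_params: int, max_latency: float) -> bool:
    """Coarse-grained filter: keep a candidate only if it fits both budgets."""
    return param_count <= max_params and latency <= max_latency


def toy_forward():
    # Hypothetical stand-in for a candidate network's forward pass.
    return sum(i * i for i in range(1000))


latency = measure_latency(toy_forward)
keep = satisfies_thresholds(param_count=5_000_000, latency=latency,
                            max_params=10_000_000, max_latency=1.0)
```

Taking the median over several timed runs is one common way to reduce noise from the first calls; the latency has to be measured on the target hardware for the filter to be meaningful.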

Algorithm: Hardware-aware Evolution Algorithm Search

Require: number of iterations N, number of architectures to sample K, search space S, size of mutation Nm, size of crossover Nc, mutation probability p, parameter threshold Ts, latency threshold Tl

1: P(0) = Initial_population(K, S, Ts, Tl)
2: Topk = ∅
3: for i = 1:N do
4:     DE-Score = DE-Score(P(i-1))
5:     Topk = update_topk(P(i-1), DE-Score, Topk)
6:     P_mutation = Mutation(Nm, S, Ts, Tl, p, Topk)
7:     P_crossover = Crossover(Nc, S, Ts, Tl, Topk)
8:     P(i) = P_mutation ∪ P_crossover
9: end
10: Return the network architecture with the best DE-Score in Topk

Table 1. Hardware-aware search strategy.

An Evolution Algorithm (EA) is shown in Table 1 as an example of a hardware-aware search algorithm. The parameter and latency thresholds are passed to the mutation and crossover functions to select valid candidates: a candidate is kept as suitable for the hardware only if its parameter count and its inference time measured on the target hardware both satisfy the thresholds.
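The loop in Table 1 can be written as a short, self-contained Python sketch. It follows the same steps (score the population, update the top-k, then build the next population from mutation and crossover under the hardware filter), but everything concrete in it is an assumption for illustration: the function `evolution_search`, its default arguments, the dictionary encoding of an architecture, and the toy scoring and parameter-count functions stand in for the DE-Score and the real thresholds.

```python
import random


def evolution_search(score_fn, is_valid, search_space, n_iter=10, k=20,
                     n_mutation=10, n_crossover=10, p=0.3, top_k=5, seed=0):
    """Return the (score, architecture) pair with the best score found."""
    rng = random.Random(seed)

    def random_arch():
        return {name: rng.choice(opts) for name, opts in search_space.items()}

    def mutate(parents):
        parent = rng.choice(parents)
        # Resample each field with probability p, otherwise keep the parent's value.
        return {name: rng.choice(search_space[name]) if rng.random() < p else value
                for name, value in parent.items()}

    def crossover(parents):
        a, b = rng.choice(parents), rng.choice(parents)
        return {name: rng.choice((a[name], b[name])) for name in a}

    def sample_valid(make, max_tries=1000):
        # Reject candidates that violate the hardware thresholds.
        for _ in range(max_tries):
            cand = make()
            if is_valid(cand):
                return cand
        raise RuntimeError("no candidate satisfied the hardware thresholds")

    population = [sample_valid(random_arch) for _ in range(k)]
    top = []  # (score, architecture) pairs, best first
    for _ in range(n_iter):
        scored = [(score_fn(c), c) for c in population]
        top = sorted(top + scored, key=lambda t: t[0], reverse=True)[:top_k]
        parents = [c for _, c in top]
        population = (
            [sample_valid(lambda: mutate(parents)) for _ in range(n_mutation)]
            + [sample_valid(lambda: crossover(parents)) for _ in range(n_crossover)]
        )
    return top[0]


# Toy, hypothetical search space and proxies, for illustration only.
space = {"layers": [2, 4, 6, 8, 12], "hidden": [128, 256, 512, 768]}


def toy_params(arch):
    return 12 * arch["layers"] * arch["hidden"] ** 2


def toy_is_valid(arch):
    return toy_params(arch) <= 20_000_000  # stands in for Ts (and Tl)


def toy_score(arch):
    return arch["layers"] * arch["hidden"]  # stands in for the DE-Score


best_score, best_arch = evolution_search(toy_score, toy_is_valid, space)
```

As in Table 1, the population produced in the last iteration is not scored; the result is taken from the top-k set accumulated so far.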

Train-Free Score

The train-free score uses an innovative zero-cost “proxy” to predict model accuracy instead of full training and validation. It is a novel zero-cost metric that combines a Gaussian complexity score based on network expressivity, an NTK^[6] score based on network complexity, a nuclear norm score based on network diversity, a Synflow^[7] score based on network saliency, and a latency score based on network inference latency. Computing the DE-Score takes only a few forward inferences rather than iterative training, making it extremely fast, lightweight, and data-free. The overall DE-Score is calculated by the following equation:
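The equation itself is not reproduced in this text. Purely as an illustration of how five component scores might be combined into one number, the sketch below assumes a weighted sum in which latency is subtracted as a penalty. The weighted-sum form, the sign convention, the weight values, and the names `ComponentScores` and `de_score_sketch` are all assumptions for the example and should not be read as the actual DE-Score formula.

```python
from dataclasses import dataclass


@dataclass
class ComponentScores:
    expressivity: float  # e.g. a Gaussian complexity score
    complexity: float    # e.g. an NTK score
    diversity: float     # e.g. a nuclear norm score
    saliency: float      # e.g. a Synflow score
    latency: float       # measured inference latency, in seconds


def de_score_sketch(s: ComponentScores, weights: dict, latency_weight: float) -> float:
    """Hypothetical aggregation: weighted sum of quality terms minus a latency penalty."""
    quality = (
        weights["expressivity"] * s.expressivity
        + weights["complexity"] * s.complexity
        + weights["diversity"] * s.diversity
        + weights["saliency"] * s.saliency
    )
    return quality - latency_weight * s.latency


# Hypothetical weights and scores, chosen only to make the arithmetic easy to follow.
weights = {"expressivity": 1.0, "complexity": 1.0, "diversity": 1.0, "saliency": 1.0}
slow = ComponentScores(1.0, 2.0, 3.0, 4.0, latency=0.5)
fast = ComponentScores(1.0, 2.0, 3.0, 4.0, latency=0.1)

slow_score = de_score_sketch(slow, weights, latency_weight=2.0)
fast_score = de_score_sketch(fast, weights, latency_weight=2.0)
```

Under these assumptions, two candidates with identical quality terms are ranked by latency, which is the behavior a latency-aware score is meant to provide.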

Example Usage

This section introduces the usage of the DE-NAS APIs. The following is a step-by-step example of DE-NAS for NLP BERT:

Step 1. Define a configuration file for DE-NAS

User Configuration of DE-NAS.png

Figure 4. User configuration of DE-NAS for NLP BERT.

As shown in Figure 4, the YAML-format file describes the DE-NAS search configuration for BERT. It includes the type of search engine, search hyper-parameters (e.g., batch_size, select_num and population_num), DE-Score parameters (e.g., expressivity score weight and latency weight), and the supernet/search space configuration (e.g., model_type).

Step 2. Import related modules from Intel E2E AI Optimization kit

from e2eAIOK.DeNas.search.utils import parse_config, load_best_structure
from e2eAIOK.DeNas.nlp.supernet_bert import SuperBertModel
from e2eAIOK.DeNas.nlp.utils import generate_search_space
from e2eAIOK.DeNas.search.SearchEngineFactory import SearchEngineFactory
from e2eAIOK.DeNas.train import Trainer

Step 3. Load user configuration of DE-NAS

params = parse_config('e2eaiok_denas.conf')

Step 4. Construct the supernet and search space

super_net = SuperBertModel.from_pretrained(params)
search_space = generate_search_space(params["SEARCH_SPACE"])

Step 5. Instantiate a DE-NAS searcher

searcher = SearchEngineFactory.create_search_engine(
    params=params,
    super_net=super_net,
    search_space=search_space,
)

Step 6. Start the DE-NAS search process

searcher.search()
best_structure = searcher.get_best_structures()
print(f"DE-NAS completed, best structure is {best_structure}")

Step 7. Train and evaluate the DE-NAS searched model

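model = load_best_structure(best_structure)
# NOTE: `cfg` is not defined in the steps above; it is assumed here to be
# the training configuration, loaded separately.
trainer = Trainer(cfg, model)  # create DE-NAS trainer
trainer.fit()  # trigger the training process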

As shown in the above steps, users can easily use DE-NAS to construct a compact network structure, “best_structure”, that best suits the given platform’s latency and parameter size requirements.

Performance

To evaluate DE-NAS performance, we designed several experiments to examine: (1) whether DE-NAS can generate multi-domain models on the target hardware, (2) how it performs compared with stock models in different domains, and (3) how it performs compared with conventional NAS work.

System Configurations

The tests were conducted on a four-node cluster. Each node was equipped with two Intel Xeon Platinum 8358 CPUs and 512GB of memory, and the nodes were connected through 40Gb Ethernet. One 1TB P4500 NVMe SSD was deployed as a data drive. The detailed configuration is listed in Table 2.

Configuration	Details
Platform	NFS2580M6
CPU	Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
Number of Nodes	4
CPU Per Node	32 cores/socket, 2 sockets, 2 threads/core
Memory	512GB (16x32GB DDR4 3200 MT/s)
Storage	1x 240GB INTEL SSDSCKKB24, 1x 1TB INTEL SSDPE2KX010T8
Network	MT27700 Family [ConnectX-4]
Microcode	0xd000363
BIOS Version	06.00.05
OS/Hypervisor/SW	Red Hat Enterprise Linux 8.6

Table 2. System Configuration.

Test Methodology

To evaluate DE-NAS performance, we designed the following test cases:

DE-NAS comparison with stock models: performance comparison of DE-NAS models and multi-domain stock models.
DE-NAS comparison with SOTA NAS: performance comparison of DE-NAS and SOTA NAS (Zen-NAS and Autoformer).
DE-Score effectiveness validation: Spearman correlation of DE-score and F1 score on the NLP BERT models.

The detailed software configuration is shown in Table 3.

	CNN	ViT	NLP	ASR
Framework	PyTorch 1.12.0, IPEX 1.12.100
Base model	Stock model: ResNet50; SOTA NAS: Zen-NAS	Stock model: AutoFormer; SOTA NAS: AutoFormer	Stock model: BERT-Base	Stock model: RNN-T
Libraries	oneDNN 2022.2.0
Dataset	CIFAR10	CIFAR10	SQuAD v1.1	LibriSpeech
Precision	FP32
Docker Build Flags	e2eaiok/e2eaiok-pytorch112
KMP_AFFINITY	granularity=fine,compact,1,0
OMP_NUM_THREADS	28 (2 processes per node)
Target Metrics	Acc 0.94	Acc 0.94	F1 Score 87.71	WER 0.058
Training Methodology	200 epochs	200 epochs	2 epochs	Early stop at 5.8% WER
Command Line Used	python -u search.py --domain [CNN, bert, asr] --conf $CONFIG_FILE; python train.py --domain [CNN, bert, asr] --conf $CONFIG_FILE

Table 3. Software Configuration.

Experimental Results

Overview

DE-NAS Overall Performance.png

Figure 5. DE-NAS overall performance on multi-domain models.

Figure 5 shows the overall DE-NAS performance for CV, NLP and ASR models. The CNN, ViT, NLP and ASR models searched by DE-NAS delivered 9.86x, 4.44x, 7.68x and 59.12x training speedups over ResNet50, AutoFormer, BERT-Base and RNN-T respectively, with a smaller footprint and similar accuracy.

Comparison with SOTA NAS

DE-NAS Comparison with SOTA NAS.png

Figure 6. DE-NAS performance comparison with SOTA NAS.

As shown in Figure 6, DE-NAS CNN delivered 40.73x search and 82.57x training speedup over SOTA NAS (Zen-NAS^[4]) with 38% model size reduction and 5% accuracy. DE-NAS ViT delivered 35.63x search and 4.44x training speedup over SOTA NAS (AutoFormer^[9]) with a 5% accuracy loss.

Spearman Correlation Coefficient

Spearman Correlation.png

Figure 7. Spearman Correlation of DE-Score and F1 Score with and without Latency Score for NLP Domain.

In Figure 7, the Spearman rank correlation coefficient is used to measure the correlation between DE-Score and model accuracy. For instance, in the NLP domain, the Spearman correlation over 100 network candidates shows a positive correlation between DE-Score and BERT F1 score of 0.40 with the latency score and 0.57 without it, which demonstrates that DE-Score can evaluate model performance and model efficiency without any training or validation process.
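For readers unfamiliar with the metric, the Spearman coefficient is the Pearson correlation computed on the ranks of the two series, so it measures how well a proxy score orders the candidates rather than how well it predicts exact accuracy values. The following is a minimal pure-Python sketch; the function names and the five-candidate toy data are made up for illustration and are not the DE-NAS measurements.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    std_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (std_x * std_y)


# Toy data: proxy scores and accuracies for five hypothetical candidates.
proxy_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
accuracies = [0.82, 0.81, 0.86, 0.85, 0.90]

rho = spearman(proxy_scores, accuracies)
```

A value near 1 means the proxy ranks candidates almost exactly as trained accuracy would, while a value near 0 means the ranking carries little information.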

Call to Action

As one key component of the Intel® End-to-End AI Optimization Kit, DE-NAS is a hardware-aware, train-free neural architecture search solution that enables users to construct an optimized neural architecture for their specific hardware platform under a given search budget. DE-NAS leverages a zero-cost “proxy” based on multiple neural network characteristics to predict model accuracy instead of full training and validation. It demonstrates very promising results over stock models and SOTA NAS, along with an excellent correlation coefficient with training accuracy. If you would like to try it on your own problem, please visit the https://github.com/intel/e2eAIOK^[8] repo for more information.

Reference

[1] Elsken, Thomas, Jan Hendrik Metzen, and Frank Hutter. “Neural architecture search: A survey.” The Journal of Machine Learning Research 20.1 (2019): 1997-2017.
[2] Dong, Xuanyi, and Yi Yang. “Nas-bench-201: Extending the scope of reproducible neural architecture search.” arXiv preprint arXiv:2001.00326 (2020).
[3] Klyuchnikov, Nikita, et al. “NAS-Bench-NLP: neural architecture search benchmark for natural language processing.” IEEE Access 10 (2022): 45736-45747.
[4] Lin, Ming, et al. “Zen-nas: A zero-shot nas for high-performance image recognition.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
[5] Zhou, Qinqin, et al. “Training-free Transformer Architecture Search.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[6] Lee, Jaehoon, et al. “Wide neural networks of any depth evolve as linear models under gradient descent.” Advances in Neural Information Processing Systems 32 (2019).
[7] Tanaka, Hidenori, et al. “Pruning neural networks without any data by iteratively conserving synaptic flow.” Advances in Neural Information Processing Systems 33 (2020): 6377-6389.
[8] https://github.com/intel/e2eAIOK
[9] Chen, Minghao, et al. “Autoformer: Searching transformers for visual recognition.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

Notices & Disclaimers

Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.

Your costs and results may vary.

Intel technologies may require enabled hardware, software, or service activation.