Introduction
Neural Architecture Search (NAS)[1] is becoming an increasingly important technique for automatic neural architecture engineering because automatically constructed networks are on par with or outperform manually designed architectures. Recently, NAS approaches have made great progress in various domains: NAS-designed Convolutional Neural Networks (CNNs) are widely used in the Computer Vision (CV) domain[2], while NAS with Recurrent Neural Networks (RNNs) has reached state-of-the-art performance in the Natural Language Processing (NLP) domain[3]. This demonstrates the strength of automated neural architecture design.
In this blog, we introduce the challenges of traditional NAS solutions and propose a multi-model, hardware-aware, train-free NAS to resolve them. It comprises a unified transformer-based search space, a hardware-aware search strategy, and a train-free scoring method. We also show how this solution improves multi-domain model performance on commodity CPU clusters.
Motivations
Challenges of Neural Architecture Search
Conventional NAS suffers from several challenges. First, it mostly targeted a single domain, leveraging domain-specific knowledge to construct a unique search space for the target task, which results in poor cross-domain generalization. Second, traditional NAS consumed enormous compute. Several innovative approaches were proposed to address this problem, such as one-shot NAS approaches using a lightweight performance predictor, or zero-cost proxies for performance estimation that reduce the evaluation cost of candidate networks[4][5]. However, those approaches were limited in how fully they exploit network characteristics. Finally, traditional NAS solutions were hard to deploy on resource-constrained hardware.
To summarize, the disadvantages of conventional solutions are:
- Previous NAS adopts a unique search space and supernet for each domain-specific task, which is hard to adapt to other domains and thus generalizes poorly.
- Conventional NAS approaches are hardware-unaware, which poses deployment difficulties, especially on resource-constrained devices such as edge devices.
- The performance predictors used in traditional NAS frameworks are computationally intensive and not suitable for commodity hardware.
- Conventional NAS performance predictors are training-based and require performance evaluation on data, which is particularly challenging for tasks with huge datasets.
- Existing zero-cost proxies in train-free NAS[4][5] have limitations. They consider only one or two aspects of a neural network's characteristics and do not cover a comprehensive list of them, such as the network's expressivity, complexity, saliency, diversity, and latency. As a result, a proxy may perform well on a specific task but poorly on other tasks or domains.
Multi-Model, Hardware-Aware, Train-Free NAS
To resolve those challenges, we propose a multi-model, hardware-aware, train-free NAS named DE-NAS that constructs compact model architectures for the target platform directly. First, DE-NAS constructs a transformer-based supernet from a unified search space that covers multiple domains. Then, a hardware-aware, train-free scoring method (DE-Score) evaluates the performance of candidate architectures without training, and DE-NAS searches for structures that maximize this score using pluggable search strategies. Finally, the best generated network is trained on domain-specific data or metadata to produce the best model that satisfies the given hardware constraints, such as latency or FLOPS. Figure 1 shows the DE-NAS architecture.
Figure 1. DE-NAS Architecture.
Unified Search Space
Figure 2. DE-NAS Search Space and Supernet Configuration.
One key innovative component of DE-NAS is a unified search space that supports multiple models. It leverages a transformer-based search space to support different models and eliminates the need to develop a new search space for each new domain or task, significantly improving developer productivity. Currently, Computer Vision (CV), Natural Language Processing (NLP), and Automatic Speech Recognition (ASR) search spaces are supported. The details of the search space are shown in Figure 2.
Figure 3. DE-NAS Unified Transformer.
Based on the characteristics of the different domains, we construct the corresponding supernets from a unified skeleton, as shown in Figure 3. The transformer building blocks are shared across domains, with different configurations per domain (e.g., the number of transformer layers). This unified architecture leverages the capabilities of transformer structures and eliminates the duplicated effort of developing a separate supernet for each domain.
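To make the idea of a shared skeleton concrete, here is a minimal Python sketch in which one transformer encoder block serves all three domains and only the per-domain configuration changes; the class, function, and configuration values (DomainConfig, build_supernet, the layer/head/width numbers) are illustrative assumptions, not the actual DE-NAS supernet code.

from dataclasses import dataclass
import torch.nn as nn

@dataclass
class DomainConfig:
    # Hypothetical per-domain settings; the real DE-NAS supernet configs differ
    num_layers: int   # number of transformer layers for this domain
    num_heads: int    # attention heads per layer
    hidden_dim: int   # embedding width

# Same skeleton, different depth/width per domain (illustrative values only)
DOMAIN_CONFIGS = {
    "cv":  DomainConfig(num_layers=12, num_heads=6,  hidden_dim=384),
    "nlp": DomainConfig(num_layers=12, num_heads=12, hidden_dim=768),
    "asr": DomainConfig(num_layers=16, num_heads=8,  hidden_dim=512),
}

def build_supernet(domain: str) -> nn.Module:
    # Build a unified transformer skeleton from one shared encoder-layer block
    cfg = DOMAIN_CONFIGS[domain]
    layer = nn.TransformerEncoderLayer(d_model=cfg.hidden_dim, nhead=cfg.num_heads)
    return nn.TransformerEncoder(layer, num_layers=cfg.num_layers)

nlp_supernet = build_supernet("nlp")  # e.g., a BERT-style supernet skeleton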
Hardware-Aware Search Strategy
We implement pluggable search strategies that generate candidate architectures by maximizing the DE-Score, which innovatively integrates latency into the train-free score as an indicator. Hardware awareness is implemented from two perspectives: (1) thresholds on model parameter size and model latency let the underlying search engine filter candidates in a coarse-grained manner, keeping only models suitable for the hardware platform; (2) a latency metric integrated into the train-free score guides the search toward latency-guaranteed architectures for the given platform.
Algorithm: Hardware-aware Evolutionary Algorithm Search
Require: number of iterations N, number of architectures to sample K, search space S, mutation size Nm, crossover size Nc, mutation probability p, parameter threshold Ts, latency threshold Tl
1: P0 = Initial_population(K, S, Ts, Tl)
Table 1. Hardware-aware search strategy.
An Evolutionary Algorithm (EA), shown in Table 1, is one example of a hardware-aware search algorithm. The parameter and latency thresholds are fed into the mutation and crossover functions to select valid candidates: if the parameter count and inference time of a candidate, measured on the target hardware, satisfy the thresholds, it is accepted as a candidate architecture suitable for that hardware.
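The following is a minimal, self-contained Python sketch of this hardware-aware evolutionary loop. Everything in it is illustrative: the candidate representation and the helpers (random_candidate, count_params, measure_latency, de_score, mutate, crossover) are toy stand-ins, not the actual DE-NAS implementation.

import random

# --- Toy stand-ins (hypothetical, for illustration only) ---
# A "candidate" here is just a dict of architecture knobs; real DE-NAS
# candidates are transformer configurations from the unified search space.
def random_candidate():
    return {"layers": random.randint(4, 16),
            "hidden": random.choice([256, 384, 512, 768])}

def count_params(c):      # stand-in for the real parameter counter
    return c["layers"] * c["hidden"] ** 2

def measure_latency(c):   # stand-in for measured inference time on the target hardware
    return 0.01 * c["layers"] * c["hidden"] / 256

def de_score(c):          # stand-in for the train-free DE-Score
    return c["layers"] * c["hidden"] - 100.0 * measure_latency(c)

def mutate(c, p):
    child = dict(c)
    if random.random() < p:
        child["layers"] = max(4, child["layers"] + random.choice([-1, 1]))
    return child

def crossover(a, b):
    return {"layers": a["layers"], "hidden": b["hidden"]}

# --- Hardware-aware evolutionary loop (sketch of the algorithm in Table 1) ---
def is_valid(c, t_param, t_latency):
    # Coarse-grained hardware filter: parameter-count and latency thresholds
    return count_params(c) <= t_param and measure_latency(c) <= t_latency

def ea_search(n_iters, pop_size, n_mut, n_cross, p_mut, t_param, t_latency):
    population = []
    while len(population) < pop_size:                 # P0 = Initial_population(K, S, Ts, Tl)
        c = random_candidate()
        if is_valid(c, t_param, t_latency):
            population.append(c)
    for _ in range(n_iters):
        population.sort(key=de_score, reverse=True)   # rank by train-free DE-Score
        parents = population[: max(2, pop_size // 2)]
        children = []
        while len(children) < n_mut:                  # mutation admits only hardware-valid offspring
            c = mutate(random.choice(parents), p_mut)
            if is_valid(c, t_param, t_latency):
                children.append(c)
        while len(children) < n_mut + n_cross:        # crossover admits only hardware-valid offspring
            c = crossover(random.choice(parents), random.choice(parents))
            if is_valid(c, t_param, t_latency):
                children.append(c)
        population = parents + children
    return max(population, key=de_score)

best = ea_search(n_iters=10, pop_size=8, n_mut=4, n_cross=4,
                 p_mut=0.5, t_param=5e6, t_latency=0.2)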
Train-Free Score
The train-free score uses an innovative, zero-cost "proxy" to predict model accuracy instead of full training and validation. It combines a Gaussian complexity score measuring network expressivity, an NTK[6] score measuring network complexity, a nuclear norm score measuring network diversity, a Synflow[7] score measuring network saliency, and a latency score measuring network inference latency. Computing the DE-Score takes only a few forward passes rather than iterative training, making it extremely fast, lightweight, and data-free. The overall DE-Score is a weighted combination of these component scores.
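The exact weighting used in DE-NAS is not reproduced here; as a sketch, assuming a simple weighted sum in which the weights α_i correspond to the per-component weights exposed in the DE-NAS configuration (e.g., the expressivity and latency weights used in the Example Usage section below), the score takes a form such as:

\mathrm{DE\text{-}Score} \;=\; \alpha_1 S_{\mathrm{expressivity}} \;+\; \alpha_2 S_{\mathrm{complexity}} \;+\; \alpha_3 S_{\mathrm{diversity}} \;+\; \alpha_4 S_{\mathrm{saliency}} \;-\; \alpha_5 S_{\mathrm{latency}}

where the latency term is written with a negative sign so that, in this sketch, slower networks are penalized.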
Example Usage
This section introduces the usage of the DE-NAS APIs; the following is a step-by-step DE-NAS example for NLP BERT:
Step 1. Define a configuration file for DE-NAS
Figure 4. User configuration of DE-NAS for NLP BERT.
As shown in Figure 4, the YAML-format configuration file describes the DE-NAS search configuration for BERT. It includes the type of search engine, search hyperparameters (e.g., batch_size, select_num, and population_num), DE-Score parameters (e.g., the expressivity score weight and latency weight), and the supernet/search space configuration (e.g., model_type).
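For orientation, such a configuration parses into a parameter dictionary roughly like the Python sketch below; the key names and values are illustrative assumptions and may not match the exact schema of the shipped e2eaiok_denas.conf.

# Illustrative only: key names and defaults may differ from the real configuration file
example_params = {
    "model_type": "bert",                          # selects the NLP/BERT supernet and search space
    "search_engine": "EvolutionarySearchEngine",   # pluggable search strategy (hypothetical name)
    "batch_size": 32,                              # search hyperparameters
    "select_num": 50,
    "population_num": 100,
    "expressivity_weight": 1.0,                    # DE-Score component weights
    "latency_weight": 0.1,
}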
Step 2. Import related modules from Intel E2E AI Optimization kit
from e2eAIOK.DeNas.search.utils import parse_config, load_best_structure
from e2eAIOK.DeNas.nlp.supernet_bert import SuperBertModel
from e2eAIOK.DeNas.nlp.utils import generate_search_space
from e2eAIOK.DeNas.search.SearchEngineFactory import SearchEngineFactory
from e2eAIOK.DeNas.train import Trainer
Step 3. Load user configuration of DE-NAS
params = parse_config('e2eaiok_denas.conf')
Step 4. Construct the supernet and search space
super_net = SuperBertModel.from_pretrained(params)
search_space = generate_search_space(params["SEARCH_SPACE"])
Step 5. Instantiate a DE-NAS searcher
searcher = SearchEngineFactory.create_search_engine(params=params, super_net=super_net, search_space=search_space)
Step 6. Start the DE-NAS search process
searcher.search()
best_structure = searcher.get_best_structures()
print(f"DE-NAS completed, best structure is {best_structure}")
Step 7. Train and evaluate the DE-NAS searched model
model = load_best_structure(best_structure)
trainer = Trainer(params, model) # create the DE-NAS trainer with the parsed configuration
trainer.fit() # trigger the training process
As shown in the above steps, the user can easily use DE-NAS to construct a compact network structure, "best_structure", that best suits the given platform's latency and parameter-size requirements.
Performance
To evaluate DE-NAS performance, we designed several experiments to demonstrate: (1) whether DE-NAS can generate multi-domain models on the target hardware, (2) how it performs compared with stock models in different domains, and (3) how it performs compared with conventional NAS work.
System Configurations
The tests were conducted on a four-node cluster; each node was equipped with two Xeon Platinum 8358 CPUs and 512 GB of memory, and the nodes were connected through 40 Gb Ethernet. One 1 TB P4500 NVMe SSD was deployed as the data drive. The detailed configuration is listed in Table 2.
Configuration | Details
Platform | NFS2580M6
CPU | Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
Number of Nodes | 4
CPU Per Node | 32 cores/socket, 2 sockets, 2 threads/core
Memory | 512GB (16x 32GB DDR4 3200 MT/s)
Storage | 1x 240GB INTEL SSDSCKKB24, 1x 1TB INTEL SSDPE2KX010T8
Network | MT27700 Family [ConnectX-4]
Microcode | 0xd000363
BIOS Version | 06.00.05
OS/Hypervisor/SW | Red Hat Enterprise Linux 8.6
Table 2. System Configuration.
Test Methodology
To evaluate DE-NAS performance, we designed the following test cases:
- DE-NAS comparison with stock models: performance comparison of DE-NAS models and multi-domain stock models.
- DE-NAS comparison with SOTA NAS: performance comparison of DE-NAS and SOTA NAS (Zen-NAS and Autoformer).
- DE-Score effectiveness validation: Spearman correlation of DE-score and F1 score on the NLP BERT models.
The detailed software configuration is shown in Table 3.
Domain | CNN | ViT | NLP | ASR
Framework | PyTorch 1.12.0, IPEX 1.12.100 (all domains)
Base model | Stock model: ResNet50; SOTA NAS: Zen-NAS | Stock model: AutoFormer; SOTA NAS: AutoFormer | Stock model: BERT-Base | Stock model: RNN-T
Libraries | oneDNN 2022.2.0 (all domains)
Dataset | CIFAR10 | CIFAR10 | SQuAD v1.1 | LibriSpeech
Precision | FP32 (all domains)
Docker Image | e2eaiok/e2eaiok-pytorch112 (all domains)
KMP_AFFINITY | granularity=fine, compact, 1, 0 (all domains)
OMP_NUM_THREADS | 28 (2 processes per node, all domains)
Target Metric | Acc 0.94 | Acc 0.94 | F1 Score 87.71 | WER 0.058
Training Methodology | 200 epochs | 200 epochs | 2 epochs | Early stop at 5.8% WER
Command Line Used | python -u search.py --domain [CNN, bert, asr] --conf $CONFIG_FILE; python train.py --domain [CNN, bert, asr] --conf $CONFIG_FILE
Table 3. Software Configuration.
Experimental Results
Overview
Figure 5. DE-NAS overall performance on multi-domain models.
Figure 5 shows the overall DE-NAS performance for CV, NLP, and ASR models. The DE-NAS-searched CNN, ViT, NLP, and ASR models delivered 9.86x, 4.44x, 7.68x, and 59.12x training speedups over ResNet50, AutoFormer, BERT-Base, and RNN-T, respectively, with smaller footprints and similar accuracy.
Comparison with SOTA NAS
Figure 6. DE-NAS performance comparison with SOTA NAS.
As shown in Figure 6, DE-NAS CNN delivered a 40.73x search speedup and an 82.57x training speedup over SOTA NAS (Zen-NAS[4]) with a 38% model size reduction and a 5% accuracy loss. DE-NAS ViT delivered a 35.63x search speedup and a 4.44x training speedup over SOTA NAS (AutoFormer[9]) with a 5% accuracy loss.
Spearman Correlation Coefficient
Figure 7. Spearman Correlation of DE-Score and F1 Score with and without Latency Score for NLP Domain.
In Figure 7, the Spearman rank correlation coefficient is used to measure the correlation between the DE-Score and model accuracy. In the NLP domain, for instance, the Spearman correlation over 100 network candidates showed a positive correlation between the DE-Score and the BERT F1 score (0.40 with the latency score and 0.57 without it), which demonstrates the effectiveness of the DE-Score for evaluating model performance and model efficiency without any training or validation.
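For readers who want to run the same kind of correlation analysis on their own candidate pools, the coefficient can be computed with SciPy as in the sketch below; the two score lists are made-up placeholders, not the data behind Figure 7.

from scipy.stats import spearmanr

# Placeholder data: in the real experiment, de_scores would hold the train-free
# DE-Score of 100 sampled BERT candidates and f1_scores their fine-tuned
# SQuAD v1.1 F1 scores; the numbers below are illustrative only.
de_scores = [12.3, 15.1, 9.8, 14.2, 11.7]
f1_scores = [85.2, 88.1, 83.4, 87.0, 86.1]

rho, p_value = spearmanr(de_scores, f1_scores)
print(f"Spearman correlation: {rho:.2f} (p={p_value:.3f})")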
Call to Action
As one key component of the Intel® End-to-End AI Optimization Kit, DE-NAS is a hardware-aware, train-free neural architecture search solution that enables users to construct optimized neural architectures for their specific hardware platform under a given search budget. DE-NAS leverages a zero-cost "proxy" based on multiple neural network characteristics to predict model accuracy instead of full training and validation, and it demonstrates very promising results over stock models and SOTA NAS, as well as an excellent correlation with training accuracy. To try it on your own problem, please visit the https://github.com/intel/e2eAIOK[8] repository for more information.
Reference
[1] Elsken, Thomas, Jan Hendrik Metzen, and Frank Hutter. "Neural Architecture Search: A Survey." The Journal of Machine Learning Research 20.1 (2019): 1997-2017.
[2] Dong, Xuanyi, and Yi Yang. "NAS-Bench-201: Extending the Scope of Reproducible Neural Architecture Search." arXiv preprint arXiv:2001.00326 (2020).
[3] Klyuchnikov, Nikita, et al. "NAS-Bench-NLP: Neural Architecture Search Benchmark for Natural Language Processing." IEEE Access 10 (2022): 45736-45747.
[4] Lin, Ming, et al. "Zen-NAS: A Zero-Shot NAS for High-Performance Image Recognition." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
[5] Zhou, Qinqin, et al. "Training-free Transformer Architecture Search." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[6] Lee, Jaehoon, et al. "Wide Neural Networks of Any Depth Evolve as Linear Models under Gradient Descent." Advances in Neural Information Processing Systems 32 (2019).
[7] Tanaka, Hidenori, et al. "Pruning Neural Networks without Any Data by Iteratively Conserving Synaptic Flow." Advances in Neural Information Processing Systems 33 (2020): 6377-6389.
[8] https://github.com/intel/e2eAIOK
[9] Chen, Minghao, et al. "AutoFormer: Searching Transformers for Visual Recognition." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
Notices & Disclaimers
Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex.
Performance results are based on testing as of the dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software, or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.