Key Takeaways
- Learn how Intel technology accelerates low-precision inference of deep learning workloads on 2nd and 3rd Gen Intel Xeon Scalable processors.
- A step-by-step tutorial on using the Intel Low Precision Optimization Tool to quickly develop low-precision inference solutions on Intel platforms.
Deep neural networks (DNNs) show state-of-the-art (SOTA) accuracy in a wide range of computation tasks. However, they still face challenges in industrial deployment due to the high computational complexity of inference. Low precision is one of the key techniques being actively studied to overcome this problem. With hardware acceleration support, low-precision inference can compute more operations per second, reduce memory access pressure, make better use of the cache, and deliver higher throughput and lower latency.
In this document, we introduce the Intel® Low Precision Optimization Tool, which aims to help Intel customers deploy low-precision inference solutions easily and rapidly across multiple deep learning frameworks (TensorFlow, PyTorch, and MXNet).
Introduction
Intel® Low Precision Optimization Tool is an open-source Python library that helps users quickly deploy low-precision inference solutions on popular deep learning frameworks, including TensorFlow, PyTorch, and MXNet. Intel® Low Precision Optimization Tool v1.0 alpha was released recently, featuring:
· Built-in tuning strategies, including Basic, Bayesian, and MSE
· Built-in evaluation metrics, including TopK (image classification), F1 (NLP), and CocoMAP (object detection)
· Built-in tuning objectives, including Performance, Model Size, and Footprint
· Extensible API design for adding new strategies, framework backends, metrics, and objectives
· KL-divergence calibration for TensorFlow and MXNet
· Resuming the tuning process from a checkpoint
Intel® DL Boost
Intel® DL Boost is built into 2nd Gen Intel® Xeon® Scalable processors. Based on Intel® Advanced Vector Extensions 512 (Intel® AVX-512), Intel® DL Boost Vector Neural Network Instructions (VNNI) deliver a significant performance improvement by combining three instructions into one, thereby maximizing the use of compute resources, utilizing the cache better, and avoiding potential bandwidth bottlenecks.
Starting with 2nd Gen Intel® Xeon® Scalable processors, Intel® DL Boost provides a theoretical peak speedup of 4x for INT8 inference compared to FP32 inference. Developers can use this tool to convert an FP32 trained model into an INT8 quantized model. The INT8 model benefits from Intel® DL Boost acceleration when it replaces the FP32 model for inference on 2nd Gen Intel® Xeon® Scalable processors.
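To make the instruction fusion concrete, the sketch below (plain NumPy, purely illustrative; the values are made up, and the VNNI instruction itself is emitted by the optimized framework kernels, not by user code) emulates the arithmetic of one fused multiply-accumulate step: four u8 × s8 products summed into a single 32-bit accumulator lane. A 512-bit register holds 16 such lanes, which is where the theoretical 4x INT8 throughput over FP32 comes from.
import numpy as np

# Four unsigned 8-bit activations and four signed 8-bit weights per 32-bit lane
activations = np.array([12, 200, 7, 90], dtype=np.uint8)
weights = np.array([-3, 5, 127, -8], dtype=np.int8)

# One VNNI-style fused step: multiply the byte pairs and accumulate into int32
acc = int(np.dot(activations.astype(np.int32), weights.astype(np.int32)))
print(acc)  # -36 + 1000 + 889 - 720 = 1133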
Easy quantization
Intel® Low Precision Optimization Tool provides an easy way to enable quantization from scratch. Given an FP32 model for deployment, users can produce the quantized model in two steps:
· Configure the yaml file. It defines the tuning configuration and model-specific information. Here is a sample yaml file for TensorFlow MobileNet v1.0:
framework:
  - name: tensorflow
    inputs: input                               # tensorflow only
    outputs: MobilenetV1/Predictions/Reshape_1  # tensorflow only

tuning:
  metric:
    - topk: 1
  accuracy_criterion:
    - relative: 0.01
  timeout: 3600       # tuning time (seconds)
  random_seed: 9527
With these settings, the tool searches for the quantized model that delivers the best inference performance while staying within a 1% relative accuracy loss from the FP32 baseline, within a time budget of 3600 seconds.
· Call Tuner.tune(). This API is the main entry point for automatic tuning and is defined as follows:
class Tuner(object):
    def tune(self, model, q_dataloader, q_func=None,
             eval_dataloader=None, eval_func=None, resume_file=None):
The Intel® Low Precision Optimization Tool v1.0 alpha release supports two usages:
a) The user specifies the FP32 "model", the calibration dataset "q_dataloader", the evaluation dataset "eval_dataloader", and the accuracy metrics in the tuning.metric field of the yaml config file.
This is designed for seamless enablement of DL model tuning with the tool, leveraging its pre-defined accuracy metrics. We expect this to be the most common usage. It currently works well for most image classification models, and we are improving the tool to cover more workload categories.
b) The user specifies the FP32 "model", the calibration dataset "q_dataloader", and a custom "eval_func" that encapsulates the evaluation dataset and accuracy metric itself.
This is designed for ease of tuning enablement for models with custom metric evaluation or metrics not yet supported by the tool. Currently this usage works for object detection and NLP networks.
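Schematically, the two usages differ only in which evaluation inputs are handed to tune(). In the sketch below, fp32_model, calib_loader, eval_loader, my_eval_func, and conf.yaml are placeholders for illustration:
import ilit

tuner = ilit.Tuner("conf.yaml")  # conf.yaml: a tuning config file as described above

# Usage a): rely on the built-in metric declared in the yaml file
q_model = tuner.tune(fp32_model, q_dataloader=calib_loader,
                     eval_dataloader=eval_loader)

# Usage b): supply a custom evaluation function instead
q_model = tuner.tune(fp32_model, q_dataloader=calib_loader,
                     eval_func=my_eval_func)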
Example 1
Below are step-by-step instructions for enabling easy quantization for TensorFlow ResNet50 V1.5 using the first usage.
Prepare Yaml File
Copy examples/template.yaml to the working directory and keep the mandatory items. Here is the yaml file for TensorFlow ResNet50 V1.5:
framework:
  - name: tensorflow
    inputs: input_tensor
    outputs: softmax_tensor

tuning:
  metric:
    - topk: 1
  accuracy_criterion:
    - relative: 0.01
  timeout: 0
  random_seed: 9527
Here we choose the built-in topk metric and set the accuracy target to tolerate a relative 1% accuracy loss from the baseline. The default tuning strategy is basic. A timeout of 0 means the tuning stops early as soon as a tuning configuration meets the accuracy target.
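For intuition, the relative criterion can be read as the following check (a hypothetical snippet for illustration, not the tool's internal implementation; the baseline number is made up):
# Illustration of the relative accuracy criterion ("relative: 0.01")
fp32_accuracy = 0.712        # example FP32 baseline (top-1)
relative_tolerance = 0.01

def meets_criterion(int8_accuracy):
    # The INT8 model may lose at most 1% of the FP32 baseline accuracy
    return (fp32_accuracy - int8_accuracy) / fp32_accuracy <= relative_tolerance

print(meets_criterion(0.706))  # True:  0.84% relative loss
print(meets_criterion(0.700))  # False: 1.69% relative loss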
Code Changes:
1. Import the ilit Python package
2. Create a Tuner object using the yaml file
3. Invoke tuner.tune() with the calibration dataloader
import ilit

tuner = ilit.Tuner(self.args.config)
dataloader = Dataloader(self.args.data_location, 'validation',
                        RESNET_IMAGE_SIZE, RESNET_IMAGE_SIZE,
                        self.args.batch_size,
                        num_cores=self.args.num_cores,
                        resize_method='crop')
q_model = tuner.tune(self.args.input_graph, q_dataloader=dataloader,
                     eval_func=None, eval_dataloader=dataloader)
Example 2
Below are step-by-step instructions for enabling easy quantization for MXNet SSD-ResNet50 v1.0 using the second usage. This usage relies on a user-provided eval_func() to perform the evaluation.
Prepare Yaml File
Here is the yaml file for MXNet SSD-ResNet50 v1.0:
framework:
  - name: mxnet

tuning:
  accuracy_criterion:
    - relative: 0.01
  timeout: 0          # 0 means early stop
  random_seed: 9527
Here we set the accuracy target to tolerate a relative 1% accuracy loss from the baseline. The default tuning strategy is basic. A timeout of 0 means the tuning stops early as soon as a tuning configuration meets the accuracy target.
Code Changes:
1. Import the ilit Python package
2. Create a Tuner object using the yaml file
3. Implement eval_func() as shown below
4. Invoke tuner.tune() with the calibration dataloader; here we reuse the existing validation dataloader as the calibration dataloader
import mxnet as mx
import ilit

# Build the validation dataset and dataloader once; the dataloader is also
# reused as the calibration dataloader.
val_dataset, val_metric = get_dataset(args.dataset, args.data_shape)
val_data = get_dataloader(
    val_dataset, args.data_shape, args.batch_size, args.num_workers)

def eval_func(graph):
    classes = val_dataset.classes  # class names
    size = len(val_dataset)
    ctx = [mx.cpu()]
    results = validate(graph, val_data, ctx, classes, size, val_metric)
    mAP = float(results[-1][-1])
    return mAP

tuner = ilit.Tuner("./ssd.yaml")
quantized_model = tuner.tune(net, q_dataloader=val_data,
                             eval_dataloader=val_data, eval_func=eval_func)
Tuning Results
The Intel® Low Precision Optimization Tool v1.0 alpha release already supports 30 deep learning workloads, covering popular use cases including image classification, object detection, NLP, and recommendation systems. The tables below show the results on three Intel-optimized frameworks on CLX8280 with TSX disabled. For detailed reproduction steps, please refer to this link.
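The relative accuracy drop column in the tables is simply [(INT8-FP32)/FP32]; for instance, checking it against the MXNet ResNet50 V1 row:
int8_acc, fp32_acc = 76.40, 76.80
print(f"{(int8_acc - fp32_acc) / fp32_acc:+.2%}")  # -0.52%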
Future Works
We plan to add more sophisticated tuning strategies and metrics to facilitate accuracy-driven tuning more effectively. We are also exploring quantization support for more backends.
Please try this tool if you want to deploy a low-precision solution quickly. You are also very welcome to submit a feature request or an issue via ilit.maintainers@intel.com.
MXNet V1.6.x

| Model | Tuning Strategy | INT8 Tuning Accuracy | FP32 Accuracy Baseline | Relative Accuracy Drop [(INT8-FP32)/FP32] | INT8/FP32 Speedup |
|---|---|---|---|---|---|
| ResNet50 V1 | mse | 76.40% | 76.80% | -0.52% | 3.73x |
| MobileNet V1 | mse | 71.60% | 72.10% | -0.69% | 3.02x |
| MobileNet V2 | mse | 71.00% | 71.10% | -0.14% | 3.88x |
| SSD-ResNet50 | basic | 29.50% | 29.70% | -0.67% | 1.86x |
| SqueezeNet V1 | mse | 57.30% | 57.20% | 0.18% | 2.88x |
| ResNet18 | mse | 70.50% | 70.40% | 0.14% | 2.98x |
| Inception V3 | mse | 78.20% | 78.00% | 0.26% | 3.35x |
TensorFlow v1.15.2

| Model | Tuning Strategy | INT8 Tuning Accuracy | FP32 Accuracy Baseline | Relative Accuracy Drop [(INT8-FP32)/FP32] | INT8/FP32 Speedup |
|---|---|---|---|---|---|
| ResNet50 V1 | mse | 73.28% | 73.54% | -0.35% | 2.99x |
| ResNet50 V1.5 | bayesian | 75.70% | 76.26% | -0.73% | 1.95x |
| ResNet101 | basic | 76.68% | 75.58% | 1.46% | 3.03x |
| Inception V1 | basic | 69.54% | 69.48% | 0.09% | 2.18x |
| Inception V2 | basic | 74.32% | 74.38% | -0.08% | 1.69x |
| Inception V3 | basic | 76.54% | 76.90% | -0.47% | 2.02x |
| Inception V4 | basic | 79.74% | 80.12% | -0.47% | 3.40x |
| ssd_resnet50_v1 | basic | 37.80% | 38.01% | -0.55% | 1.82x |
PyTorch v1.5.0

| Model | Tuning Strategy | INT8 Tuning Accuracy | FP32 Accuracy Baseline | Relative Accuracy Drop [(INT8-FP32)/FP32] | INT8/FP32 Speedup |
|---|---|---|---|---|---|
| DLRM | basic | 80.21% | 80.27% | -0.08% | 1.87x |
| BERT-Large MRPC | basic | 87.90% | 88.30% | -0.45% | 2.38x |
| BERT-Large SQUAD | basic | 92.15% | 93.05% | -0.96% | 1.42x |
| BERT-Large CoLA | basic | 62.10% | 62.60% | -0.80% | 1.76x |
| BERT-Base STS-B | basic | 88.50% | 89.30% | -0.90% | 3.05x |
| BERT-Base CoLA | basic | 58.30% | 58.80% | -0.85% | 3.01x |
| BERT-Base MRPC | basic | 88.30% | 88.70% | -0.45% | 2.34x |
| BERT-Base SST-2 | basic | 90.90% | 91.90% | -1.09% | 1.64x |
| BERT-Base RTE | basic | 69.30% | 69.70% | -0.57% | 2.95x |
| BERT-Large RTE | basic | 72.90% | 72.60% | 0.41% | 2.38x |
| BERT-Large QNLI | basic | 91.00% | 91.80% | -0.87% | 2.25x |
| ResNet50 V1.5 | bayesian | 75.60% | 76.10% | -0.66% | 2.76x |
| ResNet18 | bayesian | 69.50% | 69.80% | -0.43% | 2.61x |
| ResNet101 | bayesian | 77.00% | 77.40% | -0.52% | 2.64x |
Notices and Disclaimers
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.