Key Takeaways
- Learn how Intel technology accelerates low-precision inference of deep learning workloads on 2nd and 3rd Gen Intel Xeon Scalable processors.
- A step-by-step tutorial on using the Intel Low Precision Optimization Tool to quickly develop low-precision inference solutions on Intel platforms.
Deep neural networks (DNNs) show state-of-the-art (SOTA) accuracy in a wide range of computation tasks. However, they still face challenges in industrial deployment due to the high computational complexity of inference. Low precision is one of the key techniques being actively studied to overcome this problem. With hardware acceleration support, low-precision inference can compute more operations per second, reduce memory access pressure, make better use of the cache, and deliver higher throughput and lower latency.
In this document, we introduce the Intel® Low Precision Optimization Tool, which aims to help Intel customers deploy low-precision inference solutions easily and rapidly across multiple deep learning frameworks (TensorFlow, PyTorch, and MXNet).
Introduction
Intel® Low Precision Optimization Tool is an open-source Python library that helps users quickly deploy low-precision inference solutions on popular deep learning frameworks, including TensorFlow, PyTorch, and MXNet. Intel® Low Precision Optimization Tool v1.0 alpha was released recently, featuring:
· Built-in tuning strategies, including Basic, Bayesian, and MSE
· Built-in evaluation metrics, including TopK (image classification), F1 (NLP), and CocoMAP (object detection)
· Built-in tuning objectives, including Performance, Model Size, and Footprint
· Extensible API design for adding new strategies, framework backends, metrics, and objectives
· KL-divergence calibration for TensorFlow and MXNet
· Resuming the tuning process from a checkpoint
Intel® DL Boost
Intel® DL Boost is built into 2nd Gen Intel® Xeon® Scalable processors. Based on Intel® Advanced Vector Extensions 512 (Intel® AVX-512), Intel® DL Boost Vector Neural Network Instructions (VNNI) deliver a significant performance improvement by combining three instructions into one, thereby maximizing the use of compute resources, utilizing the cache better, and avoiding potential bandwidth bottlenecks.
Starting with 2nd Gen Intel® Xeon® Scalable processors, Intel® DL Boost provides a theoretical peak speedup of 4x for INT8 inference compared to FP32 inference. Developers can use this tool to convert an FP32 trained model into an INT8 quantized model. The INT8 model benefits from Intel® DL Boost acceleration when it replaces the FP32 model for inference on 2nd Gen Intel® Xeon® Scalable processors.
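To make the instruction fusion concrete, the sketch below (plain NumPy, purely illustrative; the values are made up, and the VNNI instruction itself is emitted by the optimized framework kernels, not by user code) emulates the arithmetic of one fused multiply-accumulate step: four u8 × s8 products summed into a single 32-bit accumulator lane. A 512-bit register holds 16 such lanes, which is where the theoretical 4x INT8 throughput over FP32 comes from.
import numpy as np

# Four unsigned 8-bit activations and four signed 8-bit weights per 32-bit lane
activations = np.array([12, 200, 7, 90], dtype=np.uint8)
weights = np.array([-3, 5, 127, -8], dtype=np.int8)

# One VNNI-style fused step: multiply the byte pairs and accumulate into int32
acc = int(np.dot(activations.astype(np.int32), weights.astype(np.int32)))
print(acc)  # -36 + 1000 + 889 - 720 = 1133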
Easy quantization
Intel® Low Precision Optimization Tool provides an easy way to enable quantization from scratch. Given an FP32 model for deployment, users can produce the quantized model in two steps:
· Configure the yaml file. It defines the tuning configuration and model-specific information. Here is a sample yaml file for TensorFlow MobileNet v1.0:
framework:
  - name: tensorflow
    inputs: input                               # tensorflow only
    outputs: MobilenetV1/Predictions/Reshape_1  # tensorflow only

tuning:
  metric:
    - topk: 1
  accuracy_criterion:
    - relative: 0.01
  timeout: 3600       # tuning time (seconds)
  random_seed: 9527
With these settings, the tool searches for the quantized model that delivers the best inference performance while staying within a 1% relative accuracy loss from the FP32 baseline, within a time budget of 3600 seconds.
· Call Tuner.tune(). This API is the main entry point for automatic tuning and is defined as follows:
class Tuner(object):
    def tune(self, model, q_dataloader, q_func=None,
             eval_dataloader=None, eval_func=None, resume_file=None):
The Intel® Low Precision Optimization Tool v1.0 alpha release supports two usages:
a) The user specifies the FP32 "model", the calibration dataset "q_dataloader", the evaluation dataset "eval_dataloader", and the accuracy metrics in the tuning.metric field of the yaml config file.
This is designed for seamless enablement of DL model tuning with the tool, leveraging its pre-defined accuracy metrics. We expect this to be the most common usage. It currently works well for most image classification models, and we are improving the tool to cover more workload categories.
b) The user specifies the FP32 "model", the calibration dataset "q_dataloader", and a custom "eval_func" that encapsulates the evaluation dataset and accuracy metric itself.
This is designed for ease of tuning enablement for models with custom metric evaluation or metrics not yet supported by the tool. Currently this usage works for object detection and NLP networks.
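Schematically, the two usages differ only in which evaluation inputs are handed to tune(). In the sketch below, fp32_model, calib_loader, eval_loader, my_eval_func, and conf.yaml are placeholders for illustration:
import ilit

tuner = ilit.Tuner("conf.yaml")  # conf.yaml: a tuning config file as described above

# Usage a): rely on the built-in metric declared in the yaml file
q_model = tuner.tune(fp32_model, q_dataloader=calib_loader,
                     eval_dataloader=eval_loader)

# Usage b): supply a custom evaluation function instead
q_model = tuner.tune(fp32_model, q_dataloader=calib_loader,
                     eval_func=my_eval_func)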
Example 1
Below are step-by-step instructions for enabling easy quantization for TensorFlow ResNet50 V1.5 using the first usage.
Prepare Yaml File
Copy examples/template.yaml to the working directory and keep the mandatory items. Here is the yaml file for TensorFlow ResNet50 V1.5:
framework:
  - name: tensorflow
    inputs: input_tensor
    outputs: softmax_tensor

tuning:
  metric:
    - topk: 1
  accuracy_criterion:
    - relative: 0.01
  timeout: 0
  random_seed: 9527
Here we choose the built-in topk metric and set the accuracy target to tolerate a relative 1% accuracy loss from the baseline. The default tuning strategy is basic. A timeout of 0 means the tuning stops early as soon as a tuning configuration meets the accuracy target.
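For intuition, the relative criterion can be read as the following check (a hypothetical snippet for illustration, not the tool's internal implementation; the baseline number is made up):
# Illustration of the relative accuracy criterion ("relative: 0.01")
fp32_accuracy = 0.712        # example FP32 baseline (top-1)
relative_tolerance = 0.01

def meets_criterion(int8_accuracy):
    # The INT8 model may lose at most 1% of the FP32 baseline accuracy
    return (fp32_accuracy - int8_accuracy) / fp32_accuracy <= relative_tolerance

print(meets_criterion(0.706))  # True:  0.84% relative loss
print(meets_criterion(0.700))  # False: 1.69% relative loss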
Code Changes:
1. Import the ilit Python package
2. Create a Tuner object using the yaml file
3. Invoke tuner.tune() with the calibration dataloader
import ilit

tuner = ilit.Tuner(self.args.config)
dataloader = Dataloader(self.args.data_location, 'validation',
                        RESNET_IMAGE_SIZE, RESNET_IMAGE_SIZE,
                        self.args.batch_size,
                        num_cores=self.args.num_cores,
                        resize_method='crop')
q_model = tuner.tune(self.args.input_graph, q_dataloader=dataloader,
                     eval_func=None, eval_dataloader=dataloader)
Example 2
Below are step-by-step instructions for enabling easy quantization for MXNet SSD-ResNet50 v1.0 using the second usage. This usage relies on a user-provided eval_func() to perform the evaluation.
Prepare Yaml File
Here is the yaml file for MXNet SSD-ResNet50 v1.0:
framework:
  - name: mxnet

tuning:
  accuracy_criterion:
    - relative: 0.01
  timeout: 0          # 0 means early stop
  random_seed: 9527
Here we set the accuracy target to tolerate a relative 1% accuracy loss from the baseline. The default tuning strategy is basic. A timeout of 0 means the tuning stops early as soon as a tuning configuration meets the accuracy target.
Code Changes:
1. Import the ilit Python package
2. Create a Tuner object using the yaml file
3. Implement eval_func() as shown below
4. Invoke tuner.tune() with the calibration dataloader; here we reuse the existing validation dataloader as the calibration dataloader
import mxnet as mx
import ilit

# Build the validation dataset and dataloader once; the dataloader is also
# reused as the calibration dataloader.
val_dataset, val_metric = get_dataset(args.dataset, args.data_shape)
val_data = get_dataloader(
    val_dataset, args.data_shape, args.batch_size, args.num_workers)

def eval_func(graph):
    classes = val_dataset.classes  # class names
    size = len(val_dataset)
    ctx = [mx.cpu()]
    results = validate(graph, val_data, ctx, classes, size, val_metric)
    mAP = float(results[-1][-1])
    return mAP

tuner = ilit.Tuner("./ssd.yaml")
quantized_model = tuner.tune(net, q_dataloader=val_data,
                             eval_dataloader=val_data, eval_func=eval_func)
Tuning Results
The Intel® Low Precision Optimization Tool v1.0 alpha release already supports 30 deep learning workloads, covering popular use cases including image classification, object detection, NLP, and recommendation systems. The tables below show the results on three Intel-optimized frameworks on CLX8280 with TSX disabled. For detailed reproduction steps, please refer to this link.
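The relative accuracy drop column in the tables is simply [(INT8-FP32)/FP32]; for instance, checking it against the MXNet ResNet50 V1 row:
int8_acc, fp32_acc = 76.40, 76.80
print(f"{(int8_acc - fp32_acc) / fp32_acc:+.2%}")  # -0.52%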
Future Works
We plan to add more sophisticated tuning strategies and metrics to facilitate accuracy-driven tuning more effectively. We are also exploring quantization support for more backends.
Please try this tool if you want to deploy a low-precision solution quickly. You are also very welcome to submit a feature request or an issue via ilit.maintainers@intel.com.
MXNet V1.6.x

| Model | Tuning Strategy | INT8 Tuning Accuracy | FP32 Accuracy Baseline | Relative Accuracy Drop [(INT8-FP32)/FP32] | INT8/FP32 Speedup |
|---|---|---|---|---|---|
| ResNet50 V1 | mse | 76.40% | 76.80% | -0.52% | 3.73x |
| MobileNet V1 | mse | 71.60% | 72.10% | -0.69% | 3.02x |
| MobileNet V2 | mse | 71.00% | 71.10% | -0.14% | 3.88x |
| SSD-ResNet50 | basic | 29.50% | 29.70% | -0.67% | 1.86x |
| SqueezeNet V1 | mse | 57.30% | 57.20% | 0.18% | 2.88x |
| ResNet18 | mse | 70.50% | 70.40% | 0.14% | 2.98x |
| Inception V3 | mse | 78.20% | 78.00% | 0.26% | 3.35x |
TensorFlow v1.15.2

| Model | Tuning Strategy | INT8 Tuning Accuracy | FP32 Accuracy Baseline | Relative Accuracy Drop [(INT8-FP32)/FP32] | INT8/FP32 Speedup |
|---|---|---|---|---|---|
| ResNet50 V1 | mse | 73.28% | 73.54% | -0.35% | 2.99x |
| ResNet50 V1.5 | bayesian | 75.70% | 76.26% | -0.73% | 1.95x |
| ResNet101 | basic | 76.68% | 75.58% | 1.46% | 3.03x |
| Inception V1 | basic | 69.54% | 69.48% | 0.09% | 2.18x |
| Inception V2 | basic | 74.32% | 74.38% | -0.08% | 1.69x |
| Inception V3 | basic | 76.54% | 76.90% | -0.47% | 2.02x |
| Inception V4 | basic | 79.74% | 80.12% | -0.47% | 3.40x |
| ssd_resnet50_v1 | basic | 37.80% | 38.01% | -0.55% | 1.82x |
PyTorch v1.5.0

| Model | Tuning Strategy | INT8 Tuning Accuracy | FP32 Accuracy Baseline | Relative Accuracy Drop [(INT8-FP32)/FP32] | INT8/FP32 Speedup |
|---|---|---|---|---|---|
| DLRM | basic | 80.21% | 80.27% | -0.08% | 1.87x |
| BERT-Large MRPC | basic | 87.90% | 88.30% | -0.45% | 2.38x |
| BERT-Large SQUAD | basic | 92.15% | 93.05% | -0.96% | 1.42x |
| BERT-Large CoLA | basic | 62.10% | 62.60% | -0.80% | 1.76x |
| BERT-Base STS-B | basic | 88.50% | 89.30% | -0.90% | 3.05x |
| BERT-Base CoLA | basic | 58.30% | 58.80% | -0.85% | 3.01x |
| BERT-Base MRPC | basic | 88.30% | 88.70% | -0.45% | 2.34x |
| BERT-Base SST-2 | basic | 90.90% | 91.90% | -1.09% | 1.64x |
| BERT-Base RTE | basic | 69.30% | 69.70% | -0.57% | 2.95x |
| BERT-Large RTE | basic | 72.90% | 72.60% | 0.41% | 2.38x |
| BERT-Large QNLI | basic | 91.00% | 91.80% | -0.87% | 2.25x |
| ResNet50 V1.5 | bayesian | 75.60% | 76.10% | -0.66% | 2.76x |
| ResNet18 | bayesian | 69.50% | 69.80% | -0.43% | 2.61x |
| ResNet101 | bayesian | 77.00% | 77.40% | -0.52% | 2.64x |
Notices and Disclaimers
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.