
Accelerating TensorFlow* Inference with Intel® Deep Learning Boost on 2nd Gen Intel® Xeon® Scalable Processors


Intel has recently introduced Intel® Deep Learning Boost (Intel® DL Boost), a new set of embedded processor technologies designed to accelerate deep learning applications. Intel DL Boost includes new Vector Neural Network Instructions (VNNI) that perform computation in 8-bit precision, which reduces memory usage by 4x and increases the rate of arithmetic operations executed per second compared to 32-bit floating point. Given a pre-trained floating point model, we obtain a quantized version of the model that exploits Intel DL Boost instructions to accelerate inference. Here, we summarize our inference work with 8-bit precision in TensorFlow* using the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN).
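As a rough illustration of where the 4x memory reduction comes from (a minimal NumPy sketch with an arbitrary tensor shape, not tied to any particular model), a weight tensor stored as signed 8-bit integers occupies a quarter of the bytes of its 32-bit float original:

```python
import numpy as np

# Hypothetical weight tensor for a 3x3 convolution with 256 input and 256 output channels.
weights_fp32 = np.random.randn(3, 3, 256, 256).astype(np.float32)

# Symmetric quantization to signed 8-bit: scale so the largest magnitude maps to 127.
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.clip(np.round(weights_fp32 / scale), -128, 127).astype(np.int8)

print(weights_fp32.nbytes / weights_int8.nbytes)  # 4.0 -> the 4x memory reduction
```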

Quantization in TensorFlow

To enable the Intel DL Boost capabilities on 2nd generation Intel® Xeon® Scalable processors, we have enhanced the Intel® Optimization for TensorFlow to support the seamless use of 8-bit inference on models already using 32-bit floating point, with no additional libraries required.
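For example, a quantized frozen graph can be loaded and executed exactly like an fp32 model. The sketch below uses the TensorFlow 1.x API; the file name and the input/output tensor names are placeholders for whatever your model actually defines:

```python
import numpy as np
import tensorflow as tf  # Intel Optimization for TensorFlow (TF 1.x API)

# Load the quantized frozen graph (path and tensor names are illustrative).
graph_def = tf.GraphDef()
with tf.gfile.GFile("resnet50_int8.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name="")

# Run inference just as with the original fp32 model.
with tf.Session(graph=graph) as sess:
    images = np.random.rand(1, 224, 224, 3).astype(np.float32)  # dummy input batch
    probs = sess.run("predict:0", feed_dict={"input:0": images})
    print(probs.shape)
```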

We have also developed the Intel Optimization for TensorFlow Quantization tool, an offline tool that converts a pre-trained 32-bit float model to a quantized model using 8-bit inference. A detailed description and guidelines for model quantization can be found at Intel AI Quantization Tools for TensorFlow. Using these tools, we were able to quantize a number of popular deep learning models, including convolutional and feedforward neural networks, while preserving a high level of accuracy, as shown in Table 1. Several ready-to-use quantized models are hosted in the Intel Model Zoo.

Quantizing Models

We have enabled post-training model quantization, which means that users can take a pre-trained floating point model and quantize it. This involves converting floating point activations and weights into 8-bit integers and replacing floating point operators in the computation graph with their quantized versions. The key steps to obtain an optimized quantized model using our tools are as follows:

  • Export the fp32 inference model as a serialized TensorFlow GraphDef: This includes saving an inference graph in protobuf format and applying graph transformations to remove redundant nodes (e.g., Identity, CheckNumerics), fold constants, and fold batch-normalization (a sketch of this step follows the list).
  • Convert the fp32 graph into a quantized graph: This step replaces fp32 ops with fused quantized ops where possible and adds the necessary conversion ops (e.g., ‘QuantizeV2’, ‘Requantize’) for activations. Weight quantization also takes place during this step.
  • Calibrate and optimize the quantized graph: This step runs the quantized graph obtained from the previous step on a small subset of the training data (or calibration data) and freezes the ranges of activations. The resulting graph is further optimized by fusing ‘Requantize’ ops.
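As an illustration of the first step, the TensorFlow 1.x Graph Transform Tool can be used to clean up a frozen fp32 inference graph before quantization. The transform names below come from that tool; the file name and the input/output node names are placeholders, and steps 2 and 3 are then handled by the quantization tool linked above:

```python
import tensorflow as tf
from tensorflow.tools.graph_transforms import TransformGraph  # TF 1.x Graph Transform Tool

# Load the frozen fp32 inference graph (file name and node names are illustrative).
graph_def = tf.GraphDef()
with tf.gfile.GFile("resnet50_fp32_frozen.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

# Remove redundant nodes, fold constants, and fold batch normalization
# before handing the graph to the quantization tool.
transforms = [
    "strip_unused_nodes",
    "remove_nodes(op=Identity, op=CheckNumerics)",
    "fold_constants(ignore_errors=true)",
    "fold_batch_norms",
    "fold_old_batch_norms",
]
optimized_def = TransformGraph(graph_def, ["input"], ["predict"], transforms)

with tf.gfile.GFile("resnet50_fp32_optimized.pb", "wb") as f:
    f.write(optimized_def.SerializeToString())
```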

To illustrate the aforementioned steps, we show the resulting subgraphs from each step in Figure 1. Note that widely used CNNs (e.g., ResNet, Inception) exhibit a repeating conv2d → batch-norm → relu op sequence. After batch-norm folding, this pattern turns into a subgraph similar to Figure 1(a), which is replaced by a fused quantized operator as shown in Figure 1(b). Further optimization, as shown in Figure 1(c), is done after calibration. Since most convolutions receive non-negative rectified input because relu is the preceding op, our quantized convolution takes an unsigned 8-bit integer tensor as input and a signed 8-bit integer tensor as the filter. This unsigned/signed combination is also important for performance, since the required arithmetic operations can be executed efficiently with the currently available Intel DL Boost instruction VPDPBUSD (for details, see the Intel Software Manual).

Figure 1: Resulting subgraphs after each of the three steps: (a) fp32, (b) 8-bit quantized, and (c) calibrated 8-bit quantized.
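To make the unsigned/signed combination concrete, the NumPy sketch below emulates the arithmetic of a single 32-bit VPDPBUSD lane: four unsigned 8-bit activations are multiplied by four signed 8-bit weights and the products are accumulated into a signed 32-bit value (this illustrates the math only; the instruction itself is generated by the underlying library, not written by hand):

```python
import numpy as np

# One 32-bit lane of VPDPBUSD: 4 x (u8 * s8) products accumulated into an s32.
activations_u8 = np.array([12, 200, 55, 130], dtype=np.uint8)  # non-negative ReLU outputs
weights_s8 = np.array([-7, 31, -128, 64], dtype=np.int8)       # signed quantized weights

acc_s32 = np.int32(0)
acc_s32 += np.sum(activations_u8.astype(np.int32) * weights_s8.astype(np.int32), dtype=np.int32)
print(acc_s32)  # partial dot product kept in 32-bit precision to avoid overflow
```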

Besides the Conv2D and MatMul ops that can exploit Intel DL Boost instructions, we also provide quantized pooling and concat ops, which significantly reduce the memory bandwidth bottleneck and avoid unnecessary quantize and dequantize ops. In fact, to get the best performance it is recommended to keep the flow of 8-bit precision ops as uninterrupted as possible.

Some pre-trained models (e.g., MobileNet) show different data distributions in their weight tensors across different channels. Having a single scale parameter for quantizing a weight tensor can cause a large accuracy loss in such cases. We mitigated this shortcoming by introducing new operators, such as `RequantizePerChannel` and `RequantizationRangePerChannel`. With this per-channel extension of our model quantization tool, we were able to recover the accuracy loss of MobileNet-related models.
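The following NumPy sketch shows the idea behind per-channel scales (illustrative only, with an arbitrary tensor shape; the actual per-channel requantization is performed by the new operators inside the TensorFlow graph): giving each output channel its own scale keeps a channel with a small dynamic range from being crushed by a shared, tensor-wide scale.

```python
import numpy as np

# Hypothetical weight tensor: (height, width, in_channels, out_channels).
weights = np.random.randn(3, 3, 32, 64).astype(np.float32)
weights[..., 0] *= 0.01   # channel 0 has a much smaller range than the others

# Per-tensor: one scale for everything; small-range channels lose most of their resolution.
scale_per_tensor = np.abs(weights).max() / 127.0

# Per-channel: one scale per output channel, computed over the remaining axes.
scale_per_channel = np.abs(weights).max(axis=(0, 1, 2)) / 127.0

q_per_tensor = np.clip(np.round(weights / scale_per_tensor), -128, 127).astype(np.int8)
q_per_channel = np.clip(np.round(weights / scale_per_channel), -128, 127).astype(np.int8)

# Reconstruction error on the small-range channel is far lower with per-channel scales.
err_t = np.abs(weights[..., 0] - q_per_tensor[..., 0] * scale_per_tensor).mean()
err_c = np.abs(weights[..., 0] - q_per_channel[..., 0] * scale_per_channel[0]).mean()
print(err_t, err_c)
```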

Accuracy and Performance

We have enabled 8-bit inference for several popular deep learning models for image classification, object detection, and recommender systems. Table 1 reports the accuracy and throughput speedup of several CNN models on 2nd gen Intel Xeon Scalable processors. As can be seen, Intel DL Boost speeds up inference significantly while keeping the accuracy very close to that of the fp32 models. [1]

Model | FP32 Top-1 Accuracy (%) (Intel Xeon Scalable) | INT8 Top-1 Accuracy (%) (2nd Gen Intel Xeon Scalable) | Throughput Speedup (2nd Gen Intel Xeon Scalable)
ResNet-50 | 74.30 | 73.75 | 3.9x
ResNet-101 | 76.40 | 75.66 | 4.0x
InceptionV3 | 76.75 | 76.51 | 3.1x

Table 1: Accuracy and performance of floating point and quantized models.

To conclude, Intel DL Boost on 2nd gen Intel Xeon Scalable processors delivers promising results for accelerating deep learning models used for computer vision, natural language, and speech processing. With our toolset, you can quantize fp32 models for improved inference performance in TensorFlow without any additional library dependencies. Check out our quantization tools and examples at intel-quantization-tool.


Notices and Disclaimers

System configuration

2-socket Intel® Xeon® Platinum 8280 Processor, 28 cores, HT On, Turbo ON, Total Memory 384 GB (12 slots / 32 GB / 2933 MHz), BIOS: SE5C620.86B.0D.01.0271.120720180605 (ucode: 0x4000013), CentOS 7.6, 4.19.5-1.el7.elrepo.x86_64, Deep Learning Framework: https://hub.docker.com/r/intelaipg/intel-optimized-tensorflow:PR25765-devel-mkl (https://github.com/tensorflow/tensorflow.git commit: 6f2eaa3b99c241a9c09c345e1029513bc4cd470a + Pull Request PR 25765, PR submitted for upstreaming), Compiler: gcc 6.3.0, MKL-DNN version: v0.17, Datatype: INT8
