
Enable Mixed Precision in TensorFlow* 2.13 via Environment Variable

SusanK_Intel1
Employee

One way to make deep learning models run faster during training and inference, while also using less memory, is to take advantage of mixed precision. Mixed precision allows a model built with the 32-bit floating-point (FP32) data type to perform part of its computation in the BFloat16 (BF16) data type alongside FP32. Starting with TensorFlow 2.12, TensorFlow* developers can take advantage of Intel® Advanced Matrix Extensions (Intel® AMX) on 4th Gen Intel® Xeon® Scalable processors through the existing mixed-precision support.
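
For context, the existing Keras mixed-precision support is enabled from the model script by setting a global BF16 compute policy. The snippet below is a minimal sketch with an illustrative toy model, not one of the benchmarked workloads:

import tensorflow as tf

# Run layer computation in BF16 while keeping variables in FP32.
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")

# Any model built after the policy is set inherits it.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")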

Intel AMX has two primary components: tiles and tiled matrix multiplication (TMUL). The tiles store large amounts of data in eight two-dimensional registers, each one kilobyte in size. TMUL is an accelerator engine attached to the tiles that provides instructions to compute larger matrices in a single operation.
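
Whether a given machine exposes Intel AMX can be confirmed before benchmarking by checking the CPU feature flags; the following is a minimal, Linux-specific sketch:

# Linux only: 4th Gen Intel Xeon Scalable processors report amx_tile, amx_bf16, and amx_int8 flags.
with open("/proc/cpuinfo") as f:
    flags = f.read()

for feature in ("amx_tile", "amx_bf16", "amx_int8"):
    print(feature, "available" if feature in flags else "not found")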

In TensorFlow 2.13, a new environment variable, TF_SET_ONEDNN_FPMATH_MODE, is introduced to enable mixed-precision computation on hardware powered by 4th Gen Intel® Xeon® Scalable processors. When this variable is set to BF16 (TF_SET_ONEDNN_FPMATH_MODE=BF16), the Intel® oneDNN library inside TensorFlow performs reduced-precision math on high-cost operations such as convolution and matrix multiplication. The conversion between FP32 and BF16 is handled inside the oneDNN library. The environment variable can be added as a prefix to the model run command, as shown below.

TF_SET_ONEDNN_FPMATH_MODE=BF16 python mnist_convnet.py
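
If prefixing the launch command is not convenient, the variable can instead be exported in the shell or set from within the launcher script. A minimal sketch in Python, assuming the setting is made before TensorFlow is imported so it is visible when oneDNN initializes:

import os

# Set the oneDNN floating-point math mode before importing TensorFlow.
os.environ["TF_SET_ONEDNN_FPMATH_MODE"] = "BF16"

import tensorflow as tf  # imported after the environment variable is set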

We benchmarked several inference models from the Intel Model Zoo and compared performance with the environment variable TF_SET_ONEDNN_FPMATH_MODE=BF16 against the FP32 baseline. The benchmark covered both throughput and latency. In the throughput case, samples are processed in batches on one NUMA node or one socket, and higher throughput is better; in the latency case, a single sample is processed at a time, usually on 4 cores, and lower latency is better. The benchmark results for relative throughput and latency gains¹ are presented below in Figures 1 and 2, respectively.

Figure 1. Relative Throughput Gain with Environment Setting vs. FP32 (throughput performance improvement via TF_SET_ONEDNN_FPMATH_MODE=BF16)

Figure 2. Relative Latency Gain with Environment Setting vs. FP32 (latency performance improvement via TF_SET_ONEDNN_FPMATH_MODE=BF16)

We at Intel are delighted to be part of the TensorFlow community and appreciate the collaborative relationship with our colleagues on Google’s TensorFlow team as we implemented this new feature.

The environment variable introduced here gives users a new option for taking advantage of BF16 capability on the Intel CPU platform. The guide, Getting Started with Mixed Precision Support in oneDNN Bfloat16, details the different ways to enable BF16 mixed precision in TensorFlow. Compared to the other enablement options, the environment setting is easy to deploy because it does not require any change to the model script. However, the performance gain may not be as large as the gain from updating the model script to enable the auto-mixed precision (AMP) grappler pass. For the full BF16 performance benefit, AMP remains the recommended option, while the environment setting can be used when updating the model script is not possible.
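
For comparison, the AMP grappler pass is enabled from the model script through the graph optimizer options. A minimal sketch, assuming a TensorFlow build that exposes the auto_mixed_precision_onednn_bfloat16 option:

import tensorflow as tf

# Ask grappler to rewrite eligible CPU ops to BF16 via the oneDNN auto-mixed-precision pass.
tf.config.optimizer.set_experimental_options(
    {"auto_mixed_precision_onednn_bfloat16": True}
)

# Functions traced into graphs after this point can be rewritten by the pass.
@tf.function
def dense_matmul(a, b):
    return tf.linalg.matmul(a, b)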

Intel releases its newest optimizations and features in Intel® Extension for TensorFlow* before upstreaming them into the official TensorFlow release. Intel® Extension for TensorFlow* targets the Intel® Data Center GPU Max Series and Intel® Data Center GPU Flex Series. Experimental support is available for 4th Gen Intel® Xeon® processors, HBM, and Intel® Arc™ A-Series GPUs. Download the Quick Get Started guide.

Next Steps

Try out TensorFlow 2.13 and see for yourself the performance benefits of AMX support for mixed-precision training and inference.

For more details about 4th Gen Intel Xeon Scalable processors, visit AI Platform, where you can learn how Intel is empowering developers to run high-performance, efficient, end-to-end AI pipelines.

Resources

About our Expert

Susan is a Product Marketing Manager for AIML at Intel. She has her Ph.D. in Human Factors and Ergonomics, having used analytics to quantify and compare mental models of how humans learn complex operations. Throughout her well-rounded career, she has held roles in user-centered design, product management, customer insights, consulting, and operational risk.

Product and Performance Information

¹Hardware system - Google Compute Engine
Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz processor, 88 cores (44 cores/socket, 2 sockets), HT On, Turbo Off, Total Memory 352 GB (16 slots/ 16 GB/2000MHz ), Google BIOS, microcode 0xffffffff, Debian GNU/Linux 11 (bullseye), 6.0.0-0.deb11.6-amd64, gcc (Debian 10.2.1-6) 10.2.1 20210110, nvme_card-pd 1000G; Tested by Intel on Jul 20, 2023.
Software – TensorFlow r2.13
Workload - inference benchmark from Intel Model Zoo on ResNet50 v1.5, 3D U-Net MLPerf, MobileNet v1, SSD-ResNet34, BERT Large SQuAD, Transformer MLPerf
²Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex.

 
