With the latest release of TensorFlow, 2.16.1, AI developers can see improved real-time inference performance for convolutional neural network (CNN) models that use the float32 data type. In previous releases, when TensorFlow ran 2D convolution through the Intel® oneAPI Deep Neural Network Library (oneDNN), the weights (filters) were reorganized into a blocked memory layout. Blocked layouts provide better cache utilization and vectorization, but for real-time use cases (batch size = 1) the overhead of reordering the weights from their planar format (kernel height, kernel width, number of input channels, number of output channels) can become a bottleneck in 2D convolution execution. This release of oneDNN adds support for non-blocked weights in forward convolution, eliminating that reorder overhead, so developers may see improved CNN model performance for real-time use cases.
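To make the reorder overhead concrete, here is a minimal NumPy sketch of the kind of transformation involved: converting a planar HWIO convolution filter into a channel-blocked layout. This is an illustration only, not oneDNN's actual API or its exact blocked format; the block size of 8 and the axis ordering are assumptions chosen for clarity.

```python
import numpy as np

def to_blocked(filter_hwio, block=8):
    """Reorder a planar HWIO filter into a hypothetical channel-blocked
    layout (O_blocks, I_blocks, H, W, i_block, o_block). Skipping this
    copy for every batch-size-1 inference is the saving described above."""
    h, w, i, o = filter_hwio.shape
    assert i % block == 0 and o % block == 0, "channels must be divisible by block"
    x = filter_hwio.reshape(h, w, i // block, block, o // block, block)
    # ascontiguousarray materializes the reordered copy in memory,
    # which is the actual cost of the reorder step.
    return np.ascontiguousarray(x.transpose(4, 2, 0, 1, 3, 5))

# A 3x3 filter with 16 input and 32 output channels.
f = np.arange(3 * 3 * 16 * 32, dtype=np.float32).reshape(3, 3, 16, 32)
blocked = to_blocked(f)
print(blocked.shape)  # (4, 2, 3, 3, 8, 8)
```

With a large batch, this one-time copy amortizes over many multiply-accumulates; at batch size 1 it can dominate, which is why keeping the weights in their planar format helps real-time inference.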
The table below shows the relative performance improvement of several CNN models on TensorFlow 2.16 compared to 2.15^. The system configuration details are at the bottom of this blog. All models ran faster on 2.16, with the MobileNet_v3 variants showing the largest relative gains. Where the gain varies with model size, it is given as a range.
| Model Variants | Performance Improvement |
| --- | --- |
| EfficientNet v2 | 1.04x to 1.26x |
| EfficientNet | 1.08x to 1.30x |
| BiT-small ResNet50x1 | 1.02x |
| Inception_v3 | 1.22x |
| Inception_resnet_v2 | 1.21x |
| ResNet v1 | 1.16x to 1.32x |
| ResNet v2 | 1.20x |
| NASNet | 1.10x to 1.22x |
| P-NASNet large | 1.03x |
| MobileNet_v2 | 1.17x to 1.22x |
| MobileNet_v3 | 1.28x to 1.35x |
Next Steps
If you use real-time inference with TensorFlow, consider upgrading to version 2.16.1 for an automatic performance boost.
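A typical way to upgrade and confirm the running version (a setup sketch; exact commands depend on your environment and package manager):

```shell
# Upgrade to the release discussed in this post.
pip install --upgrade tensorflow==2.16.1

# Confirm the installed version.
python -c "import tensorflow as tf; print(tf.__version__)"
```

No code changes are needed: the non-blocked weight path is used automatically by the oneDNN-enabled build.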
We encourage you to check out Intel’s AI Tools and framework optimizations and learn about the open, standards-based oneAPI multiarchitecture, multivendor programming model that forms the foundation of Intel’s AI software portfolio.
Resources
- Image Classification with TensorFlow Hub
- Understanding Memory Formats: Blocked Layout
- Optimizing TensorFlow for 4th Gen Intel Xeon Processors
Product and Performance Information
^ Intel® Xeon® Platinum 8481C CPU @ 2.70GHz with 176GB (11x16GB), hyperthreading on, Intel Turbo Boost disabled, Debian GNU/Linux 11 (bullseye), kernel 6.0.0-0.deb11.6-amd64;
The system used is a C3 GCP (Google Cloud Platform) instance, with a 4th Gen Intel Xeon Scalable processor: 1-socket 44 V-CPUs (Virtual CPU) (22 physical cores on one socket with hyperthreading);
Software: Python 3.9.2, TensorFlow (v2.15.0 and v2.16.1), tested by Intel on March 18, 2024.
Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex.