With the latest release of TensorFlow, 2.16.1, AI developers can see improved real-time inference performance for convolutional neural network (CNN) models that use the float32 data type. In previous releases, when TensorFlow ran 2D convolution through the Intel® oneAPI Deep Neural Network Library (oneDNN), the weights (filters) were reorganized into a blocked memory layout. Blocked layouts provide better cache utilization and vectorization, but for real-time use cases (batch size = 1) the overhead of reordering the weights from their planar format (kernel height, kernel width, number of input channels, number of output channels) can become a bottleneck in 2D convolution execution. This release of oneDNN adds support for non-blocked weights in forward convolution, eliminating that reorder overhead, so developers may see improved CNN model performance for real-time use cases.
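To make the reorder overhead concrete, here is a minimal NumPy sketch of the kind of transformation involved: converting a planar HWIO convolution filter into a channel-blocked layout. This is an illustration only, not oneDNN's actual API or its exact blocked format; the block size of 8 and the axis ordering are assumptions chosen for clarity.

```python
import numpy as np

def to_blocked(filter_hwio, block=8):
    """Reorder a planar HWIO filter into a hypothetical channel-blocked
    layout (O_blocks, I_blocks, H, W, i_block, o_block). Skipping this
    copy for every batch-size-1 inference is the saving described above."""
    h, w, i, o = filter_hwio.shape
    assert i % block == 0 and o % block == 0, "channels must be divisible by block"
    x = filter_hwio.reshape(h, w, i // block, block, o // block, block)
    # ascontiguousarray materializes the reordered copy in memory,
    # which is the actual cost of the reorder step.
    return np.ascontiguousarray(x.transpose(4, 2, 0, 1, 3, 5))

# A 3x3 filter with 16 input and 32 output channels.
f = np.arange(3 * 3 * 16 * 32, dtype=np.float32).reshape(3, 3, 16, 32)
blocked = to_blocked(f)
print(blocked.shape)  # (4, 2, 3, 3, 8, 8)
```

With a large batch, this one-time copy amortizes over many multiply-accumulates; at batch size 1 it can dominate, which is why keeping the weights in their planar format helps real-time inference.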
The table below shows the relative performance improvement of several CNN models on TensorFlow 2.16 compared to 2.15^. The system configuration details are at the bottom of this blog. All models ran faster on 2.16, with the MobileNet_v3 variants showing the largest relative gains. Where the gain varies with model size, it is given as a range.
| Model Variants | Performance Improvement |
| --- | --- |
| EfficientNet v2 | 1.04x to 1.26x |
| EfficientNet | 1.08x to 1.30x |
| BiT-small ResNet50x1 | 1.02x |
| Inception_v3 | 1.22x |
| Inception_resnet_v2 | 1.21x |
| ResNet v1 | 1.16x to 1.32x |
| ResNet v2 | 1.20x |
| NASNet | 1.10x to 1.22x |
| P-NASNet large | 1.03x |
| MobileNet_v2 | 1.17x to 1.22x |
| MobileNet_v3 | 1.28x to 1.35x |
Next Steps
If you use real-time inference with TensorFlow, consider upgrading to version 2.16.1 for an automatic performance boost.
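A typical way to upgrade and confirm the running version (a setup sketch; exact commands depend on your environment and package manager):

```shell
# Upgrade to the release discussed in this post.
pip install --upgrade tensorflow==2.16.1

# Confirm the installed version.
python -c "import tensorflow as tf; print(tf.__version__)"
```

No code changes are needed: the non-blocked weight path is used automatically by the oneDNN-enabled build.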
We encourage you to check out Intel’s AI Tools and framework optimizations and learn about the open, standards-based oneAPI multiarchitecture, multivendor programming model that forms the foundation of Intel’s AI software portfolio.
Resources
- Image Classification with TensorFlow Hub
- Understanding Memory Formats: Blocked Layout
- Optimizing TensorFlow for 4th Gen Intel Xeon Processors
Product and Performance Information
^ Intel® Xeon® Platinum 8481C CPU @ 2.70GHz with 176GB (11x16GB), hyperthreading on, Intel Turbo Boost disabled, Debian GNU/Linux 11 (bullseye), kernel 6.0.0-0.deb11.6-amd64;
The system used is a C3 GCP (Google Cloud Platform) instance, with a 4th Gen Intel Xeon Scalable processor: 1-socket 44 V-CPUs (Virtual CPU) (22 physical cores on one socket with hyperthreading);
Software: Python 3.9.2, TensorFlow (v2.15.0 and v2.16.1), tested by Intel on March 18, 2024.
Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex.