
Shifted and Squeezed: How Much Precision Do You Need?

Community Manager


That’s a critical question when developing efficient Deep Neural Network (DNN) models, because there is a direct tradeoff between DNN floating point precision (e.g., 32-bit, 16-bit, or 8-bit) and performance, memory footprint, power consumption, and cost. DNNs enable state-of-the-art performance on a wide variety of AI tasks, including computer vision, audio, and natural language processing.

In fact, the success of DNNs in handling complex problems has resulted in an explosion of interest, across many industries and use cases, in pushing the technology even further. One promising way to do that is to reduce the amount of memory required to perform deep neural network tasks (and hence also cut their energy consumption). Researchers have looked at a number of possible avenues for achieving this, but one method in particular stands out for potential effectiveness: reducing the numerical precision required during processing.

In machine learning, the process of training, where data sets are repeatedly processed and results generated to drive AI algorithms, is both time- and compute-intensive. Training with a larger number of parameters, while maintaining high-speed iterations, is an increasingly important trend that’s helping researchers develop better-performing DNNs.

But how can we streamline the processing required for all these calculations? One way is to reduce the precision of each value computed, from 32-bit or 16-bit down to 8-bit, without negatively affecting the results.

Why do we want lower precision? Because it saves computing cycles. In some ways, this is similar to rounding values in an equation: it takes longer to calculate a value to 10 decimal places than to 5. If the additional information in those extra places is critical, then cutting to 5 places won’t produce the results you need. But if you can find a way to get more information out of the shorter representation, or to calculate the longer one more efficiently, you can improve performance and get more training done more quickly. You’ll have hit the sweet spot balancing performance and results.


In pursuit of streamlining AI, we studied ways to create an 8-bit floating point (FP) format (FP8) using “squeezed” and “shifted” data. The study, entitled Shifted and Squeezed 8-bit Floating Point Format for Low-precision Training of Deep Neural Networks, seeks to validate performance advantages when compared to previously proposed 8-bit training methods. This format can deliver several significant advantages:

·       Eliminate the need for loss scaling, which requires significant tuning.
·       Enable more effective adjustments of gradients, activations, and weights.
·       Avoid the requirement to rely on 32-bit precision (which is mandatory in other approaches).

Several previous studies have developed techniques aimed at improved floating-point processing, and results have raised optimism that further progress is possible. For instance, Google’s bfloat16 format has the same number of exponent bits as the FP32 format, resulting in improved performance without hardware enhancements or other calculation-intensive techniques. But attempts at reducing precision to generate improved results must be carefully implemented. Specific conditions must be met, and custom tuning is necessary. As a result, there are (as of now) no one-size-fits-all solutions for manipulating FP methods.
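To make the bfloat16 point concrete, the short sketch below converts an FP32 value to a bfloat16-like value by keeping only the top 16 bits of its bit pattern. The helper name `to_bfloat16` is hypothetical, and the conversion simply truncates the low mantissa bits rather than rounding to nearest, which is a simplification chosen for brevity:

```python
import struct

def to_bfloat16(x: float) -> float:
    """Keep the sign bit, the 8 exponent bits and the top 7 mantissa bits of an FP32 value."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]   # FP32 bit pattern as an integer
    bits &= 0xFFFF0000                                    # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

tiny = to_bfloat16(1e-30)      # still nonzero: the exponent range is the same as FP32
pi_bf16 = to_bfloat16(3.14159) # coarser: only 7 explicit mantissa bits remain
```

Because the exponent field is untouched, very small and very large values survive the conversion; only the fine detail of the mantissa is lost.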

The challenges to establishing an 8-bit floating point format are significant: the dynamic range is quite limited (from 2^-16 to 2^16), while tensor distributions change over the course of training, spanning different orders of magnitude. As a result, a combination of techniques must be used to capture the full dynamic range of values for model training. One technique is Loss Scaling, which artificially increases the size of gradients so they fit in the FP8 range. Another is Stochastic Rounding, which captures some of the information otherwise discarded when values are truncated to lower precision, thereby alleviating quantization errors. Of the two, Loss Scaling is the more critical, because once the magnitude of the gradients can no longer be represented in the FP8 range, training cannot converge.
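Both techniques can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in any of the studies mentioned: `truncate_fp8` is a simplified stand-in for FP8 (2 mantissa bits, flush to zero below 2^-16, saturation at 2^16, no subnormals), the scale factor 1024 is an arbitrary illustrative choice, and all function names are hypothetical.

```python
import math
import random

def truncate_fp8(x: float) -> float:
    """Round to nearest on a simplified FP8 grid (assumed: 2 mantissa bits, range 2**-16..2**16)."""
    mag = abs(x)
    if mag < 2.0 ** -16:
        return 0.0                               # underflow: flushed to zero
    mag = min(mag, 2.0 ** 16)                    # overflow: saturate
    m, e = math.frexp(mag)                       # mag = m * 2**e with 0.5 <= m < 1
    q = math.ldexp(round(m * 8) / 8, e)          # keep 3 significant bits
    return math.copysign(min(q, 2.0 ** 16), x)

def stochastic_round_fp8(x: float, rng: random.Random) -> float:
    """Pick one of the two neighbouring grid points at random so the result is
    unbiased on average (range limits are ignored here for brevity)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(abs(x))
    scaled = m * 8
    lo = math.floor(scaled)
    q = lo + (1 if rng.random() < scaled - lo else 0)
    return math.copysign(math.ldexp(q / 8, e), x)

# Loss scaling: a gradient below 2**-16 vanishes unless it is scaled up first.
grad = 1e-6
scale = 1024.0                                   # illustrative scale factor
lost = truncate_fp8(grad)                        # underflows to 0.0
kept = truncate_fp8(grad * scale) / scale        # survives; unscaled in higher precision

# Stochastic rounding: individual results are coarse, but their average tracks the input.
rng = random.Random(0)
samples = [stochastic_round_fp8(1.1, rng) for _ in range(20000)]
mean_sr = sum(samples) / len(samples)            # close to 1.1
nearest = truncate_fp8(1.1)                      # always 1.0
```

The loss scaling half shows why the technique is the more critical of the two: without it the gradient is simply gone, whereas round-to-nearest only loses accuracy. It also hints at the tuning problem, since a scale that rescues small gradients can push large ones past the top of the range.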

It’s important to note that loss scaling requires user intervention, and that’s a serious issue. Not only do models have to be modified but, more critically, tedious empirical tuning with significant trial and error is required to find the right loss scaling schedule. This alone has slowed the spread of low-precision numerical formats throughout the industry.


To address the issues described above, and to make neural network training possible with no model modifications or hyperparameter tuning, we propose a new 8-bit floating point format. Instead of directly encoding each number, the new format pairs each tensor with two statistics, a “squeeze” factor and a “shift” factor, which rescale and move the tensor’s range before it is truncated to 8 bits.
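For readers who prefer symbols, one way to write the transformation is sketched below. The notation (X for the original tensor with N elements, Y for its transformed version, α for the squeeze factor, β for the shift factor) and the target constants (a log-magnitude mean of 0 and a maximum of 15) are assumptions made for illustration here rather than details stated in this post:

```latex
\log_2 |Y_i| = \alpha \, \log_2 |X_i| + \beta
\quad\Longleftrightarrow\quad
Y_i = \operatorname{sign}(X_i)\, 2^{\beta}\, |X_i|^{\alpha}

\mu = \frac{1}{N} \sum_{i=1}^{N} \log_2 |X_i|,
\qquad
m = \max_i \log_2 |X_i|

\alpha = \frac{15}{m - \mu},
\qquad
\beta = -\alpha \, \mu
```

With these choices the mean of log2|Y| is αμ + β = 0 and its maximum is αm + β = 15, so the transformed tensor sits inside the limited FP8 range regardless of where the original values were. Y is what gets truncated to 8 bits.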

Shifted and squeezed transformation


Our study compares two separate training instances. We used the open source TensorFlow models repository for ResNet and NCF, and the Tensor2Tensor repository for Transformer, adding S2FP8 data type simulation support to each. For any given model, the hyperparameters were kept consistent across FP32, FP8, and S2FP8 evaluations.

We compared S2FP8 training with baseline FP32 and FP8 training, with and without Loss Scaling for:

·       Residual Networks of varying depths on the CIFAR-10 and ImageNet datasets.
·       Transformer on IWSLT’15 English-Vietnamese dataset.
·       Neural Collaborative Filtering on MovieLens 1 Million dataset.

We then simulated S2FP8 by inserting appropriate truncation functions throughout the network, before and after every convolution and matrix-matrix product operation, and during both forward and backward passes. The remainder of the network was kept in FP32, and those truncations simulate the low-precision training described above. The truncation function takes a tensor X as input, computes its magnitude mean and maximum, computes appropriate values for α and β, and finally truncates X.
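The truncation function described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used in the study: the statistics are taken over log2 of the magnitudes, α and β are chosen so that the transformed log-magnitudes have a mean of 0 and a maximum of 15, and `truncate_fp8` is a simplified stand-in for FP8 (2 mantissa bits, flush to zero below 2^-16, saturation at 2^16, no subnormals). The function names and the sample values are hypothetical.

```python
import math

def truncate_fp8(x: float) -> float:
    """Simplified FP8 stand-in (assumed: 2 mantissa bits, range 2**-16..2**16)."""
    mag = abs(x)
    if mag < 2.0 ** -16:
        return 0.0                               # underflow: flushed to zero
    mag = min(mag, 2.0 ** 16)                    # overflow: saturate
    m, e = math.frexp(mag)                       # mag = m * 2**e with 0.5 <= m < 1
    q = math.ldexp(round(m * 8) / 8, e)          # keep 3 significant bits
    return math.copysign(min(q, 2.0 ** 16), x)

def s2fp8_truncate(xs, target_max=15.0):
    """Squeeze and shift xs, truncate to FP8, then map back so the loss can be inspected."""
    logs = [math.log2(abs(x)) for x in xs if x != 0.0]
    if not logs:
        return [0.0 for _ in xs]
    mu = sum(logs) / len(logs)                   # mean log-magnitude
    m = max(logs)                                # maximum log-magnitude
    alpha = target_max / (m - mu) if m > mu else 1.0   # squeeze factor (assumed form)
    beta = -alpha * mu                           # shift factor (assumed form)
    out = []
    for x in xs:
        if x == 0.0:
            out.append(0.0)
            continue
        y = 2.0 ** (alpha * math.log2(abs(x)) + beta)        # transformed magnitude
        y_hat = truncate_fp8(y)                              # value that would be held in 8 bits
        if y_hat == 0.0:
            out.append(0.0)
            continue
        x_hat = 2.0 ** ((math.log2(y_hat) - beta) / alpha)   # inverse transform
        out.append(math.copysign(x_hat, x))
    return out

grads = [1e-12, 1e-10, -1e-8, 1e-8, 1e-6, 1e-4]  # spans eight orders of magnitude
plain = [truncate_fp8(g) for g in grads]         # most entries underflow to zero
s2 = s2fp8_truncate(grads)                       # every entry survives
```

In this example plain FP8 keeps only the largest value, while the squeezed and shifted version keeps all six to within a few percent, with no scale factor to tune by hand. The sketch also shows the cost: when the original range is much wider than FP8 can hold, α drops below 1 and the rounding error grows after the inverse transform.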

We first generated results with Residual Networks of varying depths on the CIFAR-10 image recognition dataset. S2FP8 reached almost exactly the FP32 baseline, and sometimes even improved on it. A second evaluation looked at S2FP8 on the 1000-class ImageNet dataset. We also tested S2FP8 on a small Transformer (Transformer Tiny) on the English-Vietnamese dataset. S2FP8 reached the baseline with no hyperparameter tuning, whereas FP8 did not, even after extensive Loss Scaling tuning.

In addition, we tested Neural Collaborative Filtering. The results were similar: S2FP8 again easily reached the baseline out of the box, without any tuning, while FP8 came relatively close to the baseline but did not reach it.

A note on hardware: S2FP8 is a new data type and requires its own circuitry to be implemented in a tensor processing engine. However, the added overhead is minimal and affects neither data throughput nor compute speed. To convert FP32 tensors into S2FP8, two hardware components are needed. One calculates each tensor’s statistics, and the other applies the “squeeze” and “shift” factors described above.


Results show that our novel 8-bit floating point data type (S2FP8) performs competitively with state-of-the-art FP32 baselines across a range of typical networks. The squeeze and shift factors move and rescale the range of tensors prior to truncation, enabling the training of neural networks in an 8-bit format while eliminating the requirement for Loss Scaling tuning and complex hardware rounding techniques. The decrease in the number of bits per value allows larger models to fit on a single device and results in faster training.

In short, these methods can deliver a better balance of performance, memory utilization, cost, and power efficiency, while still maintaining model accuracy.

We plan future work aimed at extending the use of S2FP8 to training additional topologies. We also hope to simplify the squeeze and shift statistics from a hardware implementation point of view, and to explore whether the approach extends to a broader suite of low-precision formats, such as 8-bit posit, 4-bit floating point, and integer data types.



© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

About the Author
Mary is the Community Manager for this site. She likes to bike, and do college and career coaching for high school students in her spare time.