Neural Compressor: Boosting AI Model Efficiency

Freddy_Chiu
Employee

Authors: Tai Huang, Haihao Shen, Suyue Chen, Feng Tian, Mengni Wang, Yuwen Zhou, Saurabh Tangri, Freddy Chiu

In the age of the AI PC, AI-infused applications will become the norm, and developers are increasingly replacing traditional code fragments with AI models. This accelerating trend is unleashing exciting user experiences, enhancing productivity, providing new tools for creators, and enabling seamless and natural collaborative experiences.

To meet the compute demands of these models, AI PCs provide the foundational building blocks for these experiences through a combination of the CPU, GPU (Graphics Processing Unit), and NPU (Neural Processing Unit). However, to take full advantage of an AI PC and each of these compute engines, and to deliver the best possible user experience, developers need to compress their AI models, which is a non-trivial task. To help tackle this problem, Intel is proud to announce that we are embracing the open-source community and have made the Neural Compressor utility available under the ONNX project.

What is the Neural Compressor?

Neural Compressor aims to provide popular model compression techniques inherited from Intel Neural Compressor. It is a simple yet sophisticated utility designed to optimize neural network models represented in the Open Neural Network Exchange (ONNX) format. As the leading open standard for AI model representation, ONNX allows seamless interoperability across different frameworks and platforms. Now, with Neural Compressor, we take ONNX to the next level.
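
For a rough sense of the workflow, here is a minimal sketch of post-training quantization of an ONNX model. It is written against the Intel Neural Compressor 2.x-style Python API that this project inherits from; the module and class names used below (neural_compressor, PostTrainingQuantConfig, quantization.fit) and the file names are assumptions and may differ in the onnx/neural-compressor package.

    # A minimal sketch, assuming the Intel Neural Compressor 2.x-style API;
    # names and defaults may differ in the onnx/neural-compressor package.
    from neural_compressor import PostTrainingQuantConfig, quantization

    # "dynamic" post-training quantization needs no calibration data;
    # "static" would additionally require a calibration dataloader.
    conf = PostTrainingQuantConfig(approach="dynamic")

    q_model = quantization.fit(model="model_fp32.onnx", conf=conf)
    q_model.save("model_int8.onnx")  # quantized, still ONNX-compliant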

Why Does It Matter?

As AI continues to permeate our daily lives, efficiency becomes paramount. Whether you’re building recommendation engines, natural language processors, or computer vision applications, squeezing the most out of your hardware resources is crucial. The Neural Compressor achieves this by:

  • Reducing Model Footprint: Smaller models mean faster inference, lower memory consumption, and quicker deployment, characteristics that are critical for running your AI-powered app on the AI PC without compromising performance. In cloud and server environments, smaller models also mean less data transfer, lower latency, and higher throughput, which translates into cost savings.
  • Faster Inference: Neural Compressor optimizes model weights, prunes unnecessary connections, and quantizes parameters. This translates into lightning-fast inference on the AI acceleration capabilities embedded in Intel CPUs (Intel DL Boost), GPUs (Intel XMX), and NPUs (Intel AI Boost) on Intel Core Ultra; a small measurement sketch follows this list.
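
To make these gains measurable, the small harness below compares file size and average CPU latency of a model before and after compression. It uses plain ONNX Runtime rather than Neural Compressor itself, and the file names and the input tensor name "input" are placeholders for your own model.

    # Compare footprint and latency of an FP32 model vs. its quantized copy.
    # Plain ONNX Runtime; file names and the input tensor name are placeholders.
    import os
    import time
    import numpy as np
    import onnxruntime as ort

    def benchmark(path, feed, runs=50):
        sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
        sess.run(None, feed)                            # warm-up
        start = time.perf_counter()
        for _ in range(runs):
            sess.run(None, feed)
        latency_ms = (time.perf_counter() - start) / runs * 1000
        return os.path.getsize(path) / 1e6, latency_ms

    feed = {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}
    for path in ("model_fp32.onnx", "model_int8.onnx"):
        size_mb, latency_ms = benchmark(path, feed)
        print(f"{path}: {size_mb:.1f} MB, {latency_ms:.2f} ms per inference")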

Benefits for AI PC Developers

  • Faster Prototyping: Model quantization and compression are hard! Neural Compressor helps developers iterate quickly on model architectures through developer-friendly APIs that make it easy to apply state-of-the-art quantization techniques such as SmoothQuant and 4-bit weight-only quantization (see the sketch after this list).
  • Improved User Experience: Your AI-powered applications will respond swiftly, delighting users with seamless interactions.
  • Easy Deployment: ONNX-compliant output models enable out-of-the-box deployment on CPU, GPU, and NPU with native Windows APIs.
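
As an illustration of the techniques named above, the sketch below requests SmoothQuant and 4-bit weight-only quantization through the Intel Neural Compressor 2.x-style configuration this project inherits from. The option and recipe names (smooth_quant, weight_only, bits, group_size) and the input name "input" are assumptions that may differ in the onnx/neural-compressor package.

    # A hedged sketch of SmoothQuant and 4-bit weight-only quantization,
    # assuming the Intel Neural Compressor 2.x-style API; option names may differ.
    import numpy as np
    from neural_compressor import PostTrainingQuantConfig, quantization

    class CalibLoader:
        # Tiny placeholder calibration loader yielding (inputs, label) pairs;
        # use a few hundred representative samples in practice.
        batch_size = 1
        def __iter__(self):
            for _ in range(10):
                yield {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}, None

    # SmoothQuant: migrate activation outliers into the weights before INT8 quantization.
    sq_conf = PostTrainingQuantConfig(
        approach="static",
        recipes={"smooth_quant": True, "smooth_quant_args": {"alpha": 0.5}},
    )
    quantization.fit("model_fp32.onnx", conf=sq_conf,
                     calib_dataloader=CalibLoader()).save("model_int8_sq.onnx")

    # 4-bit weight-only quantization: weights drop to 4 bits, activations stay in floating point.
    woq_conf = PostTrainingQuantConfig(
        approach="weight_only",
        op_type_dict={".*": {"weight": {"bits": 4, "group_size": 32, "algorithm": "RTN"}}},
    )
    quantization.fit("model_fp32.onnx", conf=woq_conf).save("model_int4_woq.onnx")

Either output remains a standard ONNX file, so it can be served with ONNX Runtime on the CPU, GPU, or NPU execution provider of your choice.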

What’s Next?

As part of the ONNX project, we look forward to collaborating with the developer community and building on the synergies in the ONNX ecosystem. Visit the Neural Compressor repository on GitHub and try the tool: https://github.com/onnx/neural-compressor.