
Effective Weight-Only Quantization for Large Language Models with Intel® Neural Compressor

Ramya_Ravi

Quantize Large Language Models with Just a Few Lines of Code

This article was originally published on medium.com

Posted on behalf of:

Mengni Wang, Xin He, Yuwen Zhou, Yiyang Cai, Kaokao Lv, Suyue Chen, Wenhua Cheng and Haihao Shen, Intel Corporation

As large language models (LLMs) become more prevalent, there is a growing need for quantization methods that maintain accuracy while reducing computational costs. Compared to traditional INT8 quantization of both activations and weights, weight-only quantization (WOQ) offers a better tradeoff between performance and accuracy.

To support WOQ, Intel® Neural Compressor provides unified APIs for state-of-the-art approaches such as GPTQ [1], AWQ [2], and TEQ [3], as well as the simple yet effective round-to-nearest (RTN) approach:

Table 1. Weight-only quantization approaches supported by Intel Neural Compressor
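To make the "simple yet effective" RTN idea concrete, here is a minimal, illustrative sketch of group-wise, symmetric 4-bit round-to-nearest quantization in PyTorch. It is a conceptual example only, not Intel Neural Compressor's implementation, and the function name and defaults are our own.

```python
import torch

def rtn_quantize_int4(weight: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    """Illustrative group-wise, symmetric 4-bit round-to-nearest (RTN) quantization.

    Assumes a 2-D weight whose last dimension is a multiple of group_size.
    Returns the dequantized weight so the rounding error can be inspected.
    """
    out_features, in_features = weight.shape
    w = weight.reshape(out_features, in_features // group_size, group_size)

    # One scale per group: map the group's max magnitude onto the INT4 range [-8, 7].
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7)          # INT4 codes
    return (q * scale).reshape(out_features, in_features)   # dequantized weight

# Quick check of the rounding error on a random linear-layer weight
w = torch.randn(4096, 4096)
print((w - rtn_quantize_int4(w)).abs().mean())
```

GPTQ, AWQ, and TEQ refine this basic recipe with calibration data or learned transformations, which is why they typically recover more accuracy at the same bit width.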

Besides the basic support for the original algorithms, Intel Neural Compressor has made considerable enhancements in quantization productivity (e.g., model coverage and new hardware support), helping customers accelerate LLM inference deployment. For example:

  • AWQ: improved model and architecture coverage with expanded hardware support
  • GPTQ: improved model and architecture coverage with more comprehensive calibration support
  • TEQ: a new approach inspired by AWQ with a trainable equivalent transformation that searches for the optimal quantization scaling factors

Intel Neural Compressor provides default quantization APIs for beginners and more flexible APIs for advanced users. The following sample code shows how to enable WOQ quantization:

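As a minimal sketch of the beginner-level flow (assuming the Intel Neural Compressor 2.x `PostTrainingQuantConfig` API; `model` is a placeholder for a PyTorch LLM loaded elsewhere, and the exact defaults may differ from the comments):

```python
from neural_compressor import PostTrainingQuantConfig, quantization

# model: a PyTorch LLM loaded elsewhere, e.g., via transformers (placeholder).
conf = PostTrainingQuantConfig(approach="weight_only")  # enable WOQ with default settings (RTN)
q_model = quantization.fit(model, conf)
q_model.save("./saved_int4_model")  # serialize the quantized model
```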

Refer to the documentation for detailed WOQ capabilities.
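For the data-driven algorithms (GPTQ, AWQ, TEQ), the more flexible API lets you choose the algorithm, bit width, and group size per op and supply calibration data. The following is a hedged sketch, again assuming the 2.x configuration schema; `model` and `calib_dataloader` are placeholders, and the documentation remains the authoritative reference for field names.

```python
from neural_compressor import PostTrainingQuantConfig, quantization

conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {                       # apply to all weight-bearing ops
            "weight": {
                "bits": 4,            # 4-bit weights
                "group_size": 128,    # scaling granularity
                "scheme": "asym",     # asymmetric quantization
                "algorithm": "GPTQ",  # or "AWQ" / "TEQ" / "RTN"
            },
        },
    },
)
# calib_dataloader: yields tokenized calibration batches (placeholder).
q_model = quantization.fit(model, conf, calib_dataloader=calib_dataloader)
```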

We validated 20+ LLMs on PyTorch and ONNX Runtime with 4-bit WOQ. All models reach accuracy comparable to, or even better than, traditional INT8 quantization:

Table 2. Accuracy results for Llama 2 models (see configuration details below <1>)

Accuracy and perplexity are measured on Lambada-OpenAI, a popular dataset available in LM-Evaluation-Harness. Table 2 shows that INT4 accuracy reaches at least 99% of FP32 accuracy for all the Llama 2 models. Moreover, INT4 models reduce model size by up to 8x, making LLM inference possible on memory-constrained devices (e.g., client systems) and generative AI more accessible to everyone. For more details on all the validated models, refer to this link for PyTorch models and this link for ONNX models.
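For reference, the Lambada-OpenAI numbers can be reproduced with LM-Evaluation-Harness. A rough sketch follows, with the caveat that the backend name and result keys vary across harness versions and the checkpoint below is only an example:

```python
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                                  # Hugging Face causal-LM backend (name varies by version)
    model_args="pretrained=meta-llama/Llama-2-7b-hf",   # example checkpoint
    tasks=["lambada_openai"],
)
print(results["results"]["lambada_openai"])             # accuracy and perplexity
```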

Note: INT4 Llama 2 ONNX models are available on Hugging Face.
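Such a model can then be run through ONNX Runtime via Optimum. Here is a minimal sketch in which the repository ID is a hypothetical placeholder rather than the actual published model:

```python
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

repo_id = "Intel/llama-2-7b-int4-onnx"  # hypothetical placeholder repo ID
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = ORTModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("Weight-only quantization makes LLM inference", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```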

Conclusion

We recently released Intel Neural Compressor v2.3, which offers WOQ features. We encourage you to try this out if you are looking for effective LLM quantization. You can also submit pull requests, issues, or questions on GitHub. Visit Intel Neural Compressor to learn more and get started.

We are committed to providing state-of-the-art LLM quantization techniques in Intel Neural Compressor and continue exploring new quantization recipes. The source code of SignRound [4], one of our works, will be publicly available soon.

We encourage you to check out Intel’s other AI Tools and Framework optimizations and learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio.

References

[1] Frantar, Elias, et al. “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers.” arXiv preprint arXiv:2210.17323 (2022).

[2] Lin, Ji, et al. “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.” arXiv preprint arXiv:2306.00978 (2023).

[3] Cheng, Wenhua, et al. “TEQ: Trainable Equivalent Transformation for Quantization of LLMs,” preprint under review (2023).

[4] Cheng, Wenhua, et al. “Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs.” arXiv preprint arXiv:2309.05516 (2023).

<1> Hardware Configuration: Intel® Xeon® Platinum 8480+ processor, two sockets with 56 cores per socket, 2048 GB RAM (16 slots/128 GB/4800 MHz), HT: on. OS: Ubuntu* 22.04.2 LTS; Testing date: 08/18/2023; Software Configuration: Python 3.8, NumPy 1.24.4, ONNX Runtime 1.15.1, ONNX 1.14.0, Optimum 1.11.2.dev0, Transformers 4.32.0.dev0.

Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex.

About the Author
Product Marketing Engineer bringing cutting edge AI/ML solutions and tools from Intel to developers.