
Accelerating Language Models: Intel and Microsoft Collaborate to Bring Efficient LLM Experiences


Authors: Szymon Marcinkowski and Hariharan Srinivasan

In the rapidly evolving landscape of AI, the quest for efficient inference solutions on client platforms is ever-present. With Large Language Models (LLMs) demanding ever more compute and memory, Intel and Microsoft have joined forces to enable LLM workloads on a vast range of Intel client platforms. For developers working within the Microsoft ecosystem, DirectML is a natural choice and extension of their development toolchain for AI workloads. Microsoft and Intel have collaborated to optimize DirectML GPU support for INT4 AWQ (Activation-aware Weight Quantization) compression. This collaboration enables running small language models (SLMs) and LLMs with DirectML on all Intel® Arc™ Graphics and Intel® Iris® Xe Graphics.

Unlocking the potential of INT4 AWQ for LLMs in Windows

DirectML with support for INT4 AWQ is a significant milestone on the journey toward efficient AI acceleration at the edge for demanding LLM workloads such as Llama2-7b, Phi-2, or Mistral-7b. By applying AWQ, LLMs can be processed far more efficiently on a broader set of Intel systems.

Benefits of INT4 AWQ weights:

  • Decreased memory footprint.
  • Enables systems with smaller RAM sizes.
  • Lower memory bandwidth requirements.
  • Increased inference performance.
  • Internal computations performed in full precision.
  • Minimal quality degradation.

 

The AWQ (Activation-aware Weight Quantization) technique (Lin et al., 2023) compresses weights down to 4 bits (from 16-bit half-precision floats). The quantization process recognizes that not all weights are equally important and protects the ~1% of salient weights, which greatly reduces quantization error. The compressed weights reduce memory usage, which in turn improves inference performance. The quality of the original model's responses is mostly preserved because computations are still performed in full precision.
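To make the compression concrete, here is a minimal sketch of group-wise INT4 weight quantization and dequantization. It deliberately omits AWQ's defining step, the activation-aware search for per-channel scales that protect salient weights, and just shows the 4-bit storage and full-precision reconstruction that the article describes; all names and sizes are illustrative.

```python
import numpy as np

def quantize_int4_groupwise(w, group_size=128):
    """Quantize a weight matrix to 4-bit integers, one scale per group.

    Simplified illustration: real AWQ additionally derives scaling
    factors from activation statistics to protect salient weights.
    """
    out_features, in_features = w.shape
    w_groups = w.reshape(out_features, -1, group_size)
    # Symmetric quantization: map each group's max magnitude onto the
    # INT4 range [-8, 7].
    scales = np.abs(w_groups).max(axis=-1, keepdims=True) / 7.0
    q = np.clip(np.round(w_groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    """Reconstruct full-precision weights for compute from INT4 storage."""
    w = q.astype(np.float32) * scales
    return w.reshape(q.shape[0], -1)

# Toy weight matrix standing in for one LLM layer.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(256, 512)).astype(np.float32)
q, scales = quantize_int4_groupwise(w)
w_hat = dequantize(q, scales)
err = float(np.abs(w - w_hat).max())
```

The INT4 codes plus a small per-group scale are what live in memory (roughly a 4x reduction versus FP16); `dequantize` shows why output quality is largely preserved, since the matrix actually used in computation is a close full-precision reconstruction.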


Figure 1. Data collected on Intel® Core™ Ultra 7 (MTL-H)

Smooth LLM User Experience on Intel® Arc™ Graphics with DirectML

Deploying LLMs efficiently is still a significant challenge, particularly when it comes to converting a model from its training framework into an optimized version tailored for performant inference. Microsoft's Olive toolchain is a robust solution that streamlines these processes while ensuring a smooth user experience. With a simple, intuitive workflow, users can produce optimized ONNX versions of their models, ready to execute in the ONNX Runtime environment. These models can then be consumed by the DirectML Execution Provider, which is accelerated on Intel GPUs.
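As a sketch of that last step, an Olive-produced ONNX model is consumed through ONNX Runtime by requesting the DirectML execution provider. The import is guarded so the sketch degrades gracefully where the DirectML build of ONNX Runtime (the `onnxruntime-directml` package) is not installed, and the model path shown in the comment is illustrative, not from the article.

```python
import importlib.util

# The DirectML build of ONNX Runtime exposes "DmlExecutionProvider";
# check what this environment actually offers before loading a model.
if importlib.util.find_spec("onnxruntime") is not None:
    import onnxruntime as ort
    providers = ort.get_available_providers()
else:
    providers = []  # onnxruntime not installed in this environment

print(providers)

# Loading an Olive-optimized model onto the DirectML EP would then look
# like this (the model path is a placeholder):
# session = ort.InferenceSession(
#     "phi-2-int4-awq/model.onnx",
#     providers=["DmlExecutionProvider", "CPUExecutionProvider"],
# )
```

Listing `CPUExecutionProvider` after `DmlExecutionProvider` gives ONNX Runtime a fallback when a GPU is unavailable.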

Microsoft has recently updated its Olive examples to show how to perform optimization and quantization (using Intel® Neural Compressor) for a broad set of LLMs, including:

  • Llama2-7b
  • Mistral-7b
  • LLava-7b
  • Openchat-7b-3.5
  • Phi-2
  • Phi-3-mini

Intel is proud to announce DirectML support for the latest revolutionary small language model, Phi-3 (4K and 128K context lengths), on systems starting from 11th Gen Intel® Core™ processors. This cost-effective language model is a significant step forward for AI inference at the edge: its INT4 AWQ version requires only up to 2GB of memory, which makes it accessible to a wide range of devices, including compute- and memory-limited environments.
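The 2GB figure follows from simple arithmetic, assuming Phi-3-mini's publicly stated ~3.8B parameter count (a number not given in this article):

```python
# Back-of-envelope weight memory for Phi-3-mini (~3.8B parameters assumed).
params = 3.8e9
fp16_gb = params * 2 / 1e9    # FP16: 2 bytes per weight
int4_gb = params * 0.5 / 1e9  # INT4: 4 bits = 0.5 bytes per weight
print(f"FP16 ~{fp16_gb:.1f} GB, INT4 ~{int4_gb:.1f} GB")
```

At 4 bits per weight the model's weights come to roughly 1.9 GB (plus a small overhead for the per-group quantization scales), versus about 7.6 GB in FP16, consistent with the "up to 2GB" claim.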

Intel® Graphics Driver Support

With the newest graphics driver update, Intel is proud to enable DirectML for FP16 and INT4-AWQ LLM models on a broad spectrum of integrated and discrete GPUs, including integrated GPUs starting from 11th Gen Intel Core processors and discrete Intel® Arc™ Graphics GPUs. By using the computation power of integrated GPUs and DirectML, users can now harness the potential of LLMs across a diverse range of existing devices, from laptops to desktops, without the need for specialized hardware or costly upgrades.

The Intel® Graphics Driver with this support for Intel® Arc™ Graphics (version 31.0.101.5522 or later) can be downloaded here.

What’s next?

The journey of optimization and innovation never ends. The future promises exciting possibilities for enhanced performance, and Intel, together with Microsoft, remains dedicated to supporting DirectML workloads and pushing performance to the boundaries of what is achievable. Users can expect more Intel driver updates with enhanced functionality and performance uplifts for a broad set of emerging workloads.

 

Figure 1 Configuration data:

Intel® Core™ Ultra: Measured on an MTL-H Intel internal platform with 32GB (2x 16GB) DDR5 5600 MHz memory, Intel graphics driver 101.5382, Windows 11 Pro version 22621, the Performance power policy, and core isolation disabled. Tested by Intel on April 10, 2024.

Notices & Disclaimers

Performance varies by use, configuration, and other factors. Learn more on the Performance Index site

Performance results are based on testing as of the dates shown in configurations and may not reflect all publicly available updates. See the configuration details above. No product or component can be absolutely secure.

Your costs and results may vary. 

Intel technologies may require enabled hardware, software, or service activation.

© Intel Corporation.  Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.  Other names and brands may be claimed as the property of others.