
Intel Labs’ Innovative Low-Rank Model Adaptation Increases Model Accuracy and Compression


J. Pablo Muñoz is a research scientist on the emerging visual AI systems team at Intel Labs, where he leads research on compression and fine-tuning techniques to improve model performance. Co-author Nikolay Lyalyushkin is a research engineer on the OpenVINO Neural Network Compression Framework team, where he leads the development of model optimization solutions; co-author Jinjie Yuan is a deep learning engineer specializing in natural language processing applications; and co-author Nilesh Jain is a principal engineer who leads research on emerging visual AI systems at Intel Labs.

Highlights

  • Intel Labs’ Neural Low-Rank Adapter Search (NLS) produces accurate models with INT4 weights and is available in OpenVINO’s Neural Network Compression Framework.
  • This solution achieves improved accuracy in downstream tasks by incorporating neural architecture search (NAS) techniques into parameter-efficient fine-tuning (PEFT).
  • By employing elastic low-rank adaptation (LoRA) adapters, researchers can identify optimal adapter configurations for compressing and fine-tuning artificial intelligence (AI) models.

Intel Labs researchers explored the synergy between parameter-efficient fine-tuning techniques and neural architecture search to enhance traditional fine-tuning and compression techniques for large language models (LLMs), yielding several solutions that improve model accuracy and efficiency. One solution, Neural Low-Rank Adapter Search (NLS), combines low-rank adaptation with neural architecture search (which automatically finds optimal model structures), offering a promising path toward more accurate and efficient AI systems. Now available in the OpenVINO Neural Network Compression Framework (NNCF), NLS-trained adapters enable the exploration of more effective adapter configurations. This allows the use of larger compressed models in resource-constrained environments while still maintaining high accuracy on targeted tasks.

LLMs require significant resources for pre-training. To improve their performance on specialized downstream tasks, these models often undergo additional fine-tuning stages, in which they are adapted to a target dataset. Parameter-efficient fine-tuning techniques, such as low-rank adapters, have been proposed to efficiently adapt these models by updating only a fraction of model weights. For NLS, researchers proposed elastic LoRA adapters for dynamic rank adjustment, enabling the fine-tuning stage to automatically determine the optimal rank for various tasks.
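To make the mechanism concrete, below is a minimal, illustrative PyTorch sketch of a standard (static) LoRA layer. It is not the implementation used by Intel Labs or the Hugging Face PEFT library; the class name, rank, and scaling values are assumptions chosen for illustration. The pre-trained weight stays frozen, and only the two small low-rank matrices are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: the pre-trained weight stays frozen and only the
    small low-rank matrices A and B are trained."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze pre-trained weights
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output of the frozen layer plus the scaled low-rank update (x A^T B^T).
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# Example: wrap a projection layer; only the small adapter matrices are trainable.
layer = LoRALinear(nn.Linear(768, 768), rank=16)
```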

The developed solutions benefit from parameter-efficient fine-tuning and the application of neural architecture search to optimize AI model structures. The proposed elastic adapters maintain multiple possible ranks and widths during training, enabling optimal low-rank search, flexible compression ratios for the frozen pre-trained weights, and adapter-guided structure removal, depending on the adapter elasticity mode, as illustrated in Figure 1. Making traditional LoRA adapters elastic opens a search space in which better model and adapter configurations can be explored.


Figure 1. The static LoRA adapter and its elastic counterparts (modes A and B) for enhancing model fine-tuning and compression.
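As a rough illustration of mode A, the sketch below extends the static adapter so that the rank can vary during training: the low-rank matrices are allocated at the maximum rank, and each training step activates only a slice of them. This is a conceptual sketch of elastic rank; the rank choices and sampling strategy are assumptions, not the exact NLS training procedure.

```python
import random
import torch
import torch.nn as nn

class ElasticLoRALinear(nn.Module):
    """Elastic (mode A) sketch: A and B are allocated at the maximum rank, and each
    forward pass uses only the first `active_rank` dimensions, so one training run
    shares weights across every rank configuration in the search space."""

    def __init__(self, base: nn.Linear, rank_choices=(4, 8, 16, 32), alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # keep pre-trained weights frozen
        self.rank_choices = rank_choices
        self.alpha = alpha
        max_rank = max(rank_choices)
        self.lora_A = nn.Parameter(torch.randn(max_rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, max_rank))
        self.active_rank = max_rank

    def sample_rank(self) -> None:
        # Called by the training loop before each step, weight-sharing super-network style.
        self.active_rank = random.choice(self.rank_choices)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        r = self.active_rank
        A, B = self.lora_A[:r], self.lora_B[:, :r]       # slice the shared adapter weights
        return self.base(x) + (x @ A.T @ B.T) * (self.alpha / r)
```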

Effective Fine-tuning and Model Compression

Elastic adapters operate at different granularities. Mode A in Figure 1 applies elasticity to the adapter's rank configuration. This is the mode used by NLS, enabling effective model fine-tuning that achieves improved accuracy compared to the original low-rank adapters. Researchers have applied these adapters to recover the accuracy of sparse models with low numerical precision. As illustrated in Figure 2, these techniques can be applied to various fine-tuning pipelines to improve the downstream accuracy of compressed models. Researchers also proposed SparsePEFT, which aligns the sparsity and weight-compression patterns of the model and its adapters to facilitate merging the adapters back into the compressed model.


Figure 2. Several compression and fine-tuning pipelines (SQFT) utilize NLS to enhance the final model’s downstream accuracy and facilitate the merging of adapters.
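The following sketch illustrates the idea behind aligning sparsity patterns when merging an adapter into a pruned weight, in the spirit of SparsePEFT: the low-rank update is masked with the base weight's sparsity pattern so the merged weight stays sparse. The function name and masking rule are illustrative assumptions, not the exact SQFT formulation.

```python
import torch

def merge_sparse_adapter(w_sparse: torch.Tensor,
                         lora_A: torch.Tensor,
                         lora_B: torch.Tensor,
                         scaling: float) -> torch.Tensor:
    """Mask the low-rank update with the pruned weight's sparsity pattern before
    merging, so the merged weight keeps the zeros introduced by pruning."""
    mask = (w_sparse != 0).to(w_sparse.dtype)    # 1 where the base weight survived pruning
    delta = (lora_B @ lora_A) * scaling          # dense low-rank update, shape (out, in)
    return w_sparse + delta * mask               # update only the non-pruned positions
```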

Elastic adapter mode B, shown on the right of Figure 1, applies elasticity to additional dimensions, such as the width of the adapters, enabling smaller adapters and allowing their configurations to guide the slicing of portions of the model's frozen parameters (pre-trained weights). Researchers have applied this modality of elastic adapters to guide and improve the efficiency of neural architecture search on large pre-trained models. Figure 3 illustrates this approach, called Low-rank Neural Architecture Search (LoNAS), which compresses AI models while fine-tuning them for downstream tasks. LoNAS creates weight-sharing super-networks for both adapters and frozen layers, achieving efficient fine-tuning while enabling model compression. The elasticity of the adapters guides the removal of components from the base model, effectively making the original layers elastic and suitable for the structured removal of subcomponents. The resulting models exhibit inference acceleration.


Figure 3. LoNAS: Another Intel Labs solution for efficient fine-tuning and compression of large pre-trained models that utilizes elastic adapter mode B to guide neural architecture search and obtain compressed and fine-tuned models with inference acceleration.
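The sketch below gestures at how an elastic adapter's chosen width could guide structured slicing of a frozen layer, as in mode B and LoNAS. The channel-importance heuristic and function name are assumptions for illustration; LoNAS's actual selection criteria and super-network mechanics are more involved.

```python
import torch
import torch.nn as nn

def slice_layer_by_adapter(base: nn.Linear,
                           lora_A: torch.Tensor,
                           lora_B: torch.Tensor,
                           keep_out: int) -> nn.Linear:
    """Use a signal derived from the adapter to keep only `keep_out` output channels
    of the frozen layer, merging the adapter update into the surviving weights."""
    delta = lora_B @ lora_A                                   # adapter update, shape (out, in)
    importance = delta.abs().sum(dim=1)                       # illustrative per-channel score
    keep = importance.topk(keep_out).indices.sort().values    # channels to retain

    sliced = nn.Linear(base.in_features, keep_out, bias=base.bias is not None)
    with torch.no_grad():
        sliced.weight.copy_(base.weight[keep] + delta[keep])  # merge adapter, then slice
        if base.bias is not None:
            sliced.bias.copy_(base.bias[keep])
    return sliced
```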

Neural Low-Rank Adapter Search in OpenVINO’s Neural Network Compression Framework

NLS (elastic adapter mode A) has been implemented in OpenVINO's NNCF, allowing users to fine-tune and compress their models using these techniques. Figure 4 illustrates the fine-tuning pipeline in NNCF that produces a quantized, fine-tuned model for a downstream task. Table 1 shows the results obtained using this pipeline on 11 large language models and four downstream tasks. NLS often outperforms static LoRA adapters, and by applying a set of heuristics, it avoids the optional adapter configuration search stage.


Figure 4. Fine-tuning pipeline in NNCF utilizes NLS and absorbable adapters to enhance the final model’s downstream accuracy for models with INT4 weights.


Table 1. Results from the NLS pipeline in NNCF on 11 large language models and four downstream tasks. NLS often outperforms vanilla LoRA adapters. NLS applies a set of heuristics to avoid an optional but costlier adapter configuration search that might yield even better results.
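To convey the flavor of the heuristic configuration pick that lets NLS skip the optional search stage, here is a hypothetical sketch: instead of exploring all per-layer rank assignments, only a few uniform candidates are evaluated and the best is kept. The candidate set and the `evaluate` callback are assumptions; the actual heuristics in NNCF are not reproduced here.

```python
def pick_rank_config(evaluate, rank_choices=(8, 16, 32), num_layers=12):
    """Evaluate a handful of uniform per-layer rank assignments (all-min, all-median,
    all-max) and keep the best, instead of searching the full configuration space.
    `evaluate` is assumed to map a per-layer rank list to a validation score."""
    ranks = sorted(rank_choices)
    candidates = [
        [ranks[0]] * num_layers,                  # smallest adapters everywhere
        [ranks[len(ranks) // 2]] * num_layers,    # median rank everywhere
        [ranks[-1]] * num_layers,                 # largest adapters everywhere
    ]
    return max(candidates, key=evaluate)
```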

Learn more about this research in our AAAI, EMNLP, and COLING papers. Our research has also been featured in MarkTechPost. Try NLS to obtain accurate models with INT4 weights in OpenVINO's Neural Network Compression Framework.


References

Hu, E. J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; and Chen, W. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations. https://openreview.net/pdf?id=nZeVKeeFYf9

White, C.; Safari, M.; Sukthanker, R.; Ru, B.; Elsken, T.; Zela, A.; Dey, D.; and Hutter, F. 2023. Neural Architecture Search: Insights from 1000 Papers. https://arxiv.org/abs/2301.08727

Mangrulkar, S.; Gugger, S.; Debut, L.; Belkada, Y.; Paul, S.; and Bossan, B. 2022. PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods. https://github.com/huggingface/peft

About the Author
J. Pablo Muñoz is an AI Research Scientist in the Systems, Software, and Architecture Research Group at Intel Labs. His research interests primarily include techniques to improve model efficiency and performance. Pablo has led the design and development of solutions that combine parameter-efficient fine-tuning (PEFT) with neural architecture search (NAS) and have been integrated into Intel's OpenVINO NNCF. His research has been published at top-tier conferences, including AAAI, NeurIPS, NAACL, EMNLP, and AutoML. Pablo has also contributed to projects in large-scale video analytics, intelligent agentic middleware systems for the dynamic allocation of computer vision pipelines, large-scale 3D skeletal reconstruction, and video curation. He received his Ph.D. from the City University of New York, where he led the design and development of award-winning localization systems that assist visually impaired individuals in reaching indoor destinations.