
Intel Labs Introduces RAG-FiT Open-Source Framework for Retrieval Augmented Generation in LLMs


Scott Bair is a key voice at Intel Labs, sharing insights into innovative research for inventing tomorrow’s technology.

Highlights

  • Intel Labs introduces RAG-FiT, an open-source framework for augmenting large language models for retrieval-augmented generation use cases.
  • Available under an Apache 2.0 license, RAG-FiT integrates data creation, training, inference, and evaluation into a single workflow to create data-augmented datasets for training and evaluating LLMs.
  • The Python-based framework serves as an end-to-end experimentation environment, enabling users to quickly prototype and experiment with different RAG techniques.

Intel Labs introduces RAG-FiT, an open-source framework for augmenting large language models (LLMs) for retrieval-augmented generation (RAG) use cases. Available under an Apache 2.0 license, RAG-FiT integrates data creation, training, inference, and evaluation into a single workflow, assisting in the creation of data-augmented datasets for training and evaluating LLMs in RAG settings. This integration enables rapid prototyping and experimentation with various RAG techniques, allowing users to easily generate datasets and train RAG models using internal or specialized knowledge sources.

The library assists in creating data to train models using parameter-efficient fine-tuning (PEFT), which allows users to fine-tune only a subset of a model's parameters. The Python-based framework is designed to serve as an end-to-end experimentation environment, enabling users to quickly prototype and experiment with different RAG techniques, including data selection, aggregation and filtering, retrieval, text processing, document ranking, few-shot generation, prompt design using templates, fine-tuning, inference, and evaluation.
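To make the PEFT idea concrete, here is a minimal sketch (not RAG-FiT's own API) that attaches LoRA adapters to a causal language model with the Hugging Face peft library. The model identifier and target modules are illustrative assumptions; only the small adapter matrices end up trainable while the base weights stay frozen.

```python
# Minimal PEFT sketch: LoRA adapters via the Hugging Face peft library.
# Illustrative only -- not RAG-FiT's training code; the model id and target
# modules are assumptions and depend on the model being fine-tuned.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

lora_config = LoraConfig(
    r=16,                                   # rank of the low-rank update matrices
    lora_alpha=32,                          # scaling applied to the adapter output
    target_modules=["qkv_proj", "o_proj"],  # attention projections to adapt (model dependent)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters is trainable
```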

To demonstrate the effectiveness of the RAG-FiT framework (formerly known as RAG Foundry), Intel Labs researchers augmented and fine-tuned Llama 3.0 and Phi-3 models with diverse RAG configurations, showcasing consistent improvements across three knowledge-intensive question-answering tasks.

Using RAG Systems to Address LLM Limitations

Despite their impressive capabilities, LLMs have inherent limitations. These models can produce plausible-sounding but incorrect or nonsensical answers, struggle with factual accuracy, lack access to up-to-date information after their training cutoff, and have difficulty attending to relevant information in long contexts.

RAG enhances LLM performance by integrating external information through retrieval mechanisms. Retrieving specific data from knowledge bases outside the model can effectively address knowledge limitations, which in turn can reduce hallucinations, improve the relevance of generated content, provide interpretability, and be far more cost-efficient. Furthermore, recent research indicates that fine-tuning LLMs for RAG can achieve state-of-the-art performance, surpassing that of larger proprietary models.
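As a rough illustration of the retrieve-then-generate pattern described above, the framework-agnostic sketch below prepends retrieved passages to the prompt so the model can ground its answer in external knowledge. The retriever and llm objects are hypothetical placeholders, not RAG-FiT components.

```python
# Framework-agnostic retrieve-then-generate sketch; `retriever` and `llm` are
# hypothetical placeholders standing in for a dense retriever and an LLM client.
def answer_with_rag(question, retriever, llm, top_k=3):
    passages = retriever.search(question, k=top_k)     # look up external knowledge
    context = "\n\n".join(p.text for p in passages)    # concatenate retrieved evidence
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm.generate(prompt)                        # generation grounded in retrieved text
```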

How RAG-FiT Works

As an experimentation environment for researchers, the backbone of the RAG-FiT library consists of four distinct modules: data creation, training, inference, and evaluation. Each module is encapsulated and controlled by a configuration file, ensuring compatibility between the output of one module and the input of the next. This modular approach allows each step to be isolated and experimented with independently, enabling the production of multiple outputs and the concurrent execution of numerous experiments. Evaluation can be conducted on the generated outputs as well as on any feature within the data, including retrieval, ranking, and reasoning.

Figure 1. In the RAG-FiT framework, the data augmentation module saves RAG interactions into a dedicated dataset, which is then used for training, inference, and evaluation.

Dataset creation: The processing module facilitates the creation of context-enhanced datasets by persisting RAG interactions, which are essential for RAG-oriented training and inference. These interactions encompass dataset loading, column normalization, data aggregation, information retrieval, template-based prompt creation, and various other forms of pre-processing. The processed data can be saved in a consistent, model-independent format, along with all associated metadata, ensuring compatibility and reproducibility across different models and experiments.
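The snippet below is a hypothetical illustration of that idea rather than the module's real interface: each example keeps its templated prompt, retrieved document ids, and metadata, and is written out as one JSON line so any model can consume it later.

```python
# Hypothetical dataset-creation sketch (not RAG-FiT's processing API): persist the
# retrieved context, the templated prompt, and metadata in a model-independent format.
import json

PROMPT_TEMPLATE = (
    "Answer the question based on the context.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)

def build_rag_example(question, gold_answer, retriever, top_k=3):
    docs = retriever.search(question, k=top_k)          # hypothetical retriever object
    context = "\n\n".join(d.text for d in docs)
    return {
        "prompt": PROMPT_TEMPLATE.format(context=context, question=question),
        "answer": gold_answer,
        "retrieved_docs": [d.id for d in docs],         # kept for retrieval/ranking metrics
        "metadata": {"template": "qa_with_context", "top_k": top_k},
    }

def save_dataset(examples, path):
    with open(path, "w") as f:
        for ex in examples:                             # one JSON object per line
            f.write(json.dumps(ex) + "\n")
```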

The processing module supports handling multiple datasets at once through global dataset sharing. This feature allows each step of the pipeline to access any of the loaded datasets, enhancing flexibility and allowing for complex processing procedures. Furthermore, the module includes step caching, which caches each pipeline step locally, improving compute efficiency and making results easy to reproduce.
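A simple way to picture step caching is to key each step's output on a hash of its configuration, so an unchanged step is reloaded from disk instead of recomputed. The sketch below illustrates the general idea and is not the library's actual mechanism.

```python
# Illustrative per-step caching: output keyed by a hash of the step configuration.
# Not the framework's real implementation -- just the general idea.
import hashlib, json, os, pickle

CACHE_DIR = ".step_cache"

def run_cached(step_name, step_config, step_fn):
    os.makedirs(CACHE_DIR, exist_ok=True)
    key = hashlib.sha256(json.dumps(step_config, sort_keys=True).encode()).hexdigest()[:16]
    path = os.path.join(CACHE_DIR, f"{step_name}-{key}.pkl")
    if os.path.exists(path):                 # same config as before: reuse cached output
        with open(path, "rb") as f:
            return pickle.load(f)
    result = step_fn(**step_config)          # otherwise run the step and cache its result
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result
```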

Training: Users can train any model on the augmented datasets. The training module fine-tunes models on the datasets created by the processing module and relies on TRL, the well-established Transformer Reinforcement Learning training framework. The module also supports advanced, efficient training techniques such as PEFT and low-rank adaptation (LoRA) to customize the LLM for specific use cases without retraining the entire model.
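For orientation, here is a hedged sketch of supervised fine-tuning on such an augmented dataset with TRL's SFTTrainer plus a LoRA adapter. Exact argument names vary across TRL versions, the model identifier and file names are illustrative assumptions, and this is not RAG-FiT's own training entry point.

```python
# Hedged TRL fine-tuning sketch (illustrative; exact arguments vary by TRL version).
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Augmented examples produced by the processing module, one JSON object per line.
train_dataset = load_dataset("json", data_files="augmented_dataset.jsonl", split="train")

peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model id
    train_dataset=train_dataset,
    peft_config=peft_config,                      # only the adapter weights are updated
    args=SFTConfig(output_dir="rag-finetuned", num_train_epochs=1),
)
trainer.train()
```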

Inference: The inference module can generate predictions using the augmented datasets with trained or untrained LLMs. Inference is conceptually separated from the evaluation step, since it is more computationally demanding than evaluation. Additionally, users can run multiple evaluations on a single prepared inference results file.
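The sketch below illustrates that separation (file names and the model path are assumptions, and this is not the module's actual code): predictions are generated once and written to a results file, which later evaluation runs can reuse without re-running generation.

```python
# Illustrative stand-alone inference pass; paths and model location are assumptions.
import json
from transformers import pipeline

generator = pipeline("text-generation", model="rag-finetuned")  # fine-tuned model dir

with open("augmented_dataset.jsonl") as src, open("generations.jsonl", "w") as dst:
    for line in src:
        example = json.loads(line)
        outputs = generator(example["prompt"], max_new_tokens=128, return_full_text=False)
        example["generated"] = outputs[0]["generated_text"]     # keep prediction with inputs
        dst.write(json.dumps(example) + "\n")
```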

Evaluation: Users can easily implement custom metrics or run the provided ones, including Exact Match (EM), F1 Score, ROUGE, BERTScore, DeepEval, Ragas, Hugging Face Evaluate, and classification metrics. Metrics can be run locally on each example or globally on the entire dataset, such as recall for classification-based metrics. In addition to input and output texts, metrics can utilize any feature in the dataset, such as retrieval results, reasoning, citations, and attributions. The evaluation module also uses a processing step called an Answer Processor, which can implement custom logic and perform many tasks, including cleaning and aligning outputs.
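As a concrete example of two of the listed metrics, the self-contained snippet below computes Exact Match and token-level F1 per example. It is a simplified stand-in; the framework's own implementations and Answer Processors may normalize text differently.

```python
# Simplified Exact Match and token-level F1, computed locally per example.
import re
from collections import Counter

def normalize(text):
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).split()

def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    pred, ref = normalize(prediction), normalize(reference)
    common = Counter(pred) & Counter(ref)      # token overlap between prediction and reference
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))                     # 1.0
print(round(f1_score("in Paris, France", "Paris"), 2))   # 0.5
```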

Performance of RAG-FiT Augmentation Techniques

To illustrate the utility of the framework, Intel Labs researchers conducted experiments involving retrieval, fine-tuning, chain-of-thought (CoT) reasoning, and a negative distractor documents technique. The team compared Llama 3.0 and Phi-3, two widely accepted baseline models, using enhancement methods across TriviaQA, PubMedQA, and ASQA, three knowledge-intensive question-answering datasets. The TriviaQA and PubMedQA datasets contain relevant context, while for the ASQA dataset, retrieval was done over a Wikipedia corpus using a dense retriever.

The team measured and reported EM for TriviaQA, STR-EM for ASQA, and accuracy and F1 Score for PubMedQA. In addition, researchers evaluated two Ragas metrics: faithfulness (the relation between the generated text and the context) and relevancy (the relation between the generated text and the query). Overall, the two models showed consistent improvements across the three knowledge-intensive question-answering tasks.

Figure 2. Evaluation results of baseline and different RAG settings for the three datasets and two models tested. In bold are the best configurations per dataset, based on the main metrics.

For TriviaQA, retrieved context improved the results and fine-tuning in the RAG setting boosted them further, but fine-tuning on CoT reasoning (which includes training on a combination of gold passages and distractor passages) decreased performance. For this dataset, the best method was model dependent. For ASQA, every method improved upon the baseline: CoT reasoning produced consistent improvement in both models, and fine-tuning the CoT configuration performed best. For PubMedQA, almost all methods improved upon the baseline (with one exception); CoT reasoning improved on the untrained RAG setting, but with fine-tuning, the RAG method performed best in both models.

Finally, the faithfulness and relevancy scores often did not correlate with the main metrics or with each other, possibly indicating that they capture different aspects of the retrieval and generation results and represent a trade-off in performance.

The results demonstrate the usefulness of RAG techniques for improving performance, as well as the need to carefully evaluate different aspects of a RAG system on a diverse set of datasets.

About the Author
Scott Bair is a Senior Technical Creative Director for Intel Labs, chartered with growing awareness of Intel's leading-edge research activities, such as AI, Neuromorphic Computing, and Quantum Computing. Scott is responsible for driving marketing strategy, messaging, and asset creation for Intel Labs and its joint-research activities. In addition to his work at Intel, he has a passion for audio technology and is an active father of 5 children. Scott has over 23 years of experience in the computing industry bringing new products and technology to market. During his 15 years at Intel, he has worked in a variety of roles spanning R&D, architecture, strategic planning, product marketing, and technology evangelism. Scott has an undergraduate degree in Electrical and Computer Engineering and a Master of Business Administration from Brigham Young University.