Intel Labs researchers Moshe Wasserblat, Daniel Fleischer, and Moshe Berchansky collaborated on RAG-FiT and fastRAG. For SetFit, Intel Labs researchers Daniel Korat, Oren Pereg, and Moshe Wasserblat collaborated with research partners at Hugging Face and UKP Lab.
Highlights
- Intel Labs supports the AI developer community with open source AI frameworks, including the RAG-FiT and fastRAG libraries for optimizing retrieval-augmented generation (RAG) in GenAI models, and the SetFit tool for handling text classification with limited labels.
- Available under Apache 2.0 licenses, all three frameworks provide code, functions, and resources, including comprehensive examples and easy installation instructions.
- The three libraries are designed for efficient deployment on Intel hardware. In addition, the Optimum Intel open source library offers several techniques to accelerate models.
Generative AI (GenAI) is changing the way enterprises do business. These advancements have helped organizations enhance productivity, introduce new products, and improve operational efficiency. Over the past three years, Intel Labs has supported the AI developer community by releasing open source AI frameworks, including the RAG-FiT and fastRAG libraries for optimizing retrieval-augmented generation in GenAI models, and the SetFit tool for handling text classification of data with limited labels. Available under Apache 2.0 licenses, all three frameworks provide code, functions, and resources for building and deploying AI applications more efficiently.
Released in 2024, RAG-FiT integrates data creation, training, inference, and evaluation into a single workflow, assisting in the creation of data-augmented datasets for training and evaluating large language models (LLMs) using RAG. Introduced in 2023, fastRAG is designed to empower researchers and developers with a comprehensive toolset for streamlining RAG pipelines to reduce latency and improve the throughput of retrieval and indexing. Finally, SetFit, a popular release from 2022 developed with partners Hugging Face and UKP Lab, offers a proven solution that achieves high accuracy in text classification using few or no labeled examples. Widely adopted by the AI developer community, SetFit has approximately 250,000 downloads per month from Hugging Face. The Hugging Face Hub offers roughly 2,000 SetFit models, and that number continues to grow, with an average of four new models added per day.
Using RAG to Improve Model Accuracy and Provide Industry-Relevant Data
Despite their impressive capabilities, LLMs have inherent limitations. These models can produce plausible-sounding but incorrect or nonsensical answers, struggle with factual accuracy, lack access to up-to-date information after their training cutoff, and have difficulty attending to relevant information in large contexts. In addition, without the integration of specific enterprise and industry data, they often fail to provide the most value for business needs.
Retrieval-augmented generation bridges this gap by creating customized LLMs that can securely generate relevant business insights by tailoring AI outputs to the company’s market. RAG enhances LLM performance by integrating external information using retrieval mechanisms. When a user enters a query, RAG can retrieve context-rich information from multiple custom-built knowledge bases — often constructed from confidential company data such as health records, business plans, internal API documentation, customer history, and more. The consistent flow of user query, retrieval, and context incorporation in RAG applications creates a RAG pipeline.
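To make that flow concrete, here is a minimal, self-contained sketch of a RAG loop. The toy keyword-overlap retriever and stubbed generator are illustrative assumptions only; a production pipeline would use an embedding model, a vector database, and a real LLM.

```python
# Minimal sketch of the RAG flow: user query -> retrieval -> context
# incorporation -> generation. The keyword-overlap retriever and the
# stubbed generator below are toy stand-ins, not part of any library.

KNOWLEDGE_BASE = [
    "The Q3 business plan targets a 12% increase in services revenue.",
    "Internal API v2 requires an OAuth2 bearer token on every request.",
    "Customer history shows most support tickets concern billing.",
]

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    terms = set(query.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(terms & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def generate(prompt: str) -> str:
    """Stand-in for an LLM call; a real system would query a model here."""
    return f"[LLM response grounded in a prompt of {len(prompt)} characters]"

query = "What does the business plan target?"
context = "\n".join(retrieve(query))
print(generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"))
```

In a real deployment, each of these stubs is replaced by the retrieval and generation components described below.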
By retrieving specific data from knowledge bases outside the model, this process can effectively address knowledge limitations. This in turn can reduce hallucinations, improve the relevance of generated content, provide interpretability, and improve cost efficiency. Furthermore, recent research indicates that fine-tuning LLMs for RAG can achieve state-of-the-art performance, surpassing larger proprietary models. Overall, RAG can improve text and video search, enable more dynamic and interactive document exploration, and even allow chatbots to reference detailed documents.
RAG-FiT: Rapid Prototyping and Experimentation
RAG-FiT is a Python-based framework that serves as an end-to-end experimentation environment, enabling researchers and developers to quickly prototype and test different RAG techniques. The framework supports easy experimentation with data selection, aggregation and filtering, retrieval, text processing, document ranking, few-shot generation, prompt design using templates, fine-tuning, inference, and evaluation.
Figure 1. In the RAG-FiT framework, the data augmentation module saves RAG interactions into a dedicated dataset, which is then used for training, inference, and evaluation.
The backbone of the library consists of four modules: data creation, training, inference, and evaluation. By combining these modules into a single workflow, RAG-FiT enables users to easily generate datasets and train RAG models using company or other specialized knowledge bases. The library assists in creating data to train models using parameter-efficient fine-tuning (PEFT), which allows users to fine-tune a subset of parameters in a model without retraining all the model’s weights and biases.
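As an illustration of what PEFT looks like in practice, the sketch below attaches a LoRA adapter using the Hugging Face peft library. The base model and hyperparameter choices are illustrative assumptions, not RAG-FiT defaults.

```python
# Sketch of parameter-efficient fine-tuning (PEFT) with a LoRA adapter.
# Model choice and hyperparameters here are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the LoRA updates
    target_modules=["q_proj", "v_proj"],  # adapt only the attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights will train
```

Only the small adapter matrices are updated during training, which is what allows fine-tuning on RAG-augmented datasets without retraining all of the model's weights and biases.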
The RAG-FiT library can help users measure improved LLM performance using various RAG-specific metrics to assess retrieval accuracy and generative quality. To demonstrate the effectiveness of the RAG-FiT framework (formerly known as RAG Foundry), Intel Labs researchers augmented and fine-tuned Llama 3 and Phi-3 models with diverse RAG configurations, showing consistent improvements across three knowledge-intensive question-answering datasets: TriviaQA, ASQA, and PubMedQA.
fastRAG: Optimizing RAG Pipelines
fastRAG is a research framework for building efficient and optimized RAG pipelines, incorporating state-of-the-art LLMs and information retrieval. It is fully compatible with Haystack and includes novel RAG modules designed for efficient deployment on Intel hardware, achieving lower latency with comparable accuracy. The fastRAG GitHub repository provides extensive documentation on each component available in the framework, comprehensive examples, and easy installation instructions for optimized backends. The framework uses optimized extensions to popular deep learning frameworks such as PyTorch. One such extension is Optimum Intel, an open source library that extends the Hugging Face Transformers library.
The fastRAG library can be used for generative tasks such as question answering, summarization, dialogue systems, and content creation, while utilizing information-retrieval components to anchor LLM output in company or other specialized knowledge bases. An application is represented by a pipeline, typically composed of a knowledge base, a retriever, a ranker, and a reader, which is usually an LLM that “reads” the query and retrieved documents and then generates an output. Researchers can experiment with different architectures and models, benchmarking the results for performance and latency.
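Because fastRAG is Haystack-compatible, such a pipeline can be assembled from standard Haystack components, with fastRAG substituting Intel-optimized equivalents where available. The sketch below uses plain Haystack 2.x components with illustrative content, wiring a knowledge base and retriever into a prompt builder whose output would feed a reader LLM; it is an assumption-based sketch, not a fastRAG recipe.

```python
# Sketch of a Haystack-style pipeline: knowledge base -> retriever -> prompt.
# Uses plain Haystack 2.x components; fastRAG can swap in Intel-optimized
# equivalents. Document content and the template are illustrative.
from haystack import Document, Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

store = InMemoryDocumentStore()
store.write_documents(
    [Document(content="fastRAG optimizes RAG pipelines for Intel hardware.")]
)

template = """Answer based on the documents below.
{% for doc in documents %}{{ doc.content }}
{% endfor %}Question: {{ question }}"""

pipe = Pipeline()
pipe.add_component("retriever", InMemoryBM25Retriever(document_store=store))
pipe.add_component("prompt", PromptBuilder(template=template))
pipe.connect("retriever.documents", "prompt.documents")

question = "What does fastRAG do?"
result = pipe.run({"retriever": {"query": question},
                   "prompt": {"question": question}})
print(result["prompt"]["prompt"])  # the grounded prompt a reader LLM would consume
```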
Additionally, fastRAG supports vector databases, which are useful for handling unstructured data such as text and images. With advancements in natural language processing and machine learning, RAG information retrieval has shifted toward denser representations using embeddings. Embedding models are useful for many applications such as retrieval, reranking, clustering, and classification. To build a knowledge base, data is ingested, processed into blocks, and then passed through an embedding model to convert the data blocks into vector representations that capture semantic relationships. The resulting vectorized data is stored in a scalable vector database to enable efficient retrieval. The shift from sparse to dense representations has significantly improved the performance and precision of retrieval systems.
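The following sketch shows that ingest-embed-retrieve flow end to end with the sentence-transformers library. The model name, sample chunks, and in-memory NumPy index are illustrative assumptions standing in for a scalable vector database.

```python
# Sketch of dense retrieval: chunk, embed, index, then search by cosine
# similarity. The in-memory NumPy matrix stands in for a vector database.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Patient records are retained for seven years.",
    "The internal API rate limit is 100 requests per minute.",
    "Enterprise customers receive quarterly business reviews.",
]

# Ingest: convert each data block into a vector capturing its semantics.
index = model.encode(chunks, normalize_embeddings=True)

# Query: embed the question and rank chunks by cosine similarity
# (a dot product, since the vectors are normalized).
query_vec = model.encode(["How long are patient records kept?"],
                         normalize_embeddings=True)
scores = index @ query_vec[0]
print(chunks[int(np.argmax(scores))])
```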
SetFit: Efficient Few-Shot Learning Without Prompts
Few-shot learning using pretrained language models (PLMs) has emerged as a promising solution for dealing with data that has few or no labels. Training a model on a dataset without sufficient labels can lead to overfitting, where the model memorizes the training data instead of identifying patterns. This causes poor generalization on unseen data, making the model unreliable in real-world applications.
SetFit, an efficient prompt-free framework for few-shot fine-tuning of Sentence Transformers (ST), is ideal for text classification with limited labels. SetFit works by first fine-tuning a pretrained ST model on a small number of text pair examples using a contrastive Siamese method, in which two identical networks compare the semantic similarity and dissimilarity between sentence pairs. The resulting model then generates rich text embeddings, which are used to train a text classification head to convert the model's output into a format suitable for classification tasks.
This simple framework requires no prompts or verbalizers. Current techniques for few-shot fine-tuning require handcrafted prompts or verbalizers to convert examples into a format suitable for the underlying language model. SetFit dispenses with prompts altogether by generating rich embeddings directly from a small number of labeled text examples. For example, with only eight labeled examples in the Customer Reviews (CR) sentiment dataset from the SentEval benchmark, SetFit is competitive with a model fine-tuned on the full training set of 3,000 examples, even though that fully fine-tuned baseline is three times larger.
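A minimal training run based on SetFit's documented quickstart looks like the sketch below. The eight-example CR setup mirrors the scenario above, though the sampling and hyperparameters here are illustrative assumptions.

```python
# Sketch of few-shot SetFit training: contrastive fine-tuning of a Sentence
# Transformer body, then fitting a classification head on its embeddings.
# Hyperparameters and the 8-per-class sampling are illustrative assumptions.
from datasets import load_dataset
from setfit import SetFitModel, Trainer, TrainingArguments, sample_dataset

dataset = load_dataset("SetFit/SentEval-CR")
train_ds = sample_dataset(dataset["train"], label_column="label", num_samples=8)

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

trainer = Trainer(
    model=model,
    args=TrainingArguments(batch_size=16, num_epochs=1),
    train_dataset=train_ds,
)
trainer.train()

# The fine-tuned model predicts labels directly, with no prompts involved.
print(model.predict(["terrible battery life", "the camera is amazing"]))
```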
In addition, SetFit doesn't require large-scale models to achieve high accuracy. As a result, it is typically an order of magnitude (or more) faster to train and run inference. Since the framework provides multilingual support, text can be classified in multiple languages by simply fine-tuning a multilingual checkpoint.
By optimizing a SetFit model with Optimum Intel, users can accelerate inference by 7.8x on Intel Xeon CPUs. The Optimum Intel open source library includes several techniques to accelerate models, such as low-bit quantization, model weight pruning, distillation, and an accelerated runtime. Performing a simple post-training quantization step on the SetFit model can yield substantial throughput gains.
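As a rough outline, post-training dynamic quantization with Optimum Intel's Neural Compressor integration looks like the sketch below. Treat it as an assumption-laden sketch rather than the exact recipe behind the 7.8x figure: the model path is hypothetical, and API details can differ across optimum-intel versions.

```python
# Hedged sketch of post-training dynamic quantization with Optimum Intel's
# Intel Neural Compressor integration. Applying it to SetFit means quantizing
# the underlying Sentence Transformer body; the model path is hypothetical.
from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel import INCQuantizer
from setfit import SetFitModel

setfit_model = SetFitModel.from_pretrained("path/to/finetuned-setfit")  # hypothetical path
transformer_body = setfit_model.model_body[0].auto_model  # underlying transformer

# Dynamic quantization requires no calibration dataset.
quantization_config = PostTrainingQuantConfig(approach="dynamic")
quantizer = INCQuantizer.from_pretrained(transformer_body)
quantizer.quantize(quantization_config=quantization_config,
                   save_directory="setfit-int8")
```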