In an earlier blog series, we looked at how Intel worked with a customer, Twixor, to leverage Intel software optimizations and Intel® AMX for LLM-based chat applications. As a follow-up, Intel worked with Twixor to deploy the model on Intel® Data Center Flex GPUs and to optimize queries with retrieval augmented generation (RAG). In this two-part blog series, we will look at how Intel and Twixor deployed and optimized LLM chat with Intel GPUs and RAG.
An Introduction to Retrieval Augmented Generation
Retrieval Augmented Generation (RAG) is an advanced technique in natural language processing (NLP) that enhances the capabilities of large language models (LLMs) by integrating external knowledge sources into the text generation process. This approach addresses some of the inherent limitations of LLMs, such as outdated information, hallucinations, and lack of domain-specific knowledge.
Key Components of RAG
Retriever Component
The retriever is responsible for fetching relevant information from a large corpus of documents or a knowledge base. This is typically done using a neural retriever, such as the Dense Passage Retriever (DPR), which employs a bi-encoder architecture with BERT-based document and query encoders. The retriever converts both the query and documents into dense vector representations, allowing for efficient retrieval based on semantic similarity.
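As a concrete illustration, here is a minimal sketch of the bi-encoder retrieval step using the publicly available DPR checkpoints in Hugging Face Transformers. The two example documents are placeholders; a real deployment would encode and index a much larger corpus.

```python
# A minimal sketch of DPR-style bi-encoder retrieval using the public
# Facebook DPR checkpoints from Hugging Face Transformers.
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

ctx_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

# Placeholder documents; a production system would index a full knowledge base.
documents = [
    "RAG combines a neural retriever with a seq2seq generator.",
    "A vector index stores dense document embeddings for similarity search.",
]

with torch.no_grad():
    # Encode documents and the query into dense vector representations.
    doc_vecs = ctx_enc(**ctx_tok(documents, padding=True, return_tensors="pt")).pooler_output
    q_vec = q_enc(**q_tok("What is RAG?", return_tensors="pt")).pooler_output

# Rank documents by inner-product similarity to the query vector.
scores = torch.matmul(q_vec, doc_vecs.T).squeeze(0)
print(documents[int(scores.argmax())])
```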
Generator Component
The generator, often a pretrained sequence-to-sequence (seq2seq) model like BART, uses the retrieved documents to generate the final output. The generator conditions on both the input query and the retrieved documents to produce more accurate and contextually relevant responses. This integration helps the model generate factually correct and contextually enriched text.
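The sketch below shows the mechanics of this conditioning with a plain BART checkpoint: the query and retrieved passages are concatenated into a single encoder input. The prompt format is an illustrative assumption, and in practice the generator would be a checkpoint fine-tuned for RAG-style answering rather than the base facebook/bart-large model.

```python
# A minimal sketch of the generator step: a seq2seq model (BART here) is
# conditioned on the query plus retrieved passages. The prompt layout is an
# illustrative convention, not a fixed requirement.
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

query = "What are the benefits of RAG in NLP?"
retrieved = [
    "RAG grounds generation in retrieved documents, improving factual accuracy.",
    "External knowledge can be refreshed without retraining the language model.",
]

# Concatenate the query with the retrieved context as the encoder input.
prompt = query + " context: " + " ".join(retrieved)
inputs = tok(prompt, return_tensors="pt", truncation=True, max_length=1024)
output_ids = model.generate(**inputs, max_new_tokens=80, num_beams=4)
print(tok.decode(output_ids[0], skip_special_tokens=True))
```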
How RAG Works:
- Query Embedding: The input query is converted into a dense vector representation using a query encoder.
- Document Retrieval: The query vector is used to search a dense vector index of documents (e.g., Wikipedia), retrieving the top relevant documents.
- Response Generation: The retrieved documents are concatenated with the original query and fed into the generator model, which produces the final output (a code sketch of these three steps follows).
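These three steps are what the reference RAG implementation in Hugging Face Transformers wires together. A minimal sketch, assuming the transformers, datasets, and faiss packages are installed; the dummy dataset stands in for a full Wikipedia index to keep the example small.

```python
# A minimal sketch of the embed / retrieve / generate loop using the RAG
# classes in Hugging Face Transformers. The dummy dataset replaces the full
# Wikipedia index purely for illustration.
from transformers import RagRetriever, RagSequenceForGeneration, RagTokenizer

tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
)
model = RagSequenceForGeneration.from_pretrained(
    "facebook/rag-sequence-nq", retriever=retriever
)

# The model's question encoder embeds the query, the retriever fetches
# passages from the index, and the generator conditions on both.
inputs = tokenizer("What are the benefits of RAG in NLP?", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```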
Example Workflow:
- Input Query: "What are the benefits of RAG in NLP?"
- Retrieval: The retriever fetches relevant documents from a knowledge base.
- Generation: The generator uses these documents to produce a detailed and accurate response.
Benefits of RAG
- Enhanced Accuracy: By incorporating up-to-date and domain-specific information, RAG improves the factual accuracy of generated text.
- Reduced Hallucinations: The reliance on external knowledge sources helps mitigate the issue of hallucinations, where the model generates plausible but incorrect information.
- Adaptability: RAG models can be easily updated with new information without retraining the entire model, making them highly adaptable to changing knowledge bases.
Applications of RAG
- Open-Domain Question Answering: RAG models excel in tasks that require access to a vast amount of knowledge, such as answering questions from a large corpus of documents.
- Fact Verification: By retrieving relevant documents, RAG can verify the factual accuracy of statements, making it useful for fact-checking applications.
- Customer Support: RAG-powered chatbots can provide more accurate and contextually relevant responses by accessing up-to-date information from a company's knowledge base.
Components of a RAG Pipeline
- Pretrained Language Model: Use a generative pretrained transformer (GPT) or Bidirectional Encoder Representations from Transformers (BERT) as the base model. These models are trained on vast amounts of text data and can understand and generate humanlike text.
- Retrieval Mechanism: Implement a retrieval mechanism to fetch relevant information from a knowledge base. Techniques like Okapi BM25 or dense vector representations (embeddings) are commonly used for this purpose.
- Knowledge Base: The knowledge base can be a database, a collection of documents, or a curated set of web pages. This repository should be regularly updated to ensure the information is current and relevant.
- Embedding Models: Convert documents and user queries into numerical representations using embedding models. This step is crucial for performing relevancy searches and matching queries with the most relevant documents.
- Vector Database: Store the embeddings in a vector database optimized for fast and accurate retrieval operations. Examples include FAISS, Pinecone, and Milvus (see the sketch after this list).
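To make the embedding-model and vector-database pairing concrete, here is a minimal sketch using sentence-transformers for the embeddings and FAISS as the index. The model name and the three-document corpus are illustrative assumptions.

```python
# A minimal sketch of the embedding-model + vector-database pairing:
# sentence-transformers produces dense vectors, FAISS indexes them.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus = [
    "Intel worked with Twixor to deploy an LLM chat application.",
    "RAG augments prompts with documents retrieved from a knowledge base.",
    "A vector database stores embeddings for fast similarity search.",
]

# Embed the corpus and normalize so inner product equals cosine similarity.
doc_vecs = model.encode(corpus, normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(np.asarray(doc_vecs, dtype="float32"))

# Embed a user query and retrieve the two most similar documents.
query_vec = model.encode(["How does RAG use a knowledge base?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query_vec, dtype="float32"), k=2)
for rank, (i, s) in enumerate(zip(ids[0], scores[0]), start=1):
    print(f"{rank}. ({s:.3f}) {corpus[i]}")
```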
Steps to Build a RAG Pipeline
- Data Ingestion and Preprocessing: Collect raw data from diverse sources such as databases, documents, or live feeds. Use document loaders to handle various formats like PDFs, text files, and emails. Break down long texts into smaller segments to fit the embedding model's maximum token length. This step ensures efficient processing and retrieval.
- Embedding Generation: Convert the preprocessed text into high-dimensional vectors using embedding models like Sentence Transformers or General Text Embeddings (GTE) from Hugging Face.
- Storage in Vector Database: Store the generated embeddings in a vector database. This setup allows for efficient similarity searches and quick retrieval of relevant documents.
- Query Processing: Convert user queries into embeddings and perform a similarity search against the vector database to find the most relevant documents.
- Context Augmentation: Combine the retrieved documents with the original user query to create context-rich input for the language model. This step ensures the model has access to relevant information before generating the output.
- Text Generation: Use the pretrained language model to generate responses based on the augmented prompts. The output should be accurate, contextually relevant, and grounded in the retrieved information (an end-to-end sketch of these steps follows the list).
- Continuous Learning and Improvement: Implement a feedback mechanism to refine and improve the RAG system over time. Collect user feedback and use it to fine-tune the retrieval and generation components.
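Putting the steps together, the following is a minimal end-to-end sketch under some simplifying assumptions: the knowledge base is a single local text file (knowledge_base.txt), chunking is a fixed 500-character split, and the embedding and generation models (all-MiniLM-L6-v2 and google/flan-t5-base) are stand-ins rather than the stack used in the Twixor deployment.

```python
# A minimal end-to-end RAG sketch covering ingestion, embedding, vector
# storage, query processing, context augmentation, and generation.
# File name, chunk size, and model names are illustrative assumptions.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

# 1. Ingestion and preprocessing: split long text into fixed-size chunks.
raw_text = open("knowledge_base.txt", encoding="utf-8").read()
chunks = [raw_text[i:i + 500] for i in range(0, len(raw_text), 500)]

# 2-3. Embedding generation and storage in a vector index.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
vecs = embedder.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(vecs.shape[1])
index.add(np.asarray(vecs, dtype="float32"))

# 4. Query processing: embed the user query and retrieve the top-k chunks.
query = "What are the benefits of RAG in NLP?"
q_vec = embedder.encode([query], normalize_embeddings=True)
k = min(3, len(chunks))
_, ids = index.search(np.asarray(q_vec, dtype="float32"), k=k)
retrieved = [chunks[i] for i in ids[0]]

# 5. Context augmentation: ground the prompt in the retrieved chunks.
prompt = (
    "Answer the question using only the context.\n"
    f"Context: {' '.join(retrieved)}\nQuestion: {query}"
)

# 6. Text generation with an instruction-tuned seq2seq model.
generator = pipeline("text2text-generation", model="google/flan-t5-base")
print(generator(prompt, max_new_tokens=128)[0]["generated_text"])
```

A production pipeline would replace the fixed-size splitter with structure-aware chunking, persist the index in a managed vector database, and serve the generator on dedicated accelerators; the feedback loop in the final step would then drive retuning of both the retriever and the generator.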
In part 2 of this blog series, we will look at the LLM platforms used for RAG deployment and the Intel Flex GPUs used for Twixor's chat function.
Disclaimers:
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation.
The analysis in this document was done by VLSS and commissioned by Intel.