This technical deep dive was written by Cecilia Aguerrebere and describes joint work with Alexandria Leto, Vy Vo, Ishwar Bhati, Ted Willke, and Mariano Tepper, carried out as part of their research at Intel Labs.
Summary
The previous post in this series introduced LeanVec, a framework that combines dimensionality reduction with vector quantization to accelerate vector search on high-dimensional vectors while maintaining high accuracy in settings with out-of-distribution queries. This post summarizes insights from our study on optimizing vector search settings in RAG systems and offers actionable guidelines for improving RAG pipeline efficiency and effectiveness.
Introduction
Retrieval-Augmented Generation (RAG) is a cutting-edge approach that merges the capabilities of information retrieval with large language models. At its core, RAG operates by first conducting a targeted search to fetch pertinent pieces of information from a vast dataset or knowledge base in response to a given query. This retrieved data then serves as a contextual scaffold for a generative model to synthesize detailed and informed responses.
RAG is highly relevant in today's digital landscape, where managing vast amounts of data is crucial. It enables AI to provide nuanced, context-aware interactions, enhancing chatbots and virtual assistants by delivering richer, fact-based content. A major advantage of RAG is its scalability; it can easily expand its knowledge base with updated datasets, improving performance without extensive retraining. Additionally, RAG's flexible information sources allow for domain-specific customization, offering expert responses across various fields.
The retriever's impact on RAG pipeline performance is not fully understood, leading to suboptimal use of vector search. In this blog, we summarize insights from our experimental study on optimizing vector search settings in RAG systems (for details, check out our paper). We explore configurations like the number of documents to retrieve and the trade-off between search accuracy and speed. Our goal is to offer actionable guidelines to improve RAG pipeline efficiency and effectiveness.
Figure 1. Retrieval-Augmented Generation (RAG) pipeline.
Vector Search for RAG
A RAG system consists of two main components: a retriever and a reader. The retriever creates a semantic embedding space in which queries and their relevant documents (known as gold documents) are converted into vectors that lie close to each other. Dense retriever models are preferred over traditional sparse ones because they generally produce higher-quality semantic embeddings. To find matching document embeddings for a given query, nearest neighbor search is performed over an external knowledge base. The reader, typically a large language model (LLM), uses these matched documents to enhance its responses.

Because modern embeddings are high-dimensional, often reaching 1024 dimensions or more, performing an exhaustive search is impractical for large datasets like Wikipedia's 170GB corpus. To expedite retrieval, RAG systems utilize approximate nearest neighbor (ANN) search methods, which sacrifice some search accuracy—meaning some retrieved documents might not be the exact top-k nearest neighbors—in exchange for faster results.
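To make the pipeline concrete, here is a minimal sketch of the retrieval step on toy data: random vectors stand in for the passage and query embeddings a dense retriever would produce, and hnswlib stands in for the graph-based ANN index (the experiments described later in this post use the Intel Scalable Vector Search library instead). All names, sizes, and data below are illustrative placeholders.

```python
import numpy as np
import hnswlib

rng = np.random.default_rng(0)
dim, n_docs, k = 768, 10_000, 5

# Stand-ins for passage embeddings produced by a dense retriever (e.g., BGE-base).
doc_embeddings = rng.standard_normal((n_docs, dim)).astype(np.float32)
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
passages = [f"passage text {i}" for i in range(n_docs)]  # placeholder corpus

# Build a graph-based ANN index (inner product on normalized vectors = cosine similarity).
index = hnswlib.Index(space="ip", dim=dim)
index.init_index(max_elements=n_docs, ef_construction=200, M=16)
index.add_items(doc_embeddings, np.arange(n_docs))
index.set_ef(50)  # search-time effort: higher means more accurate but slower search

# A query embedding would normally come from the same retriever model.
query_embedding = rng.standard_normal((1, dim)).astype(np.float32)
query_embedding /= np.linalg.norm(query_embedding)

labels, distances = index.knn_query(query_embedding, k=k)
context = "\n\n".join(passages[i] for i in labels[0])

# The reader LLM would then receive a prompt assembled from the retrieved context.
prompt = f"Answer the question using the context below.\n\nContext:\n{context}\n\nQuestion: ..."
```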
The Importance of the Retrieval Model
To enhance the performance of large language models in specific tasks, it is essential that the retrieved documents contain the necessary information to answer the query, as illustrated in Figure 2. This means that the query and the relevant documents must have similar vector representations within the retriever model's embedding space. If the retriever model fails to map the query and its gold documents to closely aligned vectors, even an exhaustive vector search would be ineffective: the gold documents would not be among the top-k nearest neighbors because other vectors would be closer to the query. Therefore, choosing a retrieval model that generates a rich embedding space, where gold documents are among the closest vectors to the query, is vital for the success of a RAG system, and fine-tuning the retrieval model for a specific application might be necessary.

The quality of the retriever can be assessed using queries paired with gold documents curated by human annotators. Metrics such as precision, recall, and F1 score are used to evaluate effectiveness, with higher scores indicating better alignment between queries and their gold documents. An inadequate retriever can lead to poor vector search results, potentially confusing the LLM and diminishing task performance.
Figure 2. Gold document recall, the percentage of gold documents retrieved and used in the LLM context, strongly predicts RAG downstream task performance, measured by exact-match recall (EM recall) in question-answering tasks. This highlights the importance of choosing a retrieval model that effectively retrieves relevant documents to enhance RAG performance. The RAG pipeline employs Mistral-7B-Instruct-v0.30 as the reader and BGE-base as the retrieval model, with results shown for three datasets: Natural Questions (with and without citations) and ASQA. The shaded bar indicates the maximum performance achievable using all available gold documents per query, while error bars represent 95% bootstrap confidence intervals across queries [1].
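As a reference, here is a minimal sketch of how the gold document recall used in Figure 2 could be computed, assuming each query comes with annotator-curated gold document IDs and a list of retrieved document IDs. The variable names and toy values are illustrative.

```python
def gold_document_recall(retrieved_ids, gold_ids):
    """Fraction of a query's gold documents that appear among the retrieved documents."""
    if not gold_ids:
        return None  # undefined when a query has no annotated gold documents
    return len(set(retrieved_ids) & set(gold_ids)) / len(gold_ids)

# Toy example: two queries with their retrieved top-5 and annotated gold documents.
retrieved = {"q1": [3, 17, 42, 5, 88], "q2": [7, 2, 91, 13, 60]}
gold = {"q1": {42, 99}, "q2": {7, 13}}

per_query = [gold_document_recall(retrieved[q], gold[q]) for q in retrieved]
print(sum(per_query) / len(per_query))  # average gold document recall: (0.5 + 1.0) / 2 = 0.75
```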
Optimizing Vector Search for RAG
When optimizing vector search for RAG systems, two crucial parameters are the number of documents to retrieve and the search accuracy, measured by the search recall: the percentage of true top-k nearest neighbors identified by the ANN search method. Retrieving more documents can enrich the LLM's context but also increases retrieval and inference time. High search accuracy ensures the correct top-k nearest neighbors are found, but it requires longer search times. Lower search accuracy speeds up the process by allowing earlier termination of the search and enabling more aggressive vector compression. While vector compression introduces errors in distance calculations, making the search less accurate, it offers two significant benefits: it improves memory bandwidth usage, which is vital for memory-bound implementations like graph-based search, and it reduces memory requirements, lowering overall system costs. Determining the minimum search accuracy at which a RAG system can operate effectively is thus important, as it can lead to faster search, lower memory usage, and reduced system costs.

To explore these aspects, we experimented with two instruction-tuned LLMs, Llama (Llama-2-7b-chat) and Mistral (Mistral-7B-Instruct-v0.30), across three question-answering datasets (Natural Questions, ASQA, and QAMPARI). For single-vector embeddings, we used the BGE-base retrieval model (average score of 0.533 over 15 BEIR datasets) and the Intel Scalable Vector Search library for efficient dense retrieval. For multi-vector search, we employed ColBERTv2 (average BEIR score of 0.499).
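The sketch below illustrates this accuracy/speed trade-off on synthetic data: an ANN index's search-time parameter is swept, and search recall is measured against exact (brute-force) nearest neighbors. hnswlib is used here purely as a generic stand-in (the experiments in this post use the Intel Scalable Vector Search library), and the dataset sizes are toy values.

```python
import time
import numpy as np
import hnswlib

rng = np.random.default_rng(0)
dim, n_docs, n_queries, k = 128, 50_000, 200, 10

docs = rng.standard_normal((n_docs, dim)).astype(np.float32)
queries = rng.standard_normal((n_queries, dim)).astype(np.float32)

# Exact top-k via brute-force inner product: the ground truth for measuring search recall.
exact = np.argsort(-queries @ docs.T, axis=1)[:, :k]

index = hnswlib.Index(space="ip", dim=dim)
index.init_index(max_elements=n_docs, ef_construction=200, M=16)
index.add_items(docs)

for ef in (10, 20, 50, 100, 200):
    index.set_ef(ef)  # larger ef = later termination = higher recall, slower search
    t0 = time.perf_counter()
    approx, _ = index.knn_query(queries, k=k)
    elapsed = time.perf_counter() - t0
    recall = np.mean([len(set(a.tolist()) & set(e.tolist())) / k for a, e in zip(approx, exact)])
    print(f"ef={ef:4d}  search recall@{k}={recall:.3f}  query time={elapsed * 1e3:.1f} ms")
```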
How Many Documents Should be Included in the LLM Context?
The optimal number of context documents depends on various factors, such as the reader LLM, the retriever, and the task. However, as shown in previous research and verified by our experiments, question-answering correctness starts to plateau around five to 10 documents, making this a good reference point (see results in Figure 3 below). Retrieving too many documents can introduce noise and increase prompt size, resulting in slower inference times.
Figure 3. Question-answering correctness achieved by Mistral-7B-Instruct-v0.30 with various numbers of documents retrieved with BGE-base and ColBERTv2 on different datasets. Question-answering performance starts to plateau around 5 to 10 documents. The green dashed line shows the ideal performance achieved with all gold documents included in the prompt, and the red line shows the performance without any documents. Similar results were observed for Llama-2-7b-chat [1].
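For readers who want to run this kind of sweep on their own data, the skeleton below varies the number of retrieved documents placed in the prompt and scores exact-match recall against gold short answers. The `retrieve` and `generate_answer` functions are hypothetical placeholders for a real retriever and reader LLM, and the one-example dataset is a stand-in for benchmarks such as Natural Questions or ASQA.

```python
def em_recall(predicted: str, gold_answers: list[str]) -> float:
    """Fraction of gold short answers that appear verbatim (case-insensitively) in the prediction."""
    pred = predicted.lower()
    return sum(ans.lower() in pred for ans in gold_answers) / len(gold_answers)

def retrieve(question: str, k: int) -> list[str]:
    # Placeholder: would run the dense retriever plus ANN search and return k passages.
    return [f"passage {i} for '{question}'" for i in range(k)]

def generate_answer(question: str, passages: list[str]) -> str:
    # Placeholder: would prompt the reader LLM with the question and retrieved passages.
    return "placeholder answer"

dataset = [{"question": "toy question?", "answers": ["toy answer"]}]  # stand-in dataset

for k in (1, 2, 5, 10, 20):
    scores = [em_recall(generate_answer(ex["question"], retrieve(ex["question"], k)), ex["answers"])
              for ex in dataset]
    print(f"k={k:2d}  mean EM recall={sum(scores) / len(scores):.3f}")
```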
Do Noisy Documents Help?
Previous research suggests that mixing a few gold documents with random ones can improve LLM task performance by creating a contrastive effect that aids in extracting accurate information [2]. If generalizable, this could enhance RAG pipeline performance at minimal cost. However, our experiments did not replicate these findings: no setting outperformed using only gold documents, challenging the generalizability of the earlier results.
How Accurate Does the Approximate Nearest Neighbor Search Need to Be?
The answer to this question depends on the application, the reader and retriever models, and other factors. In our experiments, reducing the search accuracy had minimal impact on how well the RAG pipeline performs, that is, on the quality of the answers provided to the given queries. This is because the retrieval rate of gold documents – the percentage of all gold documents that are among the retrieved nearest neighbors – remains consistent across search recall levels; for example, on the ASQA dataset, lowering search recall from 1.0 to 0.7 yields a similar gold document recall (roughly 39% vs. 36%), with negligible effect on overall RAG performance (see Table 1). That is, only about 39% of the gold documents are among the exact top-k nearest neighbors (search recall 1.0), and retrieving only 70% of the exact nearest neighbors (search recall 0.7) still recovers about 36% of the gold documents.
Figure 4 below demonstrates that using all available gold documents can significantly enhance RAG performance. However, if the retriever model does not rank these documents among the top-k neighbors, the vector search method will not retrieve them, even when operating at perfect search accuracy. Therefore, maintaining a search accuracy above 0.7 is unnecessary in this case. This underscores the importance of having a retriever model that effectively maps the query and as many gold documents as possible to closely aligned positions in the embedding space.
It is important to evaluate this in practice, and if lower search recall proves viable for a specific application, it should be preferred, as it allows for faster and potentially more memory-efficient retrieval through more aggressive vector compression.
Figure 4. In our experimental setup, the RAG downstream task performance is only slightly affected by search recall (percentage of exact nearest neighbors that are retrieved and injected into the LLM context). The RAG pipeline employs Mistral-7B-Instruct-v0.30 as the reader and BGE-base as the retrieval model, with results shown for three datasets: Natural Questions (with and without citations) and ASQA. The shaded bar is the ceiling performance using all gold documents available per query in the given dataset. The error bars are 95% bootstrap confidence intervals across queries [1].
| Search recall@10 | Gold document recall@10 (ASQA) | Gold document recall@10 (NQ) |
| --- | --- | --- |
| 1.0 (exact search) | 0.387 | 0.279 |
| 0.95 | 0.377 | 0.264 |
| 0.90 | 0.363 | 0.274 |
| 0.70 | 0.361 | 0.245 |
Table 1. Average gold document recall for the BGE-base retriever at different ANN search recall levels.
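As a rough illustration of the vector-compression benefit mentioned above, the sketch below scalar-quantizes float32 embeddings to int8 (a 4x memory reduction), scores documents in the compressed domain, and checks how many of the exact top-10 neighbors survive. This is a generic scalar-quantization example on synthetic data, not the compression scheme used by the Scalable Vector Search library.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_docs, k = 128, 20_000, 10

docs = rng.standard_normal((n_docs, dim)).astype(np.float32)
query = rng.standard_normal(dim).astype(np.float32)

# Scalar quantization: map values to int8 with a single global scale factor.
scale = np.abs(docs).max() / 127.0
docs_q = np.round(docs / scale).astype(np.int8)  # 1 byte per dimension instead of 4

exact_topk = np.argsort(-(docs @ query))[:k]
approx_scores = (docs_q.astype(np.float32) * scale) @ query  # scores with quantization error
approx_topk = np.argsort(-approx_scores)[:k]

overlap = len(set(exact_topk.tolist()) & set(approx_topk.tolist())) / k
print(f"memory: {docs.nbytes / 1e6:.1f} MB -> {docs_q.nbytes / 1e6:.1f} MB, top-{k} overlap = {overlap:.2f}")
```

In practice, the acceptable amount of compression follows directly from the minimum search recall the RAG pipeline tolerates: the lower that threshold, the more aggressively vectors can be compressed.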
Key Insights:
- The effectiveness of a RAG system heavily depends on the retriever model's ability to map queries and relevant documents to closely aligned vectors in the embedding space. Fine-tuning the retriever model may be necessary for optimal performance.
- Key parameters for optimizing vector search in RAG systems include the number of documents to retrieve and the search accuracy.
- Experiments with Llama and Mistral LLMs across question-answering datasets showed that retrieving five to 10 documents is enough, as too many documents can introduce noise. Mixing gold documents with random ones did not improve performance, challenging previous findings.
- If lower search recall is suitable for a specific application, it should be favored, as it enables faster and potentially more memory-efficient retrieval through strong vector compression.
References
[1] Alexandria Leto, Cecilia Aguerrebere, Ishwar Bhati, Ted Willke, Mariano Tepper, and Vy Ai Vo. Toward Optimal Search and Retrieval for RAG. November 2024. arXiv:2411.07396v1 [cs].
[2] Florin Cuconasu, Giovanni Trappolini, Federico Siciliano, Simone Filice, Cesare Campagnano, Yoelle Maarek, Nicola Tonellotto, and Fabrizio Silvestri. The Power of Noise: Redefining Retrieval for RAG Systems. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024). ACM, July 2024.