
Scalable Vector Search: Deep Dive Series

Cecilia Aguerrebere

These technical deep dives were written by Cecilia Aguerrebere, Mariano Tepper, Alexandria Leto, Vy Vo, Ishwar Bhati, Mark Hildebrand, and Ted Willke as part of their research efforts while at Intel Labs.

Highlights:

  • Vector search is at the core of the AI revolution, and this blog series is here to teach you all about it.
  • Our blog series introduces scalable vector search and dives deeper into vector compression, dimensionality reduction, and Retrieval-Augmented Generation systems.

In recent years, high-dimensional vectors have become the quintessential representation for unstructured data such as images, audio, video, text, genomics, and computer code. These vectors are generated so that semantically related items lie close to each other under some similarity function. The task of retrieving the vectors most similar to a query – sometimes from among billions – is known as vector search. Vector search is relevant to an ever-growing range of applications, including image generation, natural language processing, question answering, recommender systems, and ad matching. However, many billion-scale similarity search systems lack the high-performance computing blocking and tackling necessary to deliver the required computational performance and memory savings. In response, Intel Labs developed Intel® Scalable Vector Search (Intel SVS), which features highly optimized implementations of vector indices and new vector compression techniques. See Intel SVS in action below.

Figure 1. Using the latest-generation Intel® Xeon® 6972P processor (96 cores), this video compares the Intel SVS library to HNSWlib, a widely used open-source vector search library, showing a latency advantage of over 8x on 45 million vector embeddings with 1536 dimensions.
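To ground the terminology, here is a minimal NumPy sketch of what exact (brute-force) vector search computes: the k database vectors most similar to a query under cosine similarity. The data here is random and purely illustrative; libraries like SVS exist because this linear scan stops being viable at the scales shown above.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "database" of 10,000 embeddings with 128 dimensions (illustrative only).
database = rng.standard_normal((10_000, 128)).astype(np.float32)
query = rng.standard_normal(128).astype(np.float32)

def knn_cosine(query, database, k=10):
    """Exact k-nearest-neighbor search under cosine similarity."""
    # Normalize so the inner product equals cosine similarity.
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = db @ q                            # one similarity per database vector
    top_k = np.argpartition(-scores, k)[:k]    # unordered top-k
    return top_k[np.argsort(-scores[top_k])]   # indices, most similar first

print(knn_cosine(query, database, k=5))
```

Every query touches every vector, so the cost grows linearly with the collection; approximate indices like the ones in SVS avoid that scan while keeping accuracy high.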

Want to learn more? Our recent blog series covers:

  • An Introduction to Scalable Vector Search
  • Vector Compression
  • Dimensionality Reduction
  • RAG Systems

Read on for a summary of each blog post.

Intel Scalable Vector Search: A Performance Library for Large-Scale Similarity Search

The first post in the series introduces scalable vector search, its relevance in today’s world, and the metrics used to characterize it. It also details Intel Scalable Vector Search (SVS), which features highly optimized implementations of vector indices and new vector compression techniques. Intel SVS provides vector similarity search on billions of high-dimensional vectors at state-of-the-art speed with high accuracy, using less memory than its alternatives. Check out the first blog in our series to learn more and see a demo of SVS in action!
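To give a feel for how the library is driven, the sketch below builds a graph index and queries it with the SVS Python bindings. Class and parameter names follow the SVS documentation at the time of writing and may differ across versions, and the input files are placeholders, so treat this as a sketch to verify against the official docs rather than a definitive recipe.

```python
import numpy as np
import svs  # Intel Scalable Vector Search Python bindings

# Assumption: (n, d) float32 embeddings and queries stored in hypothetical .npy files.
data = np.load("embeddings.npy").astype(np.float32)
queries = np.load("queries.npy").astype(np.float32)

# Build a Vamana graph index (names per the SVS docs; verify for your version).
parameters = svs.VamanaBuildParameters(graph_max_degree=64, window_size=128)
index = svs.Vamana.build(parameters, data, svs.DistanceType.MIP)

# Larger search window -> higher recall at the cost of latency.
index.search_window_size = 50

# Retrieve the 10 nearest neighbors for each query.
neighbors, distances = index.search(queries, 10)
```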

Vector Quantization for Scalable Vector Search

Vector search has become a critical component in a wide range of applications that demand highly accurate and timely answers. This blog introduces a novel vector compression algorithm, Locally-adaptive Vector Quantization (LVQ). LVQ uses a simple and efficient compression scheme that shrinks the memory footprint of the index, keeping SVS and the overall system fast even in memory-intensive regimes. After centering the data, LVQ scales each vector individually (the local adaptation) and then applies uniform scalar quantization. Although not specifically optimized for million-scale search, SVS with LVQ achieves excellent vector search performance while using less memory, yielding state-of-the-art results.
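The first level of LVQ can be sketched in a few lines. This simplified NumPy version follows the description above (global centering, per-vector scaling, B-bit uniform quantization) but omits SVS's optimized memory layout and the second-level residual codes described in the full post.

```python
import numpy as np

def lvq_encode(X, bits=8):
    """Simplified one-level LVQ: center, then per-vector uniform scalar quantization."""
    mu = X.mean(axis=0)                        # global centering
    Xc = X - mu
    lo = Xc.min(axis=1, keepdims=True)         # per-vector lower bound (local adaptation)
    hi = Xc.max(axis=1, keepdims=True)         # per-vector upper bound
    delta = (hi - lo) / (2**bits - 1)          # per-vector quantization step
    delta = np.maximum(delta, 1e-12)           # guard against constant vectors
    codes = np.round((Xc - lo) / delta).astype(np.uint8)  # B-bit codes
    # Only the codes plus two scalars (lo, delta) per vector need to be stored.
    return codes, lo, delta, mu

def lvq_decode(codes, lo, delta, mu):
    return codes * delta + lo + mu

X = np.random.default_rng(0).standard_normal((1_000, 128)).astype(np.float32)
codes, lo, delta, mu = lvq_encode(X)
print("max reconstruction error:", np.abs(lvq_decode(codes, lo, delta, mu) - X).max())
```

Because the scale adapts to each vector, LVQ spends its B bits on that vector's actual dynamic range instead of a single global range, which is what keeps the quantization error small.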

Dimensionality Reduction for Scalable Vector Search

This post presents LeanVec, a framework that combines dimensionality reduction with vector quantization to accelerate vector search on high-dimensional vectors while maintaining high accuracy, even in settings where queries are out-of-distribution, such as text2image and question-answering applications. Furthermore, when paired with the Intel SVS performance library, LeanVec achieves excellent vector search performance using less memory.
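As a rough sketch of the idea (using PCA as a stand-in for LeanVec's learned projections, which are optimized jointly for the query and database distributions), one can search in a reduced-dimensional space and then re-rank the shortlist with the original full-precision vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 768)).astype(np.float32)  # toy database
q = rng.standard_normal(768).astype(np.float32)             # toy query

# Dimensionality reduction via PCA (a stand-in for LeanVec's learned projections).
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X[:10_000] - mu, full_matrices=False)
P = Vt[:128].T                                 # project 768 -> 128 dimensions

X_low = (X - mu) @ P
q_low = (q - mu) @ P

# Cheap search in the reduced space, then exact re-ranking of the candidates.
candidates = np.argpartition(np.linalg.norm(X_low - q_low, axis=1), 100)[:100]
reranked = candidates[np.argsort(np.linalg.norm(X[candidates] - q, axis=1))]
top10 = reranked[:10]
```

Most of the distance computations happen on 128-dimensional vectors instead of 768-dimensional ones; only the short candidate list pays the full-dimensional cost, which is where the speedup comes from.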

Making Vector Search Work Best for RAG

Retrieval-Augmented Generation (RAG) systems are rapidly changing the landscape of AI, and the demand for them is driving the rapid deployment of large-scale systems that are inherently expensive to own and operate. Vector search plays an important role in both the performance and the cost of these hardware and software systems. In small pre-production RAG deployments, vector search is fast relative to LLM inference and is often overlooked. However, it can become a bottleneck when deployments scale to handle production traffic over large vector collections. This post summarizes insights from our study on optimizing vector search settings in RAG systems and offers actionable guidelines for improving RAG pipeline efficiency and effectiveness.
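One of the practical knobs in any such tuning study is the index's search window size, which trades recall for latency. The sketch below uses the open-source hnswlib library (mentioned in Figure 1) on toy data to measure that trade-off, so a RAG deployment can pick the smallest setting that still meets its retrieval-quality target; the corpus, query set, and parameter values are illustrative assumptions, not recommendations.

```python
import time
import numpy as np
import hnswlib

rng = np.random.default_rng(0)
docs = rng.standard_normal((50_000, 384)).astype(np.float32)    # toy document embeddings
queries = rng.standard_normal((100, 384)).astype(np.float32)    # toy user queries

index = hnswlib.Index(space="ip", dim=384)
index.init_index(max_elements=len(docs), ef_construction=200, M=16)
index.add_items(docs)

# Exact top-10 neighbors as ground truth for measuring recall.
truth = np.argsort(-(docs @ queries.T), axis=0)[:10].T

for ef in (10, 50, 200):
    index.set_ef(ef)                  # search window: the recall vs. latency knob
    t0 = time.perf_counter()
    labels, _ = index.knn_query(queries, k=10)
    per_query = (time.perf_counter() - t0) / len(queries)
    recall = np.mean([len(set(l) & set(t)) / 10 for l, t in zip(labels, truth)])
    print(f"ef={ef:4d}  recall@10={recall:.3f}  latency={per_query * 1e3:.2f} ms/query")
```

In a full RAG pipeline the same measurement would be taken against end-to-end answer quality rather than recall alone, since past a certain recall the LLM's output no longer improves and the extra search latency is wasted.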

About the Author
Cecilia Aguerrebere is a Research Scientist in the Brain-Inspired Computing Lab at Intel Labs, and her research focuses on developing new algorithms for high-performance vector similarity search. Previously, she was a postdoctoral research associate at Duke University and a research scientist at Ceibal Foundation (Uruguay). Cecilia holds a Ph.D. in electrical engineering (signal and image processing) from Télécom Paris (France) and an M.S. in applied mathematics from the École Normale Supérieure de Cachan (France).