
Intel Scalable Vector Search: A Performance Library for Large-Scale Similarity Search

ScottBair
Employee

This technical deep dive was written by Mariano Tepper and Ted Willke as part of their research efforts while at Intel Labs.

Summary:

Vector search is at the core of the AI revolution. It provides applications with access to semantically relevant unstructured data. Scalable Vector Search (SVS) is a performance library for billion-scale similarity search that offers high-performance computing optimizations, along with vector compression and dimensionality reduction techniques. The compute optimizations shift the pressure back to memory, where the new vector compression and dimensionality reduction techniques reduce the amount of data fetched, providing additional speedup. Read on to learn how SVS establishes the state of the art in search performance and memory footprint. Future blogs in this series will dive deeper into vector compression, dimensionality reduction, and RAG systems.

In the last few years, high-dimensional vectors have become the quintessential data representation for unstructured data, such as images, audio, video, text, genomics, and computer code. These vectors, often called embeddings, are generated so that semantically related vectors are close to each other according to some similarity function.

 


 

Retrieving the most similar vectors to a query, among potentially billions of vectors, is a ubiquitous problem known as vector search (a.k.a. similarity search). It is relevant to an ever-growing range of applications, such as image generation, natural language processing, question answering, recommender systems, and ad matching.

The R in RAG

In recent years, Retrieval-Augmented Generation (RAG) models have become very popular, imbuing generative AI systems with long-term memory (e.g., allowing a chat agent to remember previous exchanges in a long conversation), knowledge modularization (e.g., responses can differ based on user credentials), and post-training adaptation (e.g., new knowledge can be added without re-training the model), while also helping mitigate hallucinations by grounding model output with existing knowledge. In RAGs, inference is prefaced by a retrieval step that is commonly implemented using vector search. The prompt is used to retrieve similar elements that can be presented to the model, which is then instructed to use this retrieved content when considering the prompt and generating the output.
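To make that retrieval step concrete, here is a minimal sketch of a RAG query flow in Python. The embed, index, and generate objects are hypothetical stand-ins for an embedding model, a vector index, and an LLM; they are not part of SVS or any specific framework.

```python
def answer_with_rag(prompt, embed, index, documents, generate, k=5):
    """Sketch of the retrieval step in a RAG pipeline.

    embed, index, and generate are hypothetical stand-ins for an embedding
    model, a vector index, and an LLM; they are not SVS or any specific API.
    """
    query_vector = embed(prompt)                      # encode the prompt as a vector
    neighbor_ids, _ = index.search(query_vector, k)   # retrieve the k most similar vectors
    context = [documents[i] for i in neighbor_ids]    # map vector ids back to source text
    return generate(prompt, context)                  # ground the answer in the retrieved text
```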

 



RAGs are rapidly changing the landscape of AI, and the demand for them is driving the rapid deployment of large-scale systems that are inherently expensive to own and operate. Lowering the operating cost of HW and SW systems is paramount to the operational viability of companies focused on products and services in this space. Vector search plays an important role in both the performance and cost of these systems. In small pre-production deployments of RAG, the vector search time is relatively quick compared to LLM inference and is often overlooked. However, vector search can become a bottleneck when scaling deployments to handle production traffic on large vector collections. Here, a large vector collection (think billions of vectors) needs to be searched concurrently by many users, and latencies can grow by orders of magnitude. The temptation is to reach for expensive GPUs with limited memory to improve performance, which can spiral costs out of control while only chipping away at the problem.

Our work shows that a modern data center CPU is sufficient to run vector search at production levels with sufficient speed (latency and throughput) while keeping the cost of the system down by saving on memory and storage.

Vector Search in a Nutshell

Large datasets with thousands of dimensions are increasingly common. For example, modern large language embedding models generate vectors with 768-1536 dimensions (and newer models even go to 4096 dimensions). Performing a linear scan over a dataset of a billion high-dimensional vectors to find the nearest neighbors, i.e., exact nearest neighbor search, becomes too slow to meet application requirements.
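For intuition, exact nearest neighbor search amounts to the brute-force scan sketched below: every query touches all n database vectors, so both the compute and the memory traffic grow linearly with the collection (the sizes here are illustrative, far below billion scale).

```python
import numpy as np

# Brute-force (exact) nearest neighbor search: O(n * d) work per query,
# and every database vector must be fetched from memory for every query.
rng = np.random.default_rng(0)
database = rng.standard_normal((100_000, 768), dtype=np.float32)  # n vectors of d dimensions
queries = rng.standard_normal((10, 768), dtype=np.float32)

scores = queries @ database.T                # inner-product similarity, shape (10, n)
top10 = np.argsort(-scores, axis=1)[:, :10]  # exact top-10 neighbors for each query
```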



We can avoid a full scan if we can accept a small error in the returned nearest neighbors (i.e., some vectors might be incorrectly retrieved). This is called approximate nearest neighbor search, and it tackles the problem on two fronts:

  • Vector indices are data structures used to organize the vectors so that a search only accesses a small portion of the entire set, thus reducing memory accesses.
  • Vector representations are data encoding formats that compress each vector so that it can be fetched faster and occupies less memory, while the similarity computation kernel remains efficient.

Vector Indices: How to Organize Vectors to Accelerate Their Search

Many tradeoffs must be considered when selecting an approach to vector indexing. A number of important considerations are shown in the diagram below and expanded upon.  The goal of designing a cost-effective performant system is to maximize these aspects, subject to the tradeoffs and the goals of the application. For example, the business cost model may favor reduced memory footprint at the expense of some accuracy. Or it may be more important to search vectors faster for an LLM application at the expense of additional indexing time. One of the biggest factors in balancing these aspects is the choice of indexing algorithm.

[Diagram: design considerations for vector indices - search speed, search accuracy, indexing speed, dimensional scaling, mutability, and footprint]

Search speed: Ideally, search time should grow much more slowly than the database size. This is particularly important when millions to billions of vectors are involved. Indices should offer low latency (how long a search takes to complete), as we do not want a user waiting, and good throughput (how many queries are served per second), as we want to service many users concurrently.

Search accuracy: Retrieving vectors that are actually similar to a given query is an essential component of this task. This is truly important in recommender systems, where we want to recommend things that are relevant, and in RAG, where there is a strong correlation between retrieval similarity and RAG performance.

Indexing speed: To accelerate the search, we need to organize the data in an orderly fashion. In stark contrast to the ease of indexing a totally ordered set (e.g., when indexing documents by date), indexing multidimensional vectors is hard and time consuming.

Dimensional scaling: As the number of dimensions grows, the curse of dimensionality kicks in, rendering a quick reduction of the search space difficult. Creating indices that show resistance to this phenomenon is as important as scaling with database size.

Mutability: In most cases, the contents of a database change over time: vectors are added, removed, or updated. As the time taken to rebuild an entire index from scratch with each modification is prohibitive, modern indices support in-place modifications, i.e., they are mutable. Additionally, the data distribution might change over time, and the index needs to adapt seamlessly to such changes.

Footprint: As databases grow in size (in dimensionality and number of vectors), so do the data structures underpinning the index. Today, memory and storage are important drivers of the cost of a HW system serving vector search.

Several categories of algorithms exist for approximate nearest neighbor search. The first solutions to appear were based on trees, which are ideally suited to indexing one-dimensional objects; their high-dimensional extensions, however, suffer from the curse of dimensionality, degrading as the number of dimensions grows. The next generation of solutions stems from inverted file indices (IVF), where clustering is used to explore only a subset of the database. These solutions are fast when low search accuracy is required, but their performance degrades very quickly when higher accuracy is needed. These indices also become hard to maintain in a dynamic setting, where the contents of the database change over time (adding, removing, and updating vectors). Graph methods constitute the current state of the art, offering higher accuracy at higher speeds, exhibiting better dimensional scaling, and handling the dynamic setting with ease.
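As an illustration of the IVF idea (a generic sketch, not SVS code), the snippet below clusters the database with k-means and, at query time, scans only the vectors assigned to the few closest centroids; probing more clusters trades speed for accuracy.

```python
import numpy as np
from scipy.cluster.vq import kmeans2  # any k-means implementation would do

def build_ivf(database, n_clusters=256):
    # Partition the database: each vector goes into the inverted list of its nearest centroid.
    centroids, assignments = kmeans2(database, n_clusters, minit="points", seed=0)
    inverted_lists = [np.where(assignments == c)[0] for c in range(n_clusters)]
    return centroids, inverted_lists

def ivf_search(query, database, centroids, inverted_lists, k=10, n_probe=8):
    # Visit only the n_probe clusters closest to the query instead of the whole database.
    closest = np.argsort(np.linalg.norm(centroids - query, axis=1))[:n_probe]
    candidates = np.concatenate([inverted_lists[c] for c in closest])
    distances = np.linalg.norm(database[candidates] - query, axis=1)
    return candidates[np.argsort(distances)[:k]]
```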

In graph methods, the index consists of a directed graph, where each vertex corresponds to a database vector, and the sparse set of edges represents neighbor relationships between vectors. These relationships are built so that the graph can be efficiently traversed to find the nearest neighbors in sub-linear time using a greedy best-first algorithm: at each step, move to the neighbor that is most similar to the query. Backtracking is used to return multiple neighbors within a single search.
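The sketch below shows a simplified version of that greedy best-first traversal; a bounded priority queue (the "beam") provides the backtracking that lets a single search return several neighbors. The graph here is a plain adjacency dictionary, whereas SVS uses far more optimized data structures.

```python
import heapq
import numpy as np

def greedy_graph_search(query, vectors, neighbors, entry_point, k=10, beam_width=32):
    # vectors: array of database vectors; neighbors: dict mapping vertex -> adjacent vertices.
    dist = lambda v: float(np.linalg.norm(vectors[v] - query))
    visited = {entry_point}
    frontier = [(dist(entry_point), entry_point)]   # candidates to expand, closest first
    beam = [(-dist(entry_point), entry_point)]      # best vertices so far (max-heap via negation)

    while frontier:
        d, v = heapq.heappop(frontier)
        if len(beam) >= beam_width and d > -beam[0][0]:
            break                                   # closest candidate cannot improve the beam
        for u in neighbors[v]:
            if u in visited:
                continue
            visited.add(u)
            du = dist(u)
            if len(beam) < beam_width or du < -beam[0][0]:
                heapq.heappush(frontier, (du, u))   # keep exploring from this vertex
                heapq.heappush(beam, (-du, u))
                if len(beam) > beam_width:
                    heapq.heappop(beam)             # drop the farthest vertex in the beam

    return [v for _, v in sorted((-nd, v) for nd, v in beam)[:k]]
```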

Vector Representations

Of equal importance to how the vectors are organized for search is how vectors are generated and represented. The literature on large-scale vector search emphasizes the computational intensity of this workload, with the goal of reducing the number of distance computations. However, a simple kernel for computing vector similarity (e.g., a vector dot product), which can be efficiently realized using modern vectorized instructions (e.g., AVX-512), resides at the heart of this workload. Thus, the real challenge lies in efficiently fetching vectors from main memory (or storage).

Additionally, the memory footprint occupied by the vectors themselves is quickly becoming a problem. For example, storing 100 million vectors, each with 1536 dimensions, in single-precision floating-point format requires 572 GiB. In some cases, the number and size of vectors may make it impossible to host them in the memory of a single machine.
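The arithmetic behind that figure, and the saving from even a modest compression rate of 2 (FP32 to FP16, discussed below), is easy to verify:

```python
n_vectors, dims = 100_000_000, 1536

fp32_bytes = n_vectors * dims * 4       # 4 bytes per single-precision value
fp16_bytes = n_vectors * dims * 2       # 2 bytes per half-precision value

print(f"FP32: {fp32_bytes / 2**30:.0f} GiB")  # -> 572 GiB
print(f"FP16: {fp16_bytes / 2**30:.0f} GiB")  # -> 286 GiB
```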

Thus, many solutions have focused on how to make the vectors smaller while retaining their key characteristics. Again, tradeoffs exist among several key considerations that favor different aspects of the problem. For example, one may opt to save memory by compressing vectors more aggressively in exchange for less resolution in their semantic affinities. In the diagram below, we expand upon these considerations and how they interact with each other.

[Diagram: design considerations for vector representations - similarity computation speed, compression rate, search accuracy, training complexity, mutability, encoding speed, and query distribution]

Similarity computation speed: Similarity computations must remain simple and fast, as we do not want to slow down the search needlessly.

Compression rate: This describes the amount of memory we are saving. In the simplest scenario, by compressing the entries of a vector from single-precision (FP32) to half-precision (FP16) floating-point values, we achieve a compression rate of 2.

Search accuracy: A high accuracy level is required in most modern use cases. We want to save memory and memory bandwidth while still reaching the required target accuracy. Achieving high compression rates in this high-accuracy regime is an open problem in vector search.

Training complexity: If the compression method is data-driven (e.g., using machine learning), training the model should be relatively fast. Training a deep neural network to compress vectors might take too long.

Mutability: The compression method needs to be robust to distribution shifts in the collection of vectors. If (partially) re-training a model is needed when a shift occurs, this needs to occur quickly to enable index mutability (see previous section).

Encoding speed: Any new vectors added to the database must be encoded using the compression method of choice. If such encoding is too computationally onerous, index mutability (see previous section) will be compromised. For example, although others have explored the use of deep neural networks for vector compression, deployments remain limited by the high inference runtime of such models.

Query distribution: The search accuracy must be preserved, particularly when query vectors have different characteristics than the database vectors. This is one consideration that we will expound on in a future blog in this series.


Most vectors can be processed into representations using three categories of transformation: (1) scalar quantization, (2) vector quantization, and (3) dimensionality reduction. More and more commonly, specific solutions combine these categories for compounded effects. In future blogs, we will dive into these different solutions and the innovations that we have introduced as part of our work at Intel Labs.
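As a flavor of the simplest of these, scalar quantization, here is a generic per-dimension int8 encoder and decoder in NumPy. This is a textbook scheme for illustration only; the specific compression and dimensionality reduction techniques used in SVS are the subject of the upcoming posts.

```python
import numpy as np

def sq_train(vectors):
    # "Training" here is just learning a per-dimension value range from the data.
    lo, hi = vectors.min(axis=0), vectors.max(axis=0)
    scale = np.maximum((hi - lo) / 255.0, 1e-12)  # avoid division by zero on constant dimensions
    return lo, scale

def sq_encode(vectors, lo, scale):
    # Map each FP32 entry to an 8-bit code: a compression rate of 4.
    return np.clip(np.round((vectors - lo) / scale), 0, 255).astype(np.uint8)

def sq_decode(codes, lo, scale):
    # Approximate reconstruction used when computing similarities.
    return codes.astype(np.float32) * scale + lo
```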

Vector Search at Intel

Many billion-scale similarity search systems lack the high-performance computing block-and-tackling necessary to wring out enough vectorized distance computation performance and sufficient memory savings, as described in the previous sections. In response to this, we have developed Intel Scalable Vector Search (SVS), which addresses these limitations with highly optimized implementations of vector indices and new vector compression techniques. SVS provides vector similarity search:

  • on billions of high-dimensional vectors
  • at high accuracy
  • at state-of-the-art speed
  • using less memory than its alternatives

This enables application and framework developers using similarity search to unleash its performance on Intel® Xeon® CPUs (2nd generation and newer). SVS offers a fully featured yet simple Python API compatible with most standard libraries. Additionally, SVS is written in C++ to facilitate its integration into performance-critical applications. Watch the demo below to see SVS in action.


Future blogs in this series will dive deeper into vector compression, dimensionality reduction, and RAG systems. For now, we hope you will give SVS a try and learn to appreciate for yourself the difference that smart indexing and representation choices can make!
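To get a feel for what that looks like in practice, here is a minimal sketch of building and querying an index through the Python bindings. The module, class, and parameter names below follow the library's public getting-started examples as best we can recall and may differ between releases, so treat them as assumptions and consult the SVS documentation for the authoritative API.

```python
import numpy as np
import pysvs  # SVS Python bindings; the module name may differ in newer releases

# Illustrative only: names and parameter values here are assumptions based on
# the public SVS examples, not an authoritative reference.
data = np.random.rand(10_000, 768).astype(np.float32)
queries = np.random.rand(100, 768).astype(np.float32)

parameters = pysvs.VamanaBuildParameters(graph_max_degree=64, window_size=128)
index = pysvs.Vamana.build(parameters, data, pysvs.DistanceType.L2)

neighbors, distances = index.search(queries, 10)  # ids and distances of the 10 nearest neighbors
```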

About the Author
Scott Bair is a Senior Technical Creative Director for Intel Labs, chartered with growing awareness of Intel’s leading-edge research activities, like AI, Neuromorphic Computing, and Quantum Computing. Scott is responsible for driving marketing strategy, messaging, and asset creation for Intel Labs and its joint-research activities. In addition to his work at Intel, he has a passion for audio technology and is an active father of 5 children. Scott has over 23 years of experience in the computing industry bringing new products and technologies to market. During his 15 years at Intel, he has worked in a variety of roles spanning R&D, architecture, strategic planning, product marketing, and technology evangelism. Scott has an undergraduate degree in Electrical and Computer Engineering and a Master of Business Administration from Brigham Young University.
1 Comment
mihaic
Employee

@ScottBair, did you intend to include a demo video? The last sentence is: "Watch the demo below to see SVS in action."