This technical deep dive was written by Mariano Tepper and describes joint work with Cecilia Aguerrebere, Ishwar Bhati, Mark Hildebrand, and Ted Willke, carried out as part of their research at Intel Labs.
Summary
As detailed in the previous post in this series, vector search has become a critical component in a wide range of applications that demand highly accurate and timely answers. High vector dimensionality puts vector search systems under compute and memory pressure, often leading to unacceptable performance. Additionally, cross-modal retrieval tasks have become increasingly common, e.g., where a user inputs a text query to find the most relevant images for that query. These queries often have different statistical distributions than the database embeddings (i.e., the queries are out of distribution), making it challenging to achieve high accuracy. In this post, we present LeanVec, a framework that combines dimensionality reduction with vector quantization to accelerate vector search on high-dimensional vectors while maintaining high accuracy in settings with out-of-distribution queries.
Introduction
In recent years, deep learning models have become very adept at creating high-dimensional embedding vectors whose spatial similarities reflect the semantic affinities between inputs (e.g., images, audio, video, text, genomics, and computer code). This capability has enabled a wide range of applications that search for semantically relevant results over massive collections of vectors by retrieving the nearest neighbors to a given query vector. Even considering the outstanding recent progress in similarity search, the performance of modern vector indices significantly degrades as the dimensionality increases.
The most prominent are graph indices, which consist of a directed graph where each vertex corresponds to a database vector and edges represent neighbor relationships between vectors, so that the graph can be efficiently traversed to find the nearest neighbors in sub-linear time. Although graph-based indices stand out for their high accuracy and performance at moderate dimensionalities (D ≈ 100), their efficiency suffers when dealing with the dimensionalities typically produced by deep learning models (D ≈ 512, 768, 1536). Addressing this performance gap is paramount in a world where similarity search deployments overwhelmingly use vectors that stem from deep learning models.
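As a point of reference, here is a minimal sketch of the kind of best-first traversal such graph indices perform. It is a simplified illustration (adjacency-list dict, Euclidean distance, ad-hoc parameter names), not SVS's actual API:

```python
import heapq
import numpy as np

def greedy_graph_search(graph, vectors, query, entry_point, search_window=50, k=10):
    """Simplified best-first traversal of a proximity graph.

    graph:   dict mapping vertex id -> list of neighbor ids
    vectors: (N, D) array of database vectors
    query:   (D,) query vector
    """
    def distance(i):
        # This random-access fetch of vectors[i] is what dominates the runtime.
        return float(np.linalg.norm(vectors[i] - query))

    visited = {entry_point}
    candidates = [(distance(entry_point), entry_point)]   # min-heap of vertices to expand
    results = [(distance(entry_point), entry_point)]      # bounded list of best vertices found
    while candidates:
        d, v = heapq.heappop(candidates)
        if len(results) >= search_window and d > results[-1][0]:
            break  # the closest unexpanded candidate is worse than everything we keep
        for u in graph[v]:
            if u not in visited:
                visited.add(u)
                du = distance(u)
                heapq.heappush(candidates, (du, u))
                results.append((du, u))
        results = sorted(results)[:search_window]
    return [v for _, v in results[:k]]
```

A production implementation is far more careful about memory layout and prefetching, but the access pattern is the point: every expanded vertex triggers fetches of its neighbors' vectors at effectively random memory locations.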
The root cause of this performance degradation is that graph search is bottlenecked by the memory latency and bandwidth of the system, most of which is consumed fetching database vectors from memory in a random-like access pattern. Compressing the vectors is a natural way to alleviate this memory pressure; however, existing vector compression techniques, such as PQ and SCANN, fall short: they either perform poorly under the random memory-access pattern of graph search or do not provide sufficient compression.
The Challenge with Out-of-Distribution Queries
Vector compression also becomes a more challenging problem when the statistical distributions of the database and the query vectors differ (here, we say that the queries are out-of-distribution or OOD; when the distributions match, the queries are in-distribution or ID). Unfortunately, this occurs frequently in modern applications, with two prominent examples. The first is cross-modal querying, where a query from one modality is used to fetch similar elements from a different modality; for instance, in text2image, text queries are used to retrieve semantically similar images. Second, queries and database vectors can be produced by different models, as in question-answering applications.
The two-dimensional example below illustrates the importance of query-aware dimensionality reduction for maximum inner product search. The optimal solution for a query-agnostic method (PCA, for example) would be to project the database (𝒳) and the query (Q) vectors onto the first principal axis of 𝒳 (large green arrow). This choice yields poor resolution of the inner products: the chosen direction is orthogonal to the principal axis of Q (orange arrow), and the direction that actually carries useful information (the second principal axis of 𝒳) is lost.
Figure 1. Using a query-aware technique, we can select a direction that maximally preserves the inner products. In this case the second principal direction of 𝒳 and the principal direction of Q coincide and provide the best choice.
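The effect is easy to reproduce numerically. The sketch below uses synthetic two-dimensional data (the variances and the `correlation_after_projection` helper are illustrative assumptions, not the figure's exact values) to compare the query-agnostic projection onto the first principal axis of 𝒳 with the query-aware choice:

```python
import numpy as np

rng = np.random.default_rng(0)
# Database vectors spread mostly along axis 0; query vectors spread mostly along axis 1.
X = rng.normal(size=(1000, 2)) * np.array([10.0, 1.0])   # first principal axis of X ~ axis 0
Q = rng.normal(size=(100, 2)) * np.array([0.01, 1.0])    # first principal axis of Q ~ axis 1

def correlation_after_projection(direction):
    """Correlation between the true inner products and those computed after
    projecting both queries and database vectors onto a single direction."""
    d = direction / np.linalg.norm(direction)
    true_ip = Q @ X.T
    approx_ip = np.outer(Q @ d, X @ d)
    return np.corrcoef(true_ip.ravel(), approx_ip.ravel())[0, 1]

print(correlation_after_projection(np.array([1.0, 0.0])))  # query-agnostic: first axis of X, poor
print(correlation_after_projection(np.array([0.0, 1.0])))  # query-aware: second axis of X = first axis of Q, high
```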
A Lightweight Technique for Dimensionality Reduction
LeanVec accelerates similarity search for deep learning embedding vectors by approximating the inner product between the query q and a database vector x as

<q, x> ≈ <DRquery(q), LVQ(DRDB(x))>

The projection functions DRquery and DRDB reduce the vector dimensionality (the number of entries in each vector), and LVQ is a locally-adaptive vector quantization method that reduces the number of bits per entry. LeanVec uses linear functions DRquery and DRDB to down-project the query and database vectors, as depicted in Figure 2 below.
Figure 2. A schematic depiction of the LeanVec framework.
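As a rough illustration of the quantization step, here is a simplified per-vector scalar quantizer in the spirit of LVQ. This is a sketch under stated assumptions (mean-centering plus per-vector uniform quantization), not SVS's actual implementation, which also supports a two-level residual scheme:

```python
import numpy as np

def lvq_compress(X, bits=8):
    """Simplified LVQ-style compression: center the data, then quantize each
    vector's entries uniformly between that vector's own min and max.
    Assumes bits <= 8 so the codes fit in uint8."""
    mean = X.mean(axis=0)
    centered = X - mean
    lo = centered.min(axis=1, keepdims=True)
    step = (centered.max(axis=1, keepdims=True) - lo) / (2**bits - 1)
    codes = np.round((centered - lo) / step).astype(np.uint8)
    return codes, lo, step, mean

def lvq_reconstruct(codes, lo, step, mean):
    """Approximate the original vectors from the per-vector codes and constants."""
    return codes * step + lo + mean

X = np.random.default_rng(1).normal(size=(1000, 768)).astype(np.float32)
codes, lo, step, mean = lvq_compress(X)
print(np.abs(X - lvq_reconstruct(codes, lo, step, mean)).max())  # small per-entry error
```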
LeanVec first takes each database vector x and produces two compressions (sketched in code after this list):
- A primary vector LVQ(DRDB(x)). The inner-product approximation <DRquery(q), LVQ(DRDB(x))> is a semi-accurate representation of <q, x>.
- A secondary vector LVQ(x). The inner-product approximation <q, LVQ(x)> is an accurate representation of <q, x>.
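A rough sketch of this preprocessing step follows, assuming a 768-dimensional dataset reduced to 256 dimensions; the random projection matrix and the crude per-vector 8-bit quantizer are illustrative stand-ins for the learned DRDB and for LVQ:

```python
import numpy as np

def quantize_8bit(V):
    """Crude per-vector 8-bit scalar quantization, standing in for LVQ."""
    lo = V.min(axis=1, keepdims=True)
    step = (V.max(axis=1, keepdims=True) - lo) / 255
    return np.round((V - lo) / step).astype(np.uint8), lo, step

rng = np.random.default_rng(0)
D, d = 768, 256                                       # original and reduced dimensionality
X = rng.normal(size=(10_000, D)).astype(np.float32)   # database vectors
DR_db = rng.normal(size=(D, d)).astype(np.float32)    # stand-in for the learned projection

# Primary vectors: dimensionality reduction followed by quantization,
# used to build and traverse the graph.
primary = quantize_8bit(X @ DR_db)
# Secondary vectors: quantization only, kept at full dimensionality,
# used solely to re-rank search candidates.
secondary = quantize_8bit(X)
```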
The primary vectors are used to build and search the graph. Notably, our experimental results show that the graph construction is robust to quantization with LVQ and to our dimensionality reduction. The secondary vectors are only used for search.
During search, the primary vectors are used for traversing the graph index. The reduced memory footprint decreases the time it takes to fetch each vector from memory, and the lower dimensionality reduces the computational effort (fewer fused multiply-add operations are needed). This approximation enables efficient inner-product calculations with individual database vectors (no batch processing required), which makes it ideal for the random memory-access pattern encountered in graph search.
We compensate for the errors in the inner-product approximation by retrieving more candidates than requested and using the secondary vectors to re-rank them before returning the top k. The dimensionality reduction for the query (i.e., computing DRquery(q)) is done only once per search, incurring a negligible overhead in the overall runtime.
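A hedged sketch of the resulting two-stage search; the function and parameter names are illustrative, and `primary_index` stands in for the graph traversal over the primary vectors:

```python
import numpy as np

def search_with_reranking(query, DR_query, primary_index, secondary_vectors,
                          k=10, rerank_factor=5):
    """Two-stage search: traverse the graph with primary vectors, then re-rank.

    primary_index:     callable taking (reduced query, candidate count) and returning
                       candidate ids from the graph built over the primary vectors
    secondary_vectors: (N, D) array used only for the final re-ranking step
    """
    reduced_query = query @ DR_query                # computed once per search
    candidates = primary_index(reduced_query, k * rerank_factor)
    # Re-rank the candidates with the accurate, full-dimensional secondary vectors.
    scores = secondary_vectors[candidates] @ query  # maximum inner product: higher is better
    order = np.argsort(-scores)
    return [candidates[i] for i in order[:k]]
```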
Note that searches are an essential part of the graph construction process. As such, our achieved search acceleration directly translates into graph construction acceleration.
In LeanVec, we learn the projection functions DRquery and DRDB from data using novel mathematical optimization algorithms. These algorithms are designed for computational efficiency: their runtime depends on the number of dimensions and not on the number of vectors. Our algorithms also consider the statistical distributions of the database vectors and of a small sample of representative query vectors.
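The exact optimization algorithms are described in the paper. As a loose illustration of what query-aware dimensionality reduction can look like, the sketch below builds a projection from the top eigenvectors of the combined second-moment matrices of the database and a query sample; this is a simplified baseline under assumed names and data, not LeanVec's actual algorithm:

```python
import numpy as np

def query_aware_projection(X, Q, d):
    """Keep the d directions carrying the most energy in the database AND in a
    sample of queries, via the eigenvectors of the sum of their normalized
    second-moment matrices. The D x D eigendecomposition does not grow with
    the number of vectors."""
    M = X.T @ X / len(X) + Q.T @ Q / len(Q)                   # D x D
    eigenvalues, eigenvectors = np.linalg.eigh(M)
    U = eigenvectors[:, np.argsort(eigenvalues)[::-1][:d]]    # top-d eigenvectors
    return U, U   # same linear map for queries and database vectors in this sketch

rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 768)).astype(np.float32)   # database sample
Q = rng.normal(size=(500, 768)).astype(np.float32)     # representative query sample
DR_query, DR_db = query_aware_projection(X, Q, d=256)
approx_inner_products = (Q @ DR_query) @ (X @ DR_db).T  # approximates Q @ X.T
```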
Results
The results speak for themselves. Although Scalable Vector Search (SVS) is already great without LeanVec, outperforming the best open-source implementation of an algorithm widely considered to be the most performant (HNSWlib), it is even better with LeanVec. The query throughput increases by nearly 4x at the same level of recall (95% 10-recall@10) due to the reduction in per-query memory bandwidth demand.
Figure 3. Search throughput results in queries per second on HNSWlib, SVS without LeanVec, and SVS with LeanVec.
Conclusion
LeanVec combines linear dimensionality reduction with vector quantization to accelerate similarity search on high-dimensional vectors produced by modern embedding models. LeanVec is particularly well suited for the case where queries are out of distribution, as observed in text2image and question-answering applications, for example.
For more information, check out our paper. LeanVec, when paired with the Intel Scalable Vector Search performance library, delivers high vector search performance while using less memory.