This technical deep dive was written by Mariano Tepper, and is a joint work with Cecilia Aguerrebere, Ishwar Bhati, Mark Hildebrand, and Ted Wilke as part of their research efforts while at Intel Labs.
Summary:
The first post in this series introduced vector search, its relevance in today’s world, and the important metrics used to characterize it. We can achieve dramatic gains in vector search systems by improving their internal vector representations, as the majority of the search runtime is spent bringing vectors from memory to compute their similarity with the query. The focus of this post, Locally-adaptive Vector Quantization (LVQ), accelerates the search, lowers the memory footprint, and preserves the efficiency of the similarity computation.
Introduction
Graph-based methods for approximate nearest neighbor search have been empirically found to yield fewer errors at higher speeds than alternative approaches. Yet, despite accessing only a small number of the dataset's vectors per query, graph-based search algorithms still suffer from limited throughput at very large database sizes and from a substantial memory footprint that makes single-machine deployment challenging. This is because graph algorithms access vectors in a random pattern, which creates further challenges for the system's throughput:
- With random accesses, hardware prefetchers in CPUs are ineffective and are not able to hide memory latency.
- Vectors become difficult to cache.
From these two points, we can infer that, although most of the literature on large-scale similarity search emphasizes the computational intensity of this workload, the random memory access pattern, combined with the simplicity of the distance computation kernel (which maps efficiently onto AVX instructions), ultimately pushes the workload deep into the memory-bound realm.
Locally-Adaptive Quantization: A Critical Ingredient for High Performance in SVS
Graph algorithms address the scale problem by reducing the number of vectors they explore. However, further gains can be achieved by reducing the size of these vectors, thereby lowering both memory access times and memory footprint. Existing vector compression techniques introduce challenges of their own: they unacceptably lower accuracy and/or significantly slow down distance calculations, often canceling any gains.
We introduce a novel vector compression algorithm, Locally-adaptive Vector Quantization (LVQ) [1]. LVQ uses a simple and efficient compression method to keep SVS and the system operating in a memory-intensive regime. After centering the data, LVQ scales each vector individually (i.e., the local adaptation) and then performs uniform scalar quantization.
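To make the recipe concrete, here is a minimal sketch of the first-level LVQ encoding of a single vector, assuming per-vector minimum/maximum scaling after mean-centering; the names and the exact rounding rule are illustrative and do not come from the SVS codebase.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal sketch of first-level LVQ encoding (illustrative, not the SVS code).
// 'bits' is the number of bits per dimension (assumed <= 8 here).
struct LvqVector {
    float lower;                 // per-vector lower bound (the local adaptation)
    float delta;                 // per-vector quantization step
    std::vector<uint8_t> codes;  // one code per dimension
};

LvqVector lvq_encode(const std::vector<float>& x,
                     const std::vector<float>& mean,
                     int bits) {
    // 1. Center the vector with the dataset mean.
    std::vector<float> centered(x.size());
    for (std::size_t j = 0; j < x.size(); ++j) centered[j] = x[j] - mean[j];

    // 2. Scale this vector individually: its own min/max define the range.
    const auto [lo_it, hi_it] =
        std::minmax_element(centered.begin(), centered.end());
    const float lower = *lo_it;
    const float range = *hi_it - lower;
    const float delta =
        range > 0.0f ? range / float((1u << bits) - 1) : 1.0f;  // guard constant vectors

    // 3. Uniform scalar quantization within that per-vector range.
    LvqVector out{lower, delta, std::vector<uint8_t>(x.size())};
    for (std::size_t j = 0; j < x.size(); ++j)
        out.codes[j] = static_cast<uint8_t>(
            std::floor((centered[j] - lower) / delta + 0.5f));
    return out;
}

// Decoding inverts the affine map: x_j ≈ mean_j + lower + delta * codes_j.
```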
LVQ offers superior accuracy compared to other scalar quantization techniques that are equally computationally efficient. To show this, we compare three different scaling methods:
- Global scaling: the whole dataset is treated as a collection of numbers to scale.
- Per-dimension scaling: each dimension is scaled individually.
- LVQ scaling: each vector is scaled individually.
The different scaling methods are represented in the diagram below on the left, and their effectiveness is shown on the right. Shown are the empirical distributions of the values in each vector for a standard dataset. We show the mean across vectors plus/minus two standard deviations (for a Gaussian distribution, 95% of the values are contained within two standard deviations). For 95% of the vectors, global and per-dimension scaling only utilize around 60% and 75% of the available range, respectively. This means that if each value is encoded, for example, using 8 bits, only 5 or 6 bits are effectively used. On the other hand, LVQ scaling utilizes the entire range, not wasting precious bits and thus yielding a superior encoding.
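The range-utilization argument can be made concrete with a small computation; the sketch below is a toy illustration under the assumption that the data have already been mean-centered, and is not taken from the paper's code.

```cpp
#include <algorithm>
#include <vector>

// Dataset-wide range: the single scale that global scaling forces on every vector.
float global_range(const std::vector<std::vector<float>>& centered_data) {
    float lo = centered_data[0][0], hi = centered_data[0][0];
    for (const auto& v : centered_data)
        for (float val : v) { lo = std::min(lo, val); hi = std::max(hi, val); }
    return hi - lo;
}

// Fraction of that global range a single vector actually spans. A value of
// ~0.6 corresponds to the "around 60% of the available range" observation
// above; under LVQ scaling the utilization is 1.0 by construction, since each
// vector is scaled by its own range.
float range_utilization(const std::vector<float>& centered_vector, float global) {
    const auto [lo, hi] =
        std::minmax_element(centered_vector.begin(), centered_vector.end());
    return (*hi - *lo) / global;
}
```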
When using very high-dimensional vectors (e.g., 768 dimensions as in large language models) or when extremely high search accuracy is required, LVQ has a built-in second-level quantization. We use the first level to conduct a search with more neighbors than the requested number and the second level to re-rank the list of retrieved elements. This corrects the order, and we can finally trim the size of the ranked list to the needed number, as shown below.
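The two-level flow can be sketched as follows; the callbacks stand in for the graph search over first-level codes and for the refined second-level distance, and are purely illustrative rather than the SVS API.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Sketch of search with second-level re-ranking (illustrative, not the SVS API).
std::vector<std::size_t> search_with_reranking(
    const std::vector<float>& query,
    std::size_t k,                 // number of neighbors the user asked for
    std::size_t overquery_factor,  // how many extra candidates to retrieve
    // Graph search over the first-level LVQ codes: returns (distance, id) pairs.
    const std::function<std::vector<std::pair<float, std::size_t>>(
        const std::vector<float>&, std::size_t)>& level1_search,
    // Refined distance computed with the second-level (residual) codes.
    const std::function<float(const std::vector<float>&, std::size_t)>& level2_distance) {
    // 1. Search with more neighbors than requested, using the coarse codes.
    auto candidates = level1_search(query, k * overquery_factor);

    // 2. Re-rank: recompute each candidate's distance with the finer codes,
    //    which corrects the ordering perturbed by first-level quantization.
    for (auto& [dist, id] : candidates) dist = level2_distance(query, id);
    std::sort(candidates.begin(), candidates.end());

    // 3. Trim the ranked list to the k neighbors that were actually requested.
    if (candidates.size() > k) candidates.resize(k);
    std::vector<std::size_t> result;
    result.reserve(candidates.size());
    for (const auto& c : candidates) result.push_back(c.second);
    return result;
}
```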
LVQ improves search performance with a two-fold acceleration of similarity computations and an eight-fold reduction of the vector size. LVQ also decreases memory footprint with little impact on accuracy compared to uncompressed vectors. LVQ, when combined with SVS, establishes the new state of the art in terms of performance and memory footprint for billion-scale similarity search.
Going Turbo
LVQ stores consecutive logical dimensions sequentially in memory, packing groups of eight 4-bit dimensions into 32-bit words (we use 4 bits per dimension to illustrate). While simple and convenient, this choice requires significant effort to unpack encoded dimensions into a more useful form. Each pair of 32-bit words (containing 16 dimensions) needs to be unpacked into an AVX-512 register containing sixteen 32-bit unsigned integers. This operation takes seven assembly instructions. Once we have the data in this unpacked format, we can undo the quantization and perform the partial similarity computation for these 16 dimensions.
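In scalar terms, the sequential layout looks roughly like the sketch below (the bit ordering within a word is an assumption made for illustration); the per-dimension shift amounts are what make the vectorized unpack into sixteen 32-bit lanes comparatively expensive.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sequential layout: dimension d lives in bits
// [4 * (d % 8), 4 * (d % 8) + 4) of 32-bit word d / 8.
std::vector<uint32_t> pack_sequential(const std::vector<uint8_t>& codes) {
    std::vector<uint32_t> words((codes.size() + 7) / 8, 0u);
    for (std::size_t d = 0; d < codes.size(); ++d)
        words[d / 8] |= uint32_t(codes[d] & 0xF) << (4 * (d % 8));
    return words;
}

uint8_t unpack_sequential(const std::vector<uint32_t>& words, std::size_t d) {
    // Each dimension needs its own shift amount before masking.
    return uint8_t((words[d / 8] >> (4 * (d % 8))) & 0xF);
}
```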
With Turbo LVQ, we recognized that consecutive logical dimensions need not be stored consecutively in memory, and we permute their order to enable faster decompression with SIMD instructions. Turbo LVQ uses a permuted memory layout that stores groups of 128 dimensions in 64 bytes of memory. Dimension 0 is stored in the first 4 bits of the first register lane, dimension 1 is stored in the first 4 bits of the second register lane, and the pattern continues until dimension 16, which is stored in the second 4 bits of the first register lane (see the diagram above).
When unpacking, the entire 64-byte block is loaded into an AVX-512 register as 16 lanes of 32-bit integers. Then, the first sixteen dimensions are extracted by simply applying a bitwise mask to each lane. For subsequent dimensions, a shift needs to be applied before recovering the lowest 4 bits. With this strategy, unpacking 16 dimensions requires only two assembly instructions (see the intrinsics sketch after this list):
- One load and one mask for the first 16 dimensions
- One shift and one mask for each following group of 16 dimensions up to 128
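A sketch of this unpack with AVX-512 intrinsics is shown below for the first two groups of 16 dimensions; the function name and the exact nibble placement are illustrative assumptions, not the SVS source.

```cpp
#include <immintrin.h>
#include <cstdint>

// Illustrative Turbo-style unpack: 'block' points to one 64-byte group holding
// 128 four-bit codes in the permuted layout, where 32-bit lane i, nibble j
// holds logical dimension 16 * j + i.
void unpack_first_32_dims(const uint8_t* block, uint32_t out[32]) {
    const __m512i data = _mm512_loadu_si512(block);  // one load: sixteen 32-bit lanes
    const __m512i mask = _mm512_set1_epi32(0xF);

    // Dimensions 0..15: a single bitwise mask per lane.
    const __m512i dims_0_15 = _mm512_and_si512(data, mask);
    _mm512_storeu_si512(out, dims_0_15);

    // Dimensions 16..31: one shift, then the same mask. Each further group of
    // 16 dimensions repeats this shift-and-mask pattern (shifts of 8, 12, ... bits).
    const __m512i dims_16_31 = _mm512_and_si512(_mm512_srli_epi32(data, 4), mask);
    _mm512_storeu_si512(out + 16, dims_16_31);
}
```

In a real kernel, the masked nibbles would presumably feed straight into the similarity accumulation rather than being written back to memory; the store is only there to keep the sketch self-contained.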
All in all, with its computational savings, Turbo LVQ pushes the graph-based search problem even further into the memory-intensive regime.
Performance
On the standard LAION dataset with 100 million vectors, SVS outperforms the four most popular and carefully tuned libraries by a wide margin (see figure below). Here, 10-recall@10 is a standard accuracy metric for similarity search that measures the proportion of the true 10 nearest neighbors that are present among the 10 retrieved neighbors.
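For reference, the metric for a single query amounts to the small computation below (a generic illustration, not tied to any particular library).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// 10-recall@10 for one query: the fraction of the true 10 nearest neighbors
// that appear among the 10 ids returned by the search.
double recall_at_10(const std::vector<std::size_t>& retrieved,     // 10 returned ids
                    const std::vector<std::size_t>& ground_truth)  // true 10 nearest
{
    std::size_t hits = 0;
    for (std::size_t id : retrieved)
        if (std::find(ground_truth.begin(), ground_truth.end(), id) !=
            ground_truth.end())
            ++hits;
    return double(hits) / double(ground_truth.size());
}
```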
In addition to offering better performance, SVS has a smaller memory footprint. SVS can be configured to use different amounts of memory, trading throughput for footprint depending on the user's needs. In the figure below, LVQ outcompetes the second-best alternative:
- In the low-memory regime (R=32), by up to 20.7x in throughput with up to 3x less memory
- In the high-throughput regime (R=126), by 5.8x in throughput with 1.4x less memory
Although not specifically optimized for million-scale search, SVS with LVQ achieves excellent vector search performance using less memory, yielding state-of-the-art results. For more information, check out our papers on LVQ and Turbo LVQ.