
Unleashing AI's Potential: Exploring the Intel AVX-512 Integration with the Milvus Vector Database

ssair
Employee

Co-Authors:
Alexandr Guzhva, Principal Software Engineer, Zilliz
Li Liu, Principal Software Engineer, Zilliz

Intel's cutting-edge developer tools empower developers building AI workloads, and this is especially critical for Vector Search (semantic similarity search). Vector Search is crucial for AI applications such as chatbots with retrieval-augmented generation (RAG), e-commerce product recommendations, and anomaly/fraud detection, which often deal with collections of a billion or more vectors. Intel and Zilliz lead the charge in ensuring that leveraging hardware acceleration becomes a seamless, efficient developer experience, driving high-performance AI solutions built on Vector Search.

The Intel AVX-512 instruction set is a natural evolution of Advanced Vector Extensions for the x86 instruction set. Compared to the AVX2 instruction set, AVX-512 provides wider SIMD registers (512-bit vs 256-bit for AVX2) and an additional 16 SIMD registers (32 vs. 16 for AVX2). Various AVX-512 extensions provide new specialized instructions. It is straightforward to write code using AVX-512 intrinsics if one is familiar with AVX2 intrinsics.

Milvus, a project built and maintained by the developers at Zilliz, stands out as the world's most advanced open-source vector database. It has gained widespread adoption across numerous AI applications, including image processing and recommendation systems, as well as helping large language models (LLMs) reduce their hallucinations. Starting from version 0.7.0, Milvus has integrated support for AVX-512, significantly enhancing its processing capabilities. In this blog, we will delve into how Milvus leverages AVX-512 and explore the performance gains it achieves.

How Milvus Leverages Intel AVX-512

Similarity metrics, or the methods for computing the distance between two vectors, are central to Vector Search. Vector Search, also known as vector similarity or nearest neighbor search, is used in data retrieval and information retrieval systems to find items or data points that are similar or closely related to a given query vector. In Vector Search, we represent data points, such as images, texts, and audio, as vectors in a high-dimensional space. The goal of vector search is to efficiently search and retrieve the most relevant vectors that are similar or nearest to a query vector.

The Milvus core search engine (Knowhere) uses AVX-512 for the most performance-critical code. Code profiling indicates that the majority of search time is spent in a small number of performance-critical hot spots, such as the routines responsible for distance computations.

Typically, these hot spots are relatively small pieces of code. Therefore, it makes sense to maintain a separate version of each hot spot for every instruction set, while the rest of the library uses generic code (such as SSE4 on the x86 platform). As a result, the Milvus core search engine library can deliver close-to-optimal performance on every CPU family: it identifies the available advanced instruction sets when it loads and selects the corresponding specialized code for each hot spot accordingly.

Comparing Intel AVX-512 and AVX2 Compute Performance

In this short study, we focused purely on comparing AVX-512 versus AVX2 compute performance. We used our standard benchmarking tool, VectorDBBench, applied directly to the Milvus core search engine, skipping all the compute-unrelated parts. Milvus uses its open-source Knowhere library, which relies on heavily modified versions of the industry-standard FAISS, hnswlib, and DiskANN libraries.

Overall, we did not vary the index and query parameters much in this study, because the computational bottlenecks remain the same across settings; the performance difference was expected to be a property of each indexing method itself.

All computations were performed on an Amazon EC2 m7i.2xlarge instance on 10/25/23:

  • CPU: Intel(R) Sapphire Rapids 8488C CPU, up to 3.2 GHz
  • Number of virtual cores: 8
  • RAM: 32 GB
  • OS: Ubuntu 22.04.3 LTS with Linux kernel 6.2.0-1017-aws

The code was compiled with GCC 9.5.0 (Ubuntu 9.5.0-1ubuntu1~22.04).

Datasets

We used two similar datasets, for both of which the data and the index reside entirely in RAM, allowing us to focus on the compute performance comparison:

  • Search case 5: Cohere, 1M vectors, 768 dimensions, cosine distance
  • Search case 10: OpenAI, 500K vectors, 1536 dimensions, cosine distance

The number of queries was set to 10K. The number of required nearest neighbors (Top-K) was set to 100.

We’ve used the following very popular indices for our comparison: HNSW, IVFFLAT, and IVFSQ.

Results

HNSW (Hierarchical Navigable Small World Graph) is the state-of-the-art method for approximate nearest neighbor search. HNSW searches for nearest neighbors by traversing several interconnected graphs that differ in density. Two different sets of parameters were used, in which M represents the out-degree of the graph (the larger M is, the higher the accuracy and the lower the performance), while ef represents the length of the candidate queue during search (the larger ef is, the higher the accuracy and the lower the performance).

[image1.png, image2.png: HNSW benchmark results, AVX-512 vs. AVX2]

IVFFLAT is the most straightforward IVF index: it splits the original data into buckets, each holding a portion of the original data. The number of IVF buckets was set to 1024. The number of probed buckets for every query was set to 64.

[image3.png, image4.png: IVFFLAT benchmark results, AVX-512 vs. AVX2]

IVFSQ is another IVF index; it stores an SQ8-quantized representation of the original data in every bucket. The number of IVF buckets was set to 1024. The number of probed buckets for every query was set to 64.

[image5.png, image6.png: IVFSQ benchmark results, AVX-512 vs. AVX2]

Analysis

The performance gain for the HNSW and IVFFLAT indices is about 10%, because the bottleneck for these indices is memory bandwidth rather than compute. Still, it is a free 10% performance gain.

On the other hand, the IVFSQ index demonstrates almost a 2x performance improvement: it has a 4x lower memory bandwidth requirement, so compute becomes the limiting factor. This is where AVX-512 shows its advantage over AVX2.

There is a tiny difference between the recall rates for AVX-512 and AVX2. The addition and multiplication of floating-point numbers are not associative, which may lead to tiny discrepancies between the results of distance computations in the AVX-512 and AVX2 code paths, especially for large dimensionalities. For example, a dot product between two vectors is computed by accumulating partial dot products in every lane of a SIMD register and then performing a final horizontal-sum operation to obtain the dot product value. An AVX-512 SIMD register has 16 float lanes, while an AVX2 register has only 8, so the rounding errors differ. This is acceptable because neither AVX-512 nor AVX2 shows any systematic advantage in recall, meaning these rounding errors are effectively random.

Stay tuned for more Intel AVX-512 improvements in Milvus

We will keep adding AVX-512 improvements to Milvus's internal code because they give us extra performance for free. In upcoming Milvus versions, the core search engine will contain new AVX-512 code for the SCANN index. You can watch our progress on our GitHub repository, try Milvus or Zilliz Cloud for yourself with one of our bootcamps, or give us a star for our efforts!

 

Notices and Disclaimers

Performance varies by use, configuration, and other factors. Learn more on the Performance Index site.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available ​updates. See backup for configuration details. No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software, or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

About the Author
Suleyman is a Principal Engineer on the Xeon Software team in the Data Center and AI group at Intel, helping software developers make the most of Intel Xeon processor features. He has a PhD in Computer Science and Engineering from UC San Diego.