
KVCrush: Rethinking KV Cache Alternative Representation for Faster LLM Inference


Gopi Krishna Jha and Sameh Gobriel are research scientists at Intel Labs specializing in algorithms and AI optimizations, and co-author Nilesh Jain is a principal engineer at Intel Labs specializing in AI system architecture and optimizations. Liubov Talamanova and Vasily Shamporov are OpenVINO architects specializing in designing AI/ML optimization for the Intel OpenVINO framework.

 

Highlights

  • Developed by Intel, KVCrush can reduce the LLM KV cache size by up to 4x with less than a 1% accuracy drop, enabling faster inference.
  • KVCrush introduces a novel binary representation and low-overhead memory pruning for efficient KV cache management.
  • Compatible with existing KV cache compression and paging schemes, KVCrush will be integrated into the OpenVINO™ GenAI library starting with the 2025.3 release.

 

With KVCrush, Intel researchers have reimagined how large language models (LLMs) store their "memories," shrinking the memory footprint significantly while preserving the model's capabilities and making LLM inference faster and more accessible. Accepted as a paper at the Asian Conference on Machine Learning (ACML 2025), this research introduces a novel binary representation and low-overhead memory pruning for efficient key-value (KV) cache management. KVCrush can seamlessly combine with various KV cache compression technologies, offering a compatible approach for accelerating LLM inference. KVCrush will be integrated into the OpenVINO™ GenAI library starting with the 2025.3 release.

KVCrush addresses the challenge of dramatically reducing LLM memory use without losing accuracy. Imagine trying to have a long, detailed prompt session with an artificial intelligence (AI) model, asking it to summarize an entire book or analyze a massive document. To handle these complex tasks quickly and efficiently, large language models rely on a clever internal system called the KV cache. Think of it as the LLM's short-term memory, storing bits of information it has already "thought about" so it doesn't have to re-think everything from scratch. This is crucial for making LLMs fast and responsive, especially when generating long, coherent texts.

However, the longer the conversation or the more information an LLM processes, the bigger and more demanding the KV cache becomes. This massive memory footprint is a huge roadblock, limiting how many users an LLM can serve simultaneously and the complexity of the tasks it can perform. While researchers have tried to shrink this memory, such approaches often sacrifice the AI model's accuracy, making it less reliable.

KVCrush proposes an alternative representation scheme for key-value states, coupled with a low-overhead token pruning algorithm that considers the token distribution within the KV cache. This approach allows for a significantly smaller memory footprint while preserving the accuracy of the model. KVCrush can reduce the KV cache size by up to 4x with less than a 1% accuracy drop, achieving state-of-the-art (SoTA) average accuracy with minimal overhead. This technology not only outperforms the accuracy of SoTA importance-based token retention schemes but is also fully compatible with typical practical LLM deployments that use KV cache paging schemes (such as vLLM's PagedAttention) and mixed-precision quantization.

 

KVCrush's Novel Approach: Alternative Token Representation

To tackle the KV cache memory challenge, KVCrush introduces a hardware-efficient alternative representation for tokens. Instead of the original high-dimensional floating-point vectors (which can have hundreds or thousands of dimensions, such as 4,096 dimensions for Llama-65B), KVCrush converts each token's representation into a much smaller binary vector. This compact binary representation, with a length equal to the number of attention heads (for example, 128 heads for Llama-65B), is derived from analyzing how distinct attention heads in the LLM process information.

The main intuition behind the alternative representation is that each attention head is a specialized detector, focusing on different semantic aspects of a token (for example, its recency, punctuation, or specific contextual importance). By capturing these individual “decisions” across all heads, we can create a compact yet semantically rich binary fingerprint for each token. As a result, the compact binary representation of each token preserves crucial semantic insights about token importance and similarities while significantly reducing the representation size. Moreover, this short binary representation enables far faster distance computations (for example, using Hamming distances to measure the difference between two sequences) to semantically compare any pair of tokens. Hamming distance computation for binary vectors is exceptionally efficient compared to operations on high-dimensional floating-point vectors.
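To make the efficiency argument concrete, the short sketch below packs a per-head bit vector into an integer and compares two tokens with an XOR followed by a popcount. It is a minimal illustration only: the 128-bit fingerprint length and the random bit patterns are assumptions for the example, not values taken from the KVCrush implementation.

```python
import numpy as np

NUM_HEADS = 128  # illustrative fingerprint length (one bit per attention head)

def pack_fingerprint(bits: np.ndarray) -> int:
    """Pack a 0/1 vector of length NUM_HEADS into a single Python integer."""
    return int("".join(str(b) for b in bits.astype(int)), 2)

def hamming_distance(a: int, b: int) -> int:
    """Count the attention heads on which two tokens 'disagree' (XOR + popcount)."""
    return bin(a ^ b).count("1")

# Two hypothetical token fingerprints.
rng = np.random.default_rng(0)
tok_a = pack_fingerprint(rng.integers(0, 2, NUM_HEADS))
tok_b = pack_fingerprint(rng.integers(0, 2, NUM_HEADS))

# A few integer operations, versus thousands of floating-point multiplies for a
# distance between two high-dimensional key/value vectors.
print(hamming_distance(tok_a, tok_b))
```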

Figure 1 outlines how KVCrush represents tokens as short binary vectors while preserving their semantic information. These binary representations act as proxies for the tokens during eviction, feeding a low-overhead grouping algorithm that ensures better context representation.

 


 

Figure 1. Creating binary feature vectors using head behavior. KVCrush intuition: each attention head treats (focuses on) a given token in a different way.

 

The process of generating the alternative binary representation for input tokens involves a few key steps. First, the attention weight matrix is computed (this is a standard part of LLM inference, so KVCrush reuses this computation without added overhead). Second, for each attention head, a threshold is applied, deciding if a token would typically be “retained” (a “1” bit) or “evicted” (a “0” bit) based on attention scores — a method consistent with conventional KV compression algorithms like H2O or SnapKV. Finally, these individual bits from all attention heads are then collated into a compact binary feature vector, serving as the token's distinct digital signature.
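To make these steps concrete, here is a minimal Python sketch. It assumes, purely for illustration, that the attention weights have already been aggregated into one score per head per cached token and that every head applies a simple top-k threshold; the exact aggregation and thresholds used by KVCrush may differ.

```python
import numpy as np

def binary_fingerprints(attn_scores: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """
    attn_scores: [num_heads, num_tokens] aggregated attention score that each head
    assigns to each cached token (the aggregation itself is assumed for this sketch).
    Returns a [num_tokens, num_heads] 0/1 matrix: bit h of token t is 1 if head h
    would retain token t under its own top-k threshold, and 0 if it would evict it.
    """
    num_heads, num_tokens = attn_scores.shape
    k = max(1, int(keep_ratio * num_tokens))
    fingerprints = np.zeros((num_tokens, num_heads), dtype=np.uint8)
    for h in range(num_heads):
        threshold = np.partition(attn_scores[h], -k)[-k]  # k-th largest score for head h
        fingerprints[attn_scores[h] >= threshold, h] = 1  # 1 = "retained", 0 = "evicted"
    return fingerprints

# Toy example: 4 attention heads, 10 cached tokens.
scores = np.random.default_rng(1).random((4, 10))
print(binary_fingerprints(scores, keep_ratio=0.3))
```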

 

Low-Overhead Token Grouping and Cache Eviction

KVCrush intelligently manages the KV cache by grouping tokens and deciding which ones to keep. Think of it like this: the total memory available for the KV cache is smartly divided. One part is reserved for the most “important” tokens, identified by established compression methods like H2O, PyramidKV, or SnapKV. The other part is where KVCrush shines, storing a carefully chosen set of “representative” tokens. These aren't just random leftovers — they act as stand-ins for the tokens that would otherwise be discarded, making sure the model still has a complete understanding of the context even with less memory (hence, the improved accuracy compared to other KV cache compression technologies).
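The sketch below shows one way such a budget split could look in code. The 75/25 split between important and representative tokens, the importance scores, and the function name are illustrative assumptions rather than KVCrush's actual policy; the point is simply that the importance-based method fills part of the budget and KVCrush fills the rest from the would-be-evicted pool.

```python
import numpy as np

def split_budget(importance: np.ndarray, total_budget: int, rep_fraction: float = 0.25):
    """
    Split the KV cache budget between 'important' tokens (the highest-scoring ones,
    as an H2O/SnapKV-style method would rank them) and a reserve that KVCrush fills
    with representative tokens chosen from the eviction candidates.
    """
    rep_budget = int(total_budget * rep_fraction)
    important_budget = total_budget - rep_budget
    # Tokens kept outright by the importance-based method.
    important_idx = np.argsort(importance)[-important_budget:]
    # Everything else is the candidate pool from which KVCrush picks its representatives.
    candidate_idx = np.setdiff1d(np.arange(importance.size), important_idx)
    return important_idx, candidate_idx, rep_budget

importance = np.random.default_rng(2).random(1000)  # toy per-token importance scores
kept, candidates, rep_budget = split_budget(importance, total_budget=128)
print(len(kept), len(candidates), rep_budget)       # 96 904 32
```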

 


 

Figure 2. Low-overhead token grouping algorithm to ensure all token groups are represented in the resulting KV cache.

 

Figure 2 highlights the main idea behind KVCrush's token grouping. Once tokens are transformed into their compact binary fingerprints, the grouping algorithm selects a set of anchor points and calculates how close each token's binary fingerprint is to each anchor using the Hamming distance, a computation that is remarkably fast thanks to the binary representation. Each token is then assigned to the bucket of its closest anchor, and a single representative is chosen from each bucket. This ensures that a diverse range of token types is always present in the reduced cache, which is key to maintaining accuracy. The entire grouping step is lightweight, adding less than 0.5% to total inference latency, and is significantly faster than more compute-intensive alternatives such as k-means clustering.
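Continuing the toy numbers from the budget sketch above, the following code groups the 904 eviction candidates by Hamming distance to a small set of binary anchors and keeps one representative per bucket. The random anchor patterns and the "member closest to its anchor" selection rule are assumptions made for illustration; the paper evaluates its own anchor-selection strategies.

```python
import numpy as np

def kvcrush_style_grouping(fingerprints: np.ndarray, num_reps: int, seed: int = 0) -> np.ndarray:
    """
    fingerprints: [num_tokens, num_heads] 0/1 matrix for the eviction candidates.
    Returns the indices of up to `num_reps` representative tokens, one per bucket.
    """
    num_tokens, num_heads = fingerprints.shape
    rng = np.random.default_rng(seed)
    # Anchor points living in the same binary space as the fingerprints (assumed random here).
    anchors = rng.integers(0, 2, size=(num_reps, num_heads), dtype=np.uint8)

    # Hamming distance of every token to every anchor: XOR, then count the differing bits.
    dists = (fingerprints[:, None, :] ^ anchors[None, :, :]).sum(axis=2)
    buckets = dists.argmin(axis=1)            # each token joins its closest anchor's bucket

    reps = []
    for b in range(num_reps):
        members = np.flatnonzero(buckets == b)
        if members.size:                      # empty buckets contribute no representative
            reps.append(members[dists[members, b].argmin()])
    return np.asarray(reps)

# For example: pick 32 stand-ins from 904 candidates with 128-bit fingerprints.
candidates = np.random.default_rng(3).integers(0, 2, size=(904, 128), dtype=np.uint8)
print(kvcrush_style_grouping(candidates, num_reps=32))
```

Because each comparison is just an XOR and a popcount, the cost grows linearly with the number of tokens and anchors, which is why this grouping stays far cheaper than running k-means on the original floating-point vectors.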

 

Seamless Integration and High Performance

KVCrush is engineered for efficient integration with existing KV cache optimization technologies, including various compression methods, mixed-precision quantization, and KV cache paging schemes. This inherent compatibility facilitates its straightforward deployment within practical LLM inference pipelines.

 


Figure 3. Comparison of KVCrush with PyramidKV, SnapKV, and H2O on the LongBench workload.

 

As substantiated by our comprehensive evaluations (see Figure 3), when paired with the most effective KV compression methods, KVCrush consistently achieves the highest accuracy across the majority of datasets and delivers strong average accuracy on both Llama-3-8B-Instruct and Mistral-7B-Instruct-v0.2 models. This performance demonstrates KVCrush's robust capability to mitigate the accuracy degradation typically associated with higher KV compression ratios, thereby providing a more reliable and precise solution for memory-efficient LLM inference.

 

KVCrush Integration into OpenVINO™ GenAI

Starting with the 2025.3 release, KVCrush is integrated into the OpenVINO GenAI library. This integration is made possible by OpenVINO's existing architecture, which uses a paged attention mechanism and continuous batching for efficient LLM serving. OpenVINO's paged attention manages memory in logical blocks, and KVCrush is engineered to operate at this same block-wise level, making it a natural and efficient fit.

Within this framework, KVCrush intelligently manages a portion of the KV cache memory budget. It works alongside other popular compression methods like H2O and SnapKV by using a low-overhead grouping algorithm to retain a diverse and representative set of tokens. This ensures that even with a reduced memory footprint, the model's context is preserved, leading to minimal accuracy loss and more scalable LLM inference. This powerful combination allows OpenVINO GenAI to support more efficient and responsive LLM deployments without compromising on performance.
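For readers who want to experiment with this eviction path, the sketch below shows how cache eviction is enabled through the scheduler configuration in OpenVINO GenAI. It is a usage sketch based on the cache-eviction API exposed in recent GenAI releases: the parameter values are arbitrary, the model path is a placeholder, and the exact option names (including any KVCrush-specific switches added in 2025.3) should be confirmed against the official OpenVINO GenAI documentation.

```python
import openvino_genai as ov_genai

# Scheduler with KV cache eviction enabled (names per recent GenAI releases;
# check the 2025.3 documentation for the KVCrush-specific options).
scheduler_config = ov_genai.SchedulerConfig()
scheduler_config.use_cache_eviction = True
scheduler_config.cache_eviction_config = ov_genai.CacheEvictionConfig(
    start_size=32,        # tokens at the start of the sequence that are always kept
    recent_size=128,      # most recent tokens that are always kept
    max_cache_size=672,   # overall per-sequence KV cache budget, in tokens
    aggregation_mode=ov_genai.AggregationMode.NORM_SUM,
)

# "./llama-3-8b-instruct-ov" is a placeholder path to an OpenVINO-converted model.
pipe = ov_genai.LLMPipeline("./llama-3-8b-instruct-ov", "CPU",
                            scheduler_config=scheduler_config)
print(pipe.generate("Summarize the following document: ...", max_new_tokens=256))
```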

The following table (Figure 4) presents the accuracy of OpenVINO with KVCrush technology compared to the default OpenVINO paged attention with page eviction, measured with OpenVINO test cases on an x86 platform (CPU: Intel® Xeon® Gold 6430L, memory: 1008 GB, OS: Ubuntu 22.04.3 LTS, Linux kernel 5.15.0-144-generic).

 


Figure 4. Accuracy results of OpenVINO with KVCrush technology.

 

Conclusion and Future Work

KVCrush presents a significant improvement in optimizing large language model inference by offering a new way to represent tokens. KVCrush uses a compact binary encoding, combined with a highly efficient grouping and pruning algorithm, leading to a notable reduction in KV cache memory usage. This approach makes it possible to run memory-efficient LLM inference pipelines without compromising the quality of the generated text. The integration of KVCrush into OpenVINO GenAI, starting with the 2025.3 release, will enable customers on Intel platforms to achieve faster and more efficient LLM inference. Our ongoing work will investigate dynamic strategies for KVCrush's cache budget allocation and aim to improve token grouping through more advanced multi-anchoring techniques.

 

References

  1. Jha, G. K., Gobriel, S., Talamanova, L., Kozlov, A., & Jain, N. (2025). KVCrush: Key value cache size-reduction using similarity in head-behaviour. arXiv:2503.00022. https://arxiv.org/abs/2503.00022
  2. Ge, S., Zhang, Y., Liu, L., Zhang, M., Han, J., & Gao, J. (2023). Model tells you what to discard: Adaptive KV cache compression for LLMs. arXiv:2310.01801. https://arxiv.org/abs/2310.01801
  3. Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., . . . Barrett, C. (2023). H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems. https://proceedings.neurips.cc/paper_files/paper/2023/file/6ceefa7b15572587b78ecfcebb2827f8-Paper-Conference.pdf
  4. Li, Y., Huang, Y., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., . . . Chen, D. (2024). SnapKV: LLM knows what you are looking for before generation. arXiv:2404.14469. https://arxiv.org/abs/2404.14469
  5. Zhang, Y., Gao, B., Liu, T., Lu, K., Xiong, W., Dong, Y., . . . Xiao, W. (2024). PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. arXiv:2406.02069. https://arxiv.org/abs/2406.02069
  6. Bai, Y., Lv, X., Zhang, J., Lyu, H., Tang, J., Huang, Z., . . . Li, J. (2024). LongBench: A bilingual, multitask benchmark for long context understanding. arXiv:2308.14508. https://arxiv.org/abs/2308.14508