The KV cache (key-value cache) is a memory optimization technique used in transformer-based language models to accelerate text generation by caching previously computed attention key-value pairs. During autoregressive decoding, where tokens are generated sequentially, the KV cache eliminates redundant recomputation of keys and values for tokens that have already been processed, significantly improving inference speed and reducing computational overhead 1).
Transformer models rely on multi-head self-attention, where each attention head computes three representations from the input: queries (Q), keys (K), and values (V). During the forward pass, the attention output is computed as softmax(QK^T/√d_k)V, where d_k is the dimension of the key vectors. In standard inference without caching, generating each new token requires recomputing keys and values for all previously processed tokens, giving an attention cost of O(n²) per decoding step for a prefix of length n 2).
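To make the uncached case concrete, the following minimal NumPy sketch decodes one token at a time with a single attention head; the projection matrices, dimensions, and token states are hypothetical placeholders, not any real model's weights.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = k.shape[-1]
    return softmax(q @ k.T / np.sqrt(d_k)) @ v

# Hypothetical single-head setup (no batching, no caching).
rng = np.random.default_rng(0)
d_model = 8
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))
tokens = rng.standard_normal((5, d_model))   # hidden states of 5 already-generated tokens

for t in range(1, len(tokens) + 1):
    prefix = tokens[:t]
    q = prefix[-1:] @ W_q        # query for the newest token only
    k = prefix @ W_k             # K recomputed for ALL t prefix tokens at every step
    v = prefix @ W_v             # V recomputed for ALL t prefix tokens at every step
    out = attend(q, k, v)        # attention output for the new token
```

Every iteration re-projects the entire prefix through W_k and W_v, which is exactly the redundancy the KV cache removes.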
The KV cache stores the computed K and V matrices from previous decoder steps, allowing the model to reuse these values when processing the next token. When generating the token at position t, the model computes Q only for the new token and performs attention against the cached K, V pairs from positions 1 through t-1, plus the newly computed K, V for position t. This reduces the attention computation from O(n²) to O(n) per token generation step. In effect, the cache keeps high-dimensional key and value representations of every previous token in memory so that generating the next token never re-derives them 3).
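Continuing the same hypothetical setup (reusing attend, the weight matrices, and tokens from the sketch above), cached decoding appends one row of K and V per step and reuses everything computed earlier:

```python
# Cached decoding: K and V for old positions are computed once and reused.
k_cache = np.empty((0, d_model))
v_cache = np.empty((0, d_model))

for t in range(len(tokens)):
    x_t = tokens[t:t + 1]                           # hidden state of the new token
    q_t = x_t @ W_q                                 # Q only for the new position
    k_cache = np.concatenate([k_cache, x_t @ W_k])  # append K for position t
    v_cache = np.concatenate([v_cache, x_t @ W_v])  # append V for position t
    out = attend(q_t, k_cache, v_cache)             # O(t) work per step instead of O(t^2)
```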
While KV caching dramatically accelerates inference, it introduces memory overhead that scales with sequence length and batch size. For a model with h attention heads of per-head dimension d_h and sequence length n, the KV cache requires approximately 2 × n × h × d_h × (bytes per element) per layer, where the factor of 2 accounts for storing both keys and values. Across all L layers, total KV cache memory grows as O(2 × L × n × h × d_h). This linear scaling with context length means that for long-context applications, KV cache memory can become the primary memory bottleneck, sometimes exceeding the memory occupied by the model parameters themselves during inference 4).
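A rough back-of-the-envelope check of this formula, with illustrative dimensions that are not taken from any particular model:

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size: 2 (K and V) x layers x tokens x heads x head_dim."""
    return 2 * n_layers * batch_size * seq_len * n_heads * head_dim * bytes_per_elem

# Hypothetical 7B-class configuration: 32 layers, 32 heads of dimension 128, bf16 cache.
size = kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128, seq_len=32_768)
print(f"{size / 2**30:.1f} GiB")   # 16.0 GiB for a single 32K-token sequence
```

At that rate, a batch of just a few long sequences already rivals the roughly 13 GiB occupied by 7 billion bf16 parameters.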
Although KV cache memory grows only linearly with context length, it becomes prohibitive at scale, particularly for context windows in the 100K-1M token range 5), underscoring the need for memory optimization strategies. Several techniques have been developed to reduce KV cache memory requirements. KV cache quantization stores cached tensors in lower-precision formats (int8 or int4) rather than full precision (float32 or bfloat16), reducing memory usage by 4-8× with minimal quality degradation. Cache sparsification methods selectively retain only high-importance keys and values, discarding less relevant cached vectors. Sliding window attention limits caching to a fixed window of recent tokens, which is particularly effective for tasks where long-distance dependencies are less critical. Modern implementations achieve significant memory reduction through aggressive optimization; for example, DeepSeek-V4 is reported to use just 10% of DeepSeek-V3.2's KV cache at 1M-token context through aggressive cache reduction and layered memory systems 6).
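As a minimal sketch of one such technique, the snippet below applies symmetric per-row int8 quantization to a cached tensor and shows the one-line sliding-window variant; the helper names are hypothetical and not the API of any particular framework.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-row int8 quantization of a cached K or V tensor."""
    scale = np.maximum(np.abs(x).max(axis=-1, keepdims=True), 1e-8) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

k_cache = np.random.default_rng(0).standard_normal((1024, 128)).astype(np.float32)

# int8 storage is 4x smaller than float32; dequantize lazily when attention needs it.
k_int8, k_scale = quantize_int8(k_cache)
k_restored = dequantize_int8(k_int8, k_scale)

# Sliding-window variant: keep only the most recent `window` cached positions.
window = 512
k_cache = k_cache[-window:]
```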
In typical transformer inference frameworks, the KV cache is implemented either as a fixed-size buffer allocated at model initialization based on the maximum expected sequence length, or as a dynamically allocated structure that grows with each generated token. Most production implementations (such as those in vLLM or DeepSpeed inference) pre-allocate KV cache memory for entire batches to ensure contiguous memory access patterns and optimal GPU utilization 7).
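A toy version of such a pre-allocated (static) per-layer cache is sketched below; the class, shapes, and method names are illustrative assumptions rather than the actual data structures used by vLLM or DeepSpeed.

```python
import numpy as np

class StaticKVCache:
    """Fixed-size per-layer cache allocated up front for a maximum sequence length."""
    def __init__(self, max_len, n_heads, head_dim, dtype=np.float16):
        shape = (max_len, n_heads, head_dim)
        self.k = np.zeros(shape, dtype=dtype)
        self.v = np.zeros(shape, dtype=dtype)
        self.length = 0                       # number of positions filled so far

    def append(self, k_t, v_t):
        """Write K and V for the newest position into the pre-allocated buffer."""
        self.k[self.length] = k_t
        self.v[self.length] = v_t
        self.length += 1

    def view(self):
        """Return only the filled prefix, which is what attention reads."""
        return self.k[:self.length], self.v[:self.length]
```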
The PagedAttention mechanism further optimizes KV cache memory allocation by dividing the cache into logical “pages” that map to physical memory blocks, allowing non-contiguous allocation similar to virtual memory in operating systems. This approach reduces memory fragmentation and enables more efficient batch packing, yielding up to 4× higher throughput than contiguous allocation strategies.
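The idea can be illustrated with a toy block table; the block size, shapes, and method names are assumptions made for the sketch, not vLLM's actual implementation.

```python
import numpy as np

BLOCK_SIZE = 16   # tokens per logical page

class PagedKVCache:
    """Toy block-table cache: logical pages map to arbitrary physical blocks."""
    def __init__(self, num_blocks, n_heads, head_dim, dtype=np.float16):
        self.k_blocks = np.zeros((num_blocks, BLOCK_SIZE, n_heads, head_dim), dtype=dtype)
        self.v_blocks = np.zeros_like(self.k_blocks)
        self.free = list(range(num_blocks))   # pool of unused physical blocks
        self.block_tables = {}                # sequence id -> list of physical block ids

    def append(self, seq_id, pos, k_t, v_t):
        table = self.block_tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:             # logical page filled: claim a new physical block
            table.append(self.free.pop())
        block = table[pos // BLOCK_SIZE]
        self.k_blocks[block, pos % BLOCK_SIZE] = k_t
        self.v_blocks[block, pos % BLOCK_SIZE] = v_t

# A 40-token sequence occupies ceil(40 / 16) = 3 physical blocks, wherever they happen to be.
cache = PagedKVCache(num_blocks=64, n_heads=8, head_dim=64)
for pos in range(40):
    cache.append(seq_id=0, pos=pos, k_t=np.zeros((8, 64)), v_t=np.zeros((8, 64)))
```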
Despite its effectiveness, KV caching imposes several constraints on model deployment. Long-context applications requiring millions of tokens can exhaust available GPU memory even with aggressive quantization. Sequence length generalization presents another challenge: models trained for specific context lengths sometimes perform poorly when extended to longer sequences, due to positional encoding assumptions and shifts in attention patterns. Techniques like position interpolation and rotary position embeddings partially address this issue, which remains an active research area 8).
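Position interpolation, for example, can be sketched as rescaling position indices so an extended context stays within the positional range seen during training; this is a simplified view, since real implementations fold the scaling into the rotary embedding computation.

```python
import numpy as np

def interpolated_positions(seq_len, train_len):
    """Rescale positions so a longer sequence maps back into the trained range."""
    scale = min(1.0, train_len / seq_len)
    return np.arange(seq_len) * scale

# e.g. an 8192-token context squeezed into a 4096-position training range: 0.0, 0.5, 1.0, ...
positions = interpolated_positions(seq_len=8192, train_len=4096)
```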
Batched inference with variable-length sequences further complicates KV cache management, as padding all sequences to the maximum length wastes cache memory. This tension between memory efficiency and computational efficiency remains a key limitation in scaling transformer-based systems to handle diverse production workloads.