vLLM is a high-throughput, memory-efficient open-source inference and serving engine for large language models, developed by UC Berkeley's Sky Computing Lab. It features innovations like PagedAttention for KV cache management and continuous batching for dynamic request handling.1)
PagedAttention treats GPU memory like an operating system's virtual memory, breaking the KV cache into smaller, reusable non-contiguous pages instead of large per-request blocks.2) This approach nearly eliminates internal and external memory fragmentation and lets sequences share KV cache blocks, for example across parallel samples of the same prompt.
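To make the paging idea concrete, here is a minimal, illustrative sketch of PagedAttention-style block allocation; it is not vLLM's implementation, and the class names and block count are invented for the example. The key point is that a sequence's block table maps logical token positions to physical blocks that need not be contiguous.

```python
BLOCK_SIZE = 16  # tokens per KV cache block (vLLM's default block size)

class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared free pool."""
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))

    def allocate(self):
        if not self.free_blocks:
            raise MemoryError("KV cache exhausted; request must be preempted")
        return self.free_blocks.pop()

    def free(self, block_id):
        self.free_blocks.append(block_id)

class Sequence:
    """Tracks one request's logical-to-physical block mapping."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full,
        # so no memory is reserved for tokens that were never generated.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=8)
seq = Sequence(allocator)
for _ in range(40):          # generate 40 tokens
    seq.append_token()
# 40 tokens at 16 tokens/block -> ceil(40/16) = 3 physical blocks
print(seq.block_table)       # [7, 6, 5]: non-contiguous blocks are fine
```

Because blocks are small and uniformly sized, the only waste is the unfilled tail of each sequence's last block, which is what eliminates the large-reservation fragmentation of contiguous per-request allocation.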
The mechanism separately optimizes prefill (initial prompt processing) and decode (iterative token generation) phases, with features like prefix caching and chunked prefill for long sequences.
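Prefix caching builds directly on the block structure: a full block's KV entries can be reused by any request whose prompt starts with the same tokens. A toy sketch of the idea, assuming a hash-keyed cache of full blocks (the function names and block size are illustrative, not vLLM internals):

```python
BLOCK_SIZE = 4  # small for illustration; vLLM defaults to 16

def block_hashes(tokens):
    """Hash each *full* block together with all tokens before it,
    so a block's key depends on its entire prefix context."""
    hashes = []
    for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
        hashes.append(hash(tuple(tokens[:end])))
    return hashes

cache = {}          # block hash -> physical block id
next_block = 0

def lookup_or_allocate(tokens):
    """Return (block table, number of cache hits) for a prompt."""
    global next_block
    table, hits = [], 0
    for h in block_hashes(tokens):
        if h in cache:
            table.append(cache[h]); hits += 1
        else:
            cache[h] = next_block
            table.append(next_block)
            next_block += 1
    return table, hits

system_prompt = list(range(8))            # shared 8-token prefix = 2 blocks
t1, hits1 = lookup_or_allocate(system_prompt + [100, 101, 102, 103])
t2, hits2 = lookup_or_allocate(system_prompt + [200, 201, 202, 203])
print(hits1, hits2)   # 0 2: the second request reuses both shared prefix blocks
```

This is why workloads with a common system prompt (chat serving, multi-turn sessions) benefit most: the shared prefix is prefilled once and then served from cache.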
vLLM uses continuous (dynamic) batching, processing incoming requests in a continuous stream rather than static batches.3) This raises GPU utilization and throughput: new requests join the running batch as soon as slots free up, instead of waiting for an entire static batch to finish generating.
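A toy simulation makes the difference from static batching visible; the scheduler below is purely illustrative (request names, batch size, and token counts are invented), but it captures the core rule: the batch is re-formed at every decode step.

```python
from collections import deque

MAX_BATCH = 2  # maximum sequences decoded together per step

def run(requests):
    """requests: list of (name, tokens_to_generate).
    Returns [(name, step finished)] in completion order."""
    waiting = deque(requests)
    running, done, step = [], [], 0
    while waiting or running:
        # Admit new requests as soon as slots are free (continuous batching).
        while waiting and len(running) < MAX_BATCH:
            name, need = waiting.popleft()
            running.append([name, need])
        step += 1                       # one decode step for the whole batch
        for seq in running:
            seq[1] -= 1                 # each running sequence emits one token
        finished = [s for s in running if s[1] == 0]
        running = [s for s in running if s[1] > 0]
        done += [(s[0], step) for s in finished]
    return done

# "short" finishes at step 2 and frees its slot, so "late" starts at step 3
# while "long" is still running; a static batch would have made "late" wait
# until the entire first batch drained.
order = run([("long", 6), ("short", 2), ("late", 3)])
print(order)   # [('short', 2), ('late', 5), ('long', 6)]
```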
vLLM supports tensor parallelism and pipeline parallelism for distributed inference across multiple GPUs, alongside features such as quantization (e.g. GPTQ, AWQ, FP8), speculative decoding, and multi-LoRA serving.4)
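Both parallelism modes are exposed as server flags (`--tensor-parallel-size` and `--pipeline-parallel-size` are real vLLM options; the model name below is illustrative):

```shell
# Split each layer's weights across 4 GPUs (tensor parallelism) and
# the model's layers across 2 stages (pipeline parallelism): 8 GPUs total.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 2
```

Tensor parallelism is typically preferred within a node with fast GPU interconnect, while pipeline parallelism helps scale across nodes.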
2026 benchmarks on H100 GPUs compare vLLM against newer engines:5)
| Engine | Throughput (Llama 3.1 8B, H100) | Notes |
|---|---|---|
| vLLM | ~12,500 tok/s | Mature ecosystem; strong multi-turn |
| SGLang | ~16,200 tok/s | 29% faster in some workloads |
| LMDeploy | ~16,100 tok/s | Best for quantized models |
| TensorRT-LLM | Highest raw throughput | Higher memory and setup cost |
While newer engines show throughput advantages, vLLM remains widely adopted for its maturity, ecosystem, and production reliability.
vLLM provides an OpenAI-compatible API server supporting the standard completions, chat completions, and embeddings endpoints, so existing OpenAI client code can be pointed at a local vLLM deployment with only a base-URL change.6)
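Because vLLM speaks the OpenAI chat-completions schema, any OpenAI-style client works against it. A stdlib-only sketch that builds such a request (the endpoint path and payload follow the OpenAI API; the host, port, and model name are assumptions about a local deployment):

```python
import json
import urllib.request

def chat_request(prompt, model="meta-llama/Llama-3.1-8B-Instruct",
                 base_url="http://localhost:8000/v1"):
    """Build an OpenAI-style chat-completions request for a vLLM server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request("What is PagedAttention?")
print(req.full_url)   # http://localhost:8000/v1/chat/completions
# urllib.request.urlopen(req) would return an OpenAI-style JSON response
# once `vllm serve` is running locally.
```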