vLLM

vLLM is a high-throughput, memory-efficient open-source inference and serving engine for large language models, developed by UC Berkeley's Sky Computing Lab. It features innovations like PagedAttention for KV cache management and continuous batching for dynamic request handling.1)

PagedAttention

PagedAttention treats GPU memory like an operating system's virtual memory, breaking the KV cache into small, reusable non-contiguous pages instead of large contiguous per-request allocations.2) This approach nearly eliminates memory fragmentation and allows KV cache blocks to be shared across requests, for example when generating several samples from the same prompt.

The mechanism separately optimizes prefill (initial prompt processing) and decode (iterative token generation) phases, with features like prefix caching and chunked prefill for long sequences.
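The paging idea can be sketched in pure Python. This is a simplified, hypothetical model (the class and variable names are illustrative, not vLLM's): sequences map logical token positions to fixed-size physical blocks drawn from a shared free pool, and release them on completion.

```python
# Simplified sketch of PagedAttention-style KV cache paging (illustrative only;
# real vLLM manages fixed-size GPU blocks with reference counting).
BLOCK_SIZE = 4  # tokens per KV block; vLLM uses larger blocks in practice

class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared free pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()

    def free_block(self, block_id):
        self.free.append(block_id)

class Sequence:
    """Maps a sequence's logical token positions to physical blocks."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:
            # Current block is full (or this is the first token):
            # grab a new physical block on demand, wherever it lives.
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def release(self):
        # Return every block to the pool as soon as the request finishes.
        for block_id in self.block_table:
            self.allocator.free_block(block_id)
        self.block_table = []

allocator = BlockAllocator(num_blocks=8)
seq = Sequence(allocator)
for _ in range(6):               # 6 tokens -> ceil(6/4) = 2 blocks
    seq.append_token()
print(len(seq.block_table))      # 2
seq.release()
print(len(allocator.free))       # 8
```

Because blocks are allocated on demand rather than reserved up front for the maximum sequence length, memory that a static allocator would waste stays available for other requests.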

Continuous Batching

vLLM uses continuous (dynamic) batching, processing incoming requests as a continuous stream rather than in fixed static batches.3) Finished sequences free their slots immediately, and new requests join the running batch at the next decode iteration instead of waiting for the whole batch to drain, which improves GPU utilization and reduces queueing latency.
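The scheduling loop can be sketched as follows. This is a toy simulation of the admission policy only (no model execution), with hypothetical names; real vLLM's scheduler is considerably more involved.

```python
# Simplified continuous-batching loop (illustrative; not vLLM's actual scheduler).
from collections import deque

def continuous_batching(requests, max_batch=4):
    """requests: list of (request_id, tokens_to_generate) pairs.
    Returns request ids in the order they finish."""
    waiting = deque(requests)
    running = {}          # request_id -> tokens still to generate
    finished = []
    while waiting or running:
        # Admit new requests into free slots at EVERY iteration,
        # instead of waiting for the whole batch to drain.
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        # One decode step: every running sequence emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]       # slot freed immediately
                finished.append(rid)
    return finished

order = continuous_batching([("a", 3), ("b", 1), ("c", 5), ("d", 2), ("e", 2)])
print(order)  # → ['b', 'd', 'a', 'e', 'c']
```

Note how request "e" starts as soon as "b" finishes, mid-flight, rather than after "a" through "d" have all completed; that is the difference from static batching.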

Tensor Parallelism

vLLM supports tensor parallelism and pipeline parallelism for distributed inference across multiple GPUs, alongside features such as quantization (e.g., GPTQ, AWQ, FP8) and speculative decoding.4)
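The core idea of tensor parallelism can be shown with a column-parallel linear layer, sketched here in plain Python (no GPUs; the functions are illustrative stand-ins for sharded matrix multiplies and collective communication).

```python
# Sketch of column-parallel tensor parallelism (illustrative, pure Python).
# A weight matrix W is split column-wise across "GPUs"; each computes a
# partial output, and the shards are concatenated (an all-gather in practice).

def matmul(x, W):
    """x: vector (list), W: matrix as list of rows. Returns x @ W."""
    cols = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(cols)]

def split_columns(W, parts):
    """Split W column-wise into `parts` equal shards."""
    cols = len(W[0])
    step = cols // parts
    return [[row[p * step:(p + 1) * step] for row in W] for p in range(parts)]

x = [1.0, 2.0]
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

shards = split_columns(W, parts=2)           # one shard per "GPU"
partials = [matmul(x, Ws) for Ws in shards]  # each device works independently
y_parallel = partials[0] + partials[1]       # concatenate (all-gather)

assert y_parallel == matmul(x, W)            # matches the unsharded result
print(y_parallel)  # [11.0, 14.0, 17.0, 20.0]
```

Splitting the weights this way lets a model that exceeds one GPU's memory be served across several, at the cost of the communication step that reassembles the output.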

Performance Benchmarks

2026 benchmarks on H100 GPUs show vLLM alongside newer competitors:5)

Engine         Throughput (Llama 3.1 8B, H100)   Notes
vLLM           ~12,500 tok/s                     Mature ecosystem; strong multi-turn
SGLang         ~16,200 tok/s                     29% faster in some workloads
LMDeploy       ~16,100 tok/s                     Best for quantized models
TensorRT-LLM   Highest raw throughput            Higher memory and setup cost

While newer engines show throughput advantages, vLLM remains widely adopted for its maturity, ecosystem, and production reliability.

OpenAI-Compatible API

vLLM provides an OpenAI-compatible API server supporting the Completions and Chat Completions endpoints, so existing OpenAI client code can be pointed at a local server by changing the base URL.6)
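Because the server mirrors the OpenAI REST schema, a request body is identical to one sent to OpenAI; only the base URL changes. A minimal stdlib-only sketch (the host, port, and model name below are placeholder assumptions for a locally served model):

```python
# Sketch of a Chat Completions request body for an OpenAI-compatible server.
import json

BASE_URL = "http://localhost:8000/v1"       # assumed local vLLM server address
ENDPOINT = f"{BASE_URL}/chat/completions"

body = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # whichever model the server loaded
    "messages": [
        {"role": "user", "content": "Explain PagedAttention in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.2,
}

payload = json.dumps(body)
# POST `payload` to ENDPOINT with any HTTP client; the response follows the
# OpenAI chat.completion schema (choices[0].message.content holds the text).
print(ENDPOINT)
```

In practice, the official `openai` client library works unchanged against such a server once its base URL is set to the local endpoint.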

Supported Models

Deployment
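A minimal single-node deployment sketch, assuming a Linux host with CUDA-capable GPUs; the model name and flag values are illustrative and should be adjusted to the available hardware.

```shell
# Install vLLM, then serve a model via the OpenAI-compatible server
# (listens on http://localhost:8000/v1 by default).
pip install vllm

vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 8192
```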

See Also

References