====== vLLM ======

**vLLM** is a high-throughput, memory-efficient open-source inference and serving engine for large language models, originally developed in UC Berkeley's Sky Computing Lab.((source [[https://docs.vllm.ai/en/latest/|vLLM Documentation]])) Its core innovations include PagedAttention for KV-cache management and continuous batching for dynamic request handling.

===== PagedAttention =====

PagedAttention treats GPU memory like an operating system's virtual memory, breaking the KV cache into small, reusable, non-contiguous pages instead of one large contiguous block per request.((source [[https://docs.nvidia.com/deeplearning/frameworks/vllm-release-notes/overview.html|NVIDIA vLLM Overview]])) This approach:

  * Reduces memory waste to near zero through on-demand page allocation
  * Enables sharing of KV-cache pages across requests
  * Supports larger batch sizes on the same hardware
  * Achieves up to 24x higher throughput than serving with stock Hugging Face Transformers

The engine optimizes the prefill phase (initial prompt processing) and the decode phase (iterative token generation) separately, with features such as prefix caching and chunked prefill for long sequences.
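The paging idea can be sketched with a toy allocator: fixed-size KV-cache pages are handed out on demand as a sequence grows and returned to a free pool when it finishes, so each sequence wastes at most one partially filled page. This is an illustrative simplification, not vLLM's actual block manager; ''PagedKVCache'', ''BLOCK_SIZE'', and the request ids are invented for the example.

```python
# Toy sketch of PagedAttention-style KV-cache page allocation.
# (Illustrative only; vLLM's real block manager is far more involved.)

BLOCK_SIZE = 16  # tokens per KV-cache page


class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # physical page pool
        self.tables = {}  # request id -> list of physical page ids

    def append_token(self, req_id, token_idx):
        """Allocate a new page only when a request crosses a page boundary."""
        table = self.tables.setdefault(req_id, [])
        if token_idx % BLOCK_SIZE == 0:       # page boundary: grab a free page
            table.append(self.free_blocks.pop())
        return table[-1]                      # page holding this token's KV entry

    def release(self, req_id):
        """Return all of a finished request's pages to the free pool."""
        self.free_blocks.extend(self.tables.pop(req_id))


cache = PagedKVCache(num_blocks=64)
for t in range(40):                 # a 40-token sequence
    cache.append_token("req-0", t)
print(len(cache.tables["req-0"]))   # 40 tokens fit in ceil(40/16) = 3 pages
cache.release("req-0")
print(len(cache.free_blocks))       # all 64 pages are free again
```

Because pages are allocated lazily, internal fragmentation is bounded by one page per sequence, instead of the worst-case contiguous preallocation for the maximum possible output length.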
===== Continuous Batching =====

vLLM uses continuous (dynamic) batching, processing incoming requests as a continuous stream rather than in static batches.((source [[https://petronellatech.com/blog/vllm-the-lightweight-engine-powering-faster-cheaper-large-language-models/|vLLM Overview - PetronellaTech]])) This:

  * Maximizes GPU utilization by adding and removing sequences mid-execution
  * Handles heterogeneous requests with widely varying prompt and output lengths
  * Supports streaming outputs and multi-LoRA serving
  * Reduces latency under real-world bursty traffic patterns

===== Tensor Parallelism =====

vLLM supports tensor parallelism and pipeline parallelism for distributed inference across multiple GPUs, alongside:((source [[https://docs.vllm.ai/en/latest/|vLLM Documentation]]))

  * Optimized CUDA/HIP graphs for reduced kernel-launch overhead
  * Custom CUDA kernels integrating FlashAttention and FlashInfer
  * Quantization support (AWQ, GPTQ, INT4/INT8/FP8)
  * Speculative decoding for further throughput gains
  * Dual-batch overlap for concurrent prefill and decode

===== Performance Benchmarks =====

2026 benchmarks on H100 GPUs compare vLLM with newer engines:((source [[https://blog.premai.io/vllm-vs-sglang-vs-lmdeploy-fastest-llm-inference-engine-in-2026/|Fastest LLM Inference Engine 2026 - PremAI]]))

^ Engine ^ Throughput (Llama 3.1 8B, H100) ^ Notes ^
| vLLM | ~12,500 tok/s | Mature ecosystem; strong multi-turn performance |
| SGLang | ~16,200 tok/s | Up to 29% faster in some workloads |
| LMDeploy | ~16,100 tok/s | Best for quantized models |
| TensorRT-LLM | Highest raw throughput | Higher memory and setup cost |

While newer engines show throughput advantages, vLLM remains widely adopted for its maturity, ecosystem, and production reliability.
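Continuous batching can be illustrated with a small scheduler simulation: between decode steps, finished sequences leave the running batch and queued requests take their slots immediately, instead of the whole static batch draining first. This is a toy model, not vLLM's scheduler; ''run'', ''MAX_BATCH'', and the request tuples are invented for illustration.

```python
# Toy continuous-batching loop: slots are refilled between decode steps.
# (Illustrative only; not vLLM's actual scheduler.)
from collections import deque

MAX_BATCH = 4  # maximum sequences decoded per step


def run(requests):
    """requests: list of (id, tokens_to_generate). Returns batch ids per step."""
    queue = deque(requests)
    running = {}   # id -> tokens still to generate
    trace = []
    while queue or running:
        # Admit queued requests whenever a slot is free (continuous batching).
        while queue and len(running) < MAX_BATCH:
            rid, n = queue.popleft()
            running[rid] = n
        trace.append(sorted(running))
        # One decode step: every running sequence emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]   # finished sequence frees its slot at once
    return trace


trace = run([("a", 2), ("b", 5), ("c", 1), ("d", 3), ("e", 2)])
# "e" joins the batch as soon as "c" finishes, without waiting
# for "a", "b", and "d" to drain as a static batch would.
print(trace[0])  # ['a', 'b', 'c', 'd']
print(trace[1])  # ['a', 'b', 'd', 'e']
```

With static batching, "e" would have waited for the longest sequence ("b") to finish; here it starts after a single decode step, which is why continuous batching helps under bursty, variable-length traffic.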
===== OpenAI-Compatible API =====

vLLM provides an OpenAI-compatible API server supporting:((source [[https://docs.vllm.ai/en/latest/|vLLM Documentation]]))

  * Chat completions and text completions endpoints
  * Parallel sampling and beam search
  * Token streaming for real-time responses
  * High-throughput batch serving
  * Drop-in replacement for OpenAI API clients

===== Supported Models =====

  * Hugging Face Transformers models (LLaMA, Mistral, Falcon, GPT-NeoX, and instruction-tuned variants)
  * Quantized formats (AWQ, GPTQ, INT4/INT8/FP8)
  * Vision encoders (ViT) for multimodal models
  * Multi-LoRA serving for concurrent adapter inference
  * Prefix caching for shared prompt prefixes

===== Deployment =====

  * **Hardware** -- NVIDIA GPUs, AMD GPUs (ROCm), Intel/AMD/PowerPC CPUs, Gaudi accelerators, TPUs, AWS Trainium/Inferentia((source [[https://docs.vllm.ai/en/latest/|vLLM Documentation]]))
  * **API** -- simple Python API for programmatic use
  * **Docker** -- containerized deployment for production environments
  * **Kubernetes** -- scalable orchestration for multi-replica serving

===== See Also =====

  * [[text_generation_inference|Text Generation Inference]]
  * [[llama_cpp|llama.cpp]]
  * [[ollama|Ollama]]
  * [[hugging_face|Hugging Face]]

===== References =====