LLM inference optimization addresses the critical challenge of serving large language models efficiently in production. Raw Transformer inference is bottlenecked by memory bandwidth (KV cache), compute (attention), and scheduling inefficiency (variable-length requests). Modern techniques achieve 2-10x throughput improvements through quantization, memory management, batching strategies, and speculative decoding.
vLLM is the leading open-source LLM serving framework, built around PagedAttention. TGI (Text Generation Inference), Hugging Face's serving framework, offers a similar feature set, including continuous batching and quantized inference.
PagedAttention manages the KV cache like virtual memory in operating systems: the cache is divided into fixed-size blocks, and a per-sequence block table maps logical token positions to physical blocks that need not be contiguous. Benefits include near-zero memory fragmentation (versus the substantial waste of contiguous preallocation) and copy-on-write sharing of KV blocks across sequences, e.g. for parallel sampling from a shared prompt.
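The block-table idea can be sketched in a few lines. This is an illustrative toy, not vLLM's internals; the block size, class names, and allocator API are all invented here:

```python
# Minimal sketch of PagedAttention-style block allocation (illustrative only).
BLOCK_SIZE = 16  # tokens per KV cache block (assumed value)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}                     # seq_id -> list of physical blocks

    def append_token(self, seq_id, pos):
        # A new physical block is allocated only when a sequence crosses a
        # block boundary, so waste is bounded by one partially filled block.
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:
            table.append(self.free.pop())
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE  # (block, offset)

    def free_seq(self, seq_id):
        # Finished sequences return their blocks to the pool immediately.
        self.free.extend(self.tables.pop(seq_id))

alloc = BlockAllocator(num_blocks=64)
for pos in range(40):                  # a 40-token sequence
    block, off = alloc.append_token("req-0", pos)
print(len(alloc.tables["req-0"]))      # 3 blocks for 40 tokens (ceil(40/16))
alloc.free_seq("req-0")
```

Because sequences hold only whole blocks rather than a contiguous max-length slab, memory that naive preallocation would reserve up front stays in the free pool for other requests.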
Unlike static batching (pad all sequences to max length and wait for the entire batch to finish), continuous batching schedules at the granularity of individual decode iterations: new requests join the running batch as soon as capacity frees up, and finished sequences leave immediately, keeping the GPU saturated under variable-length workloads.
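A toy simulation contrasts the two scheduling policies. The request lengths, batch size, and one-token-per-step cost model are invented for illustration:

```python
from collections import deque

# Toy comparison of static vs. continuous (iteration-level) batching.
# "Length" = number of decode steps a request needs; one iteration advances
# every running request by one token.
def static_batching(lengths, batch_size):
    steps, queue = 0, deque(lengths)
    while queue:
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        steps += max(batch)  # the whole batch waits for its longest request
    return steps

def continuous_batching(lengths, batch_size):
    steps, queue, running = 0, deque(lengths), []
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())   # new requests join mid-flight
        steps += 1                            # one decode iteration for the batch
        running = [r - 1 for r in running]
        running = [r for r in running if r > 0]  # finished requests leave
    return steps

lengths = [8, 2, 2, 2, 8, 2, 2, 2]
print(static_batching(lengths, 4), continuous_batching(lengths, 4))  # 16 10
```

With a batch size of 4, static batching pays for the longest request in each batch (8 + 8 = 16 steps), while continuous batching backfills the slots freed by short requests and finishes in 10.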
Quantization reduces model weight precision to decrease memory footprint and increase throughput:
| Method | Bits | Type | Quality | Speed Gain | Notes |
|---|---|---|---|---|---|
| FP16/BF16 | 16 | Baseline | Full | 1x | Standard training precision |
| GPTQ | 4 | Post-training (PTQ) | ~99.5% of FP16 | 2-3x | Second-order weight quantization, GPU-optimized |
| AWQ | 4 | Post-training (PTQ) | ~99.5% of FP16 | 2-3x | Activation-aware, preserves salient channels |
| GGUF | 2-8 | Post-training | Varies | 2-4x | llama.cpp format, CPU+GPU inference |
| bitsandbytes | 4/8 | Dynamic | Good | 1.5-2x | Easy integration, NF4 data type |
GPTQ uses second-order information (Hessian) to minimize quantization error per layer. It quantizes weights to 4-bit integers with group-wise scaling factors, achieving near-lossless quality with 75% memory reduction.
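The group-wise storage scheme can be shown with plain round-to-nearest quantization. This sketch covers only the scaling arithmetic, not GPTQ's Hessian-based error compensation; the group size of 128 matches common GPTQ configurations:

```python
import numpy as np

# Round-to-nearest 4-bit quantization with one scale per group of 128 weights.
# GPTQ stores weights this way; its Hessian-based reordering/compensation
# (omitted here) is what recovers the last fraction of quality.
def quantize_int4(w, group_size=128):
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7  # int4 range is [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, scale = quantize_int4(w)
err = np.abs(dequantize(q, scale) - w).max()
print(f"max abs error: {err:.4f}")  # bounded by half a scale step per group
```

Storage cost is 4 bits per weight plus one scale per 128 weights, i.e. roughly 4.1 bits per weight, which is where the ~75% memory reduction versus FP16 comes from.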
AWQ identifies that a small fraction of weight channels disproportionately affect output (corresponding to activation outliers) and protects them during quantization, yielding slightly better quality than naive approaches.
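The channel-selection step can be sketched as follows. This is a simplification: real AWQ searches a per-channel scale that minimizes quantization error, whereas the fixed scale, function name, and 1% threshold here are assumptions for illustration:

```python
import numpy as np

# Simplified AWQ-style salient-channel selection: scale up the ~1% of input
# channels with the largest mean activation magnitude before quantization,
# folding the inverse scale into the activations so outputs are unchanged.
def awq_scales(activations, top_frac=0.01, s=2.0):
    importance = np.abs(activations).mean(axis=0)  # per-channel magnitude
    k = max(1, int(top_frac * importance.size))
    salient = np.argsort(importance)[-k:]          # indices of top-k channels
    scales = np.ones_like(importance)
    scales[salient] = s   # real AWQ searches s per channel; fixed here
    return scales         # quantize W * scales, feed the model x / scales

rng = np.random.default_rng(1)
acts = rng.standard_normal((256, 512))
acts[:, 7] *= 50                 # inject an activation-outlier channel
scales = awq_scales(acts)
print(scales[7])                 # → 2.0: the outlier channel is protected
```

Scaling a salient channel up before quantization shrinks its relative rounding error, which is exactly the protection the paragraph above describes.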
GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp. It supports mixed quantization levels (e.g., important layers at higher precision) and enables inference on CPUs, Apple Silicon, and GPUs.
Speculative decoding uses a small draft model to generate candidate tokens, which the larger target model verifies in parallel in a single forward pass: tokens are accepted left to right while they agree with the target, the first rejected token is replaced by the target's own token, and the loop repeats, so every target pass yields at least one token.
Typical results: 2-3x decode speedup when draft and target models are well-matched, with mathematically guaranteed identical output distribution.
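The control flow can be shown with a greedy-verification variant (the sampling-based accept/reject rule is what gives the distributional guarantee; with greedy decoding, exact output equality is easy to check). Both "models" below are deterministic toy functions, not real networks:

```python
# Toy speculative decoding with greedy verification: the draft proposes k
# tokens, the target checks them (one target pass per round), and the first
# mismatch is replaced by the target's own token. Output provably matches
# running the target alone under greedy decoding.
def target_next(ctx):   # stand-in "large" model (argmax decoding)
    return (sum(ctx) * 31 + 7) % 50

def draft_next(ctx):    # cheaper draft that usually agrees with the target
    return (sum(ctx) * 31 + 7) % 50 if sum(ctx) % 3 else (sum(ctx) + 1) % 50

def speculative_decode(ctx, n_tokens, k=4):
    out, calls = list(ctx), 0
    while len(out) - len(ctx) < n_tokens:
        draft = []
        for _ in range(k):                     # draft proposes k tokens
            draft.append(draft_next(out + draft))
        calls += 1                             # one target pass verifies all k
        accepted = []
        for t in draft:
            expect = target_next(out + accepted)
            if t == expect:
                accepted.append(t)             # draft token accepted
            else:
                accepted.append(expect)        # correct first mismatch, stop
                break
        out.extend(accepted)
    return out[len(ctx):len(ctx) + n_tokens], calls

tokens, calls = speculative_decode([1, 2, 3], n_tokens=12)
print(len(tokens), calls)  # 12 tokens in fewer than 12 target passes
```

The speedup comes entirely from the acceptance rate: when the draft usually agrees with the target, several tokens are committed per expensive target pass.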
Beyond PagedAttention, several techniques reduce KV cache memory: grouped-query attention (GQA) and multi-query attention (MQA) shrink the number of KV heads, KV cache quantization stores keys and values in FP8 or INT8, sliding-window attention bounds how much context is cached, and prefix caching shares the KV blocks of common prompt prefixes across requests.
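The impact of shrinking KV heads is simple arithmetic. The figures below use Llama 2 70B's published shape (80 layers, 64 query heads, 8 KV heads via GQA, head dimension 128); the function name is ours:

```python
# KV cache size per token per sequence:
#   2 (K and V) * layers * kv_heads * head_dim * bytes_per_element
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

mha = kv_bytes_per_token(80, 64, 128)  # hypothetical full multi-head KV
gqa = kv_bytes_per_token(80, 8, 128)   # Llama 2 70B's actual GQA config
print(f"MHA: {mha / 2**20:.2f} MiB/token, GQA: {gqa / 2**20:.2f} MiB/token")
print(f"4096-token sequence with GQA: {gqa * 4096 / 2**30:.2f} GiB")
```

Cutting 64 KV heads to 8 is an 8x reduction in cache size (2.5 MiB to ~0.31 MiB per token in FP16), which directly translates into larger feasible batch sizes and therefore higher throughput.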
Typical throughput comparisons (Llama 2 70B, A100 80GB):
| Configuration | Tokens/sec | Relative |
|---|---|---|
| HuggingFace naive | ~30 | 1x |
| vLLM FP16 | ~120 | 4x |
| vLLM + AWQ INT4 | ~250 | 8x |
| vLLM + AWQ + speculative | ~400 | 13x |
| TGI FP16 | ~100 | 3.3x |
Note: benchmarks vary significantly by hardware, model, sequence length, and batch size.
```python
# Serving a quantized model with vLLM
from vllm import LLM, SamplingParams

# Load a GPTQ-quantized model with PagedAttention
llm = LLM(
    model="TheBloke/Llama-2-70B-Chat-GPTQ",
    quantization="gptq",
    tensor_parallel_size=2,       # 2 GPUs
    max_model_len=4096,
    gpu_memory_utilization=0.90,  # Use 90% of GPU memory for KV cache
)

# Configure sampling
params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

# Batch inference with continuous batching handled internally
prompts = [
    "Explain quantum computing in simple terms.",
    "Write a Python function to sort a linked list.",
    "What are the key differences between TCP and UDP?",
]
outputs = llm.generate(prompts, params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Output: {output.outputs[0].text[:100]}...")
    print(f"Tokens generated: {len(output.outputs[0].token_ids)}")
```