Inference Optimization

LLM inference optimization addresses the critical challenge of serving large language models efficiently in production. Raw Transformer inference is bottlenecked by memory bandwidth (KV cache), compute (attention), and scheduling inefficiency (variable-length requests). Modern techniques achieve 2-10x throughput improvements through quantization, memory management, batching strategies, and speculative decoding.

Inference Pipeline

graph LR Request["Incoming Requests"] --> Queue["Request Queue"] Queue --> CB["Continuous Batching Scheduler"] CB --> Prefill["Prefill Phase (prompt processing)"] Prefill --> Decode["Decode Phase (token generation)"] Decode --> KV["KV Cache (PagedAttention)"] KV --> Decode Decode --> Detok["Detokenize"] Detok --> Stream["Stream Response"] subgraph Optimizations Quant["Quantized Weights (INT4/INT8)"] Flash["Flash Attention Kernels"] Spec["Speculative Decoding"] end Quant --> Decode Flash --> Decode Spec --> Decode

Serving Frameworks

vLLM is the leading open-source LLM serving framework, built around PagedAttention:

Continuous batching with preemption support
PagedAttention for efficient KV cache management
Tensor parallelism and pipeline parallelism for multi-GPU
Support for GPTQ, AWQ, and other quantization formats
Chunked prefill to overlap prefill and decode phases
Typical throughput: 2-4x over naive HuggingFace serving

TGI (Text Generation Inference) by Hugging Face:

Production-ready with gRPC and HTTP APIs
Flash Attention and Paged Attention integration
Quantization support (GPTQ, bitsandbytes)
Token streaming and watermarking
Optimized for Hugging Face model hub integration

PagedAttention

PagedAttention manages the KV cache like virtual memory in operating systems:

The KV cache is divided into fixed-size blocks (pages) rather than one contiguous tensor
A page table maps logical KV positions to physical GPU memory locations
Blocks are allocated on demand and freed when sequences complete
Multiple sequences can share KV blocks (e.g., for shared prompt prefixes via copy-on-write)

Benefits:

Reduces memory fragmentation by up to 90% for variable-length sequences
Enables larger batch sizes (more concurrent requests)
Supports efficient beam search through block sharing
Dynamic allocation eliminates the need to pre-allocate maximum-length KV buffers

Continuous Batching

Unlike static batching (pad all sequences to max length, wait for all to finish), continuous batching:

Inserts new requests into the batch as soon as a slot opens
Removes completed sequences immediately without blocking others
Avoids padding waste for variable-length sequences
Achieves 2-4x throughput over static batching by maximizing GPU utilization

Quantization

Quantization reduces model weight precision to decrease memory footprint and increase throughput:

Method	Bits	Type	Quality	Speed Gain	Notes
FP16/BF16	16	Baseline	Full	1x	Standard training precision
GPTQ	4	Post-training (PTQ)	~99.5% of FP16	2-3x	Second-order weight quantization, GPU-optimized
AWQ	4	Post-training (PTQ)	~99.5% of FP16	2-3x	Activation-aware, preserves salient channels
GGUF	2-8	Post-training	Varies	2-4x	llama.cpp format, CPU+GPU inference
bitsandbytes	4/8	Dynamic	Good	1.5-2x	Easy integration, NF4 data type

GPTQ uses second-order information (Hessian) to minimize quantization error per layer. It quantizes weights to 4-bit integers with group-wise scaling factors, achieving near-lossless quality with 75% memory reduction.

AWQ identifies that a small fraction of weight channels disproportionately affect output (corresponding to activation outliers) and protects them during quantization, yielding slightly better quality than naive approaches.

GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp. It supports mixed quantization levels (e.g., important layers at higher precision) and enables inference on CPUs, Apple Silicon, and GPUs.

Speculative Decoding

Speculative decoding uses a small draft model to generate candidate tokens, which the larger target model verifies in parallel:

The draft model (e.g., 1B params) generates $k$ candidate tokens autoregressively (fast)
The target model (e.g., 70B params) scores all $k$ candidates in a single forward pass
Accepted tokens are kept; the first rejected token is resampled from the target distribution
Expected speedup: $1/(1-\alpha)$ where $\alpha$ is the acceptance rate

Typical results: 2-3x decode speedup when draft and target models are well-matched, with mathematically guaranteed identical output distribution.

KV Cache Optimization

Beyond PagedAttention, several techniques reduce KV cache memory:

Multi-Query Attention / Grouped-Query Attention: Reduce KV heads from $h$ to 1 or $g$, shrinking cache proportionally
KV cache quantization: Quantize cached keys/values to INT8 or INT4 with minimal quality loss
Sliding window attention: Only cache the last $w$ tokens (used in Mistral)
Token eviction: Selectively drop low-attention KV entries based on attention score statistics

Benchmarks

Typical throughput comparisons (Llama 2 70B, A100 80GB):

Configuration	Tokens/sec	Relative
HuggingFace naive	~30	1x
vLLM FP16	~120	4x
vLLM + AWQ INT4	~250	8x
vLLM + AWQ + speculative	~400	13x
TGI FP16	~100	3.3x

Note: benchmarks vary significantly by hardware, model, sequence length, and batch size.

Code Example

# Serving a quantized model with vLLM
from vllm import LLM, SamplingParams
 
# Load a GPTQ-quantized model with PagedAttention
llm = LLM(
    model="TheBloke/Llama-2-70B-Chat-GPTQ",
    quantization="gptq",
    tensor_parallel_size=2,       # 2 GPUs
    max_model_len=4096,
    gpu_memory_utilization=0.90,  # Use 90% of GPU memory for KV cache
)
 
# Configure sampling
params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)
 
# Batch inference with continuous batching handled internally
prompts = [
    "Explain quantum computing in simple terms.",
    "Write a Python function to sort a linked list.",
    "What are the key differences between TCP and UDP?",
]
outputs = llm.generate(prompts, params)
 
for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Output: {output.outputs[0].text[:100]}...")
    print(f"Tokens generated: {len(output.outputs[0].token_ids)}")

AI Agent Knowledge Base

Sidebar

Table of Contents

Inference Optimization

Inference Pipeline

Serving Frameworks

PagedAttention

Continuous Batching

Quantization

Speculative Decoding

KV Cache Optimization

Benchmarks

Code Example

References

See Also

AI Agent Knowledge Base

User Tools

Site Tools

Sidebar

Table of Contents

Inference Optimization

Inference Pipeline

Serving Frameworks

PagedAttention

Continuous Batching

Quantization

Speculative Decoding

KV Cache Optimization

Benchmarks

Code Example

References

See Also

Page Tools