====== Inference Optimization ======

LLM inference optimization addresses the critical challenge of serving large language models efficiently in production. Raw Transformer inference is bottlenecked by memory bandwidth (the KV cache), compute (attention), and scheduling inefficiency (variable-length requests). Modern techniques achieve 2-10x throughput improvements through quantization, memory management, batching strategies, and speculative decoding.

===== Inference Pipeline =====

<code>
graph LR
    Request["Incoming Requests"] --> Queue["Request Queue"]
    Queue --> CB["Continuous Batching Scheduler"]
    CB --> Prefill["Prefill Phase (prompt processing)"]
    Prefill --> Decode["Decode Phase (token generation)"]
    Decode --> KV["KV Cache (PagedAttention)"]
    KV --> Decode
    Decode --> Detok["Detokenize"]
    Detok --> Stream["Stream Response"]
    subgraph Optimizations
        Quant["Quantized Weights (INT4/INT8)"]
        Flash["Flash Attention Kernels"]
        Spec["Speculative Decoding"]
    end
    Quant --> Decode
    Flash --> Decode
    Spec --> Decode
</code>

===== Serving Frameworks =====

**vLLM** is the leading open-source LLM serving framework, built around PagedAttention:

  * Continuous batching with preemption support
  * PagedAttention for efficient KV cache management
  * Tensor parallelism and pipeline parallelism for multi-GPU deployment
  * Support for GPTQ, AWQ, and other quantization formats
  * Chunked prefill to overlap prefill and decode phases
  * Typical throughput: 2-4x over naive HuggingFace serving

**TGI (Text Generation Inference)** by Hugging Face:

  * Production-ready with gRPC and HTTP APIs
  * Flash Attention and Paged Attention integration
  * Quantization support (GPTQ, bitsandbytes)
  * Token streaming and watermarking
  * Optimized for Hugging Face model hub integration

===== PagedAttention =====

PagedAttention manages the KV cache like virtual memory in an operating system:

  * The KV cache is divided into fixed-size **blocks** (pages) rather than one contiguous tensor
  * A **page table** maps logical KV positions to physical GPU memory 
locations
  * Blocks are allocated on demand and freed when sequences complete
  * Multiple sequences can share KV blocks (e.g., for shared prompt prefixes via copy-on-write)

Benefits:

  * Reduces memory fragmentation by up to **90%** for variable-length sequences
  * Enables larger batch sizes (more concurrent requests)
  * Supports efficient beam search through block sharing
  * Dynamic allocation eliminates the need to pre-allocate maximum-length KV buffers

===== Continuous Batching =====

Unlike static batching (pad all sequences to the maximum length and wait for all to finish), continuous batching:

  * Inserts new requests into the batch as soon as a slot opens
  * Removes completed sequences immediately without blocking others
  * Avoids padding waste for variable-length sequences
  * Achieves **2-4x throughput** over static batching by maximizing GPU utilization

===== Quantization =====

Quantization reduces model weight precision to decrease memory footprint and increase throughput:

^ Method ^ Bits ^ Type ^ Quality ^ Speed Gain ^ Notes ^
| FP16/BF16 | 16 | Baseline | Full | 1x | Standard training precision |
| GPTQ | 4 | Post-training (PTQ) | ~99.5% of FP16 | 2-3x | Second-order weight quantization, GPU-optimized |
| AWQ | 4 | Post-training (PTQ) | ~99.5% of FP16 | 2-3x | Activation-aware, preserves salient channels |
| GGUF | 2-8 | Post-training | Varies | 2-4x | llama.cpp format, CPU+GPU inference |
| bitsandbytes | 4/8 | Dynamic | Good | 1.5-2x | Easy integration, NF4 data type |

**GPTQ** uses second-order information (a Hessian approximation) to minimize quantization error per layer. It quantizes weights to 4-bit integers with group-wise scaling factors, achieving near-lossless quality with roughly 75% memory reduction relative to FP16.

**AWQ** observes that a small fraction of weight channels disproportionately affects the output (corresponding to activation outliers) and protects those channels during quantization, yielding slightly better quality than naive rounding. 
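Both GPTQ and AWQ build on group-wise scaled integer quantization. The following is a minimal NumPy sketch of just that underlying idea, plain symmetric round-to-nearest with one scale per group of 128 weights; the real methods add error-minimizing rounding (GPTQ) or activation-aware channel scaling (AWQ), and the function names here are illustrative:

<code python>
import numpy as np

def quantize_groupwise(w, group_size=128, bits=4):
    """Symmetric round-to-nearest quantization with one FP scale per group."""
    qmax = 2 ** (bits - 1) - 1                     # 7 for signed 4-bit
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)       # stand-in for a weight row
q, scales = quantize_groupwise(w)
w_hat = dequantize(q, scales)
max_err = np.abs(w - w_hat).max()                  # bounded by half a scale step
</code>

Storing the 4-bit codes in ''q'' plus one scale per 128-weight group is what yields the ~75% memory reduction quoted above.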
**GGUF** (GPT-Generated Unified Format) is the file format used by llama.cpp. It supports mixed quantization levels (e.g., important layers kept at higher precision) and enables inference on CPUs, Apple Silicon, and GPUs.

===== Speculative Decoding =====

Speculative decoding uses a small **draft model** to generate candidate tokens, which the larger **target model** verifies in parallel:

  - The draft model (e.g., 1B params) generates $k$ candidate tokens autoregressively (fast)
  - The target model (e.g., 70B params) scores all $k$ candidates in a single forward pass
  - Accepted tokens are kept; the first rejected token is resampled from a corrected target distribution
  - Expected tokens per target forward pass: $(1-\alpha^{k+1})/(1-\alpha)$, where $\alpha$ is the per-token acceptance rate; this approaches $1/(1-\alpha)$ for large $k$

Typical results: **2-3x decode speedup** when draft and target models are well-matched, with a mathematical guarantee that the output distribution is identical to sampling from the target model alone.

===== KV Cache Optimization =====

Beyond PagedAttention, several techniques reduce KV cache memory:

  * **Multi-Query Attention / Grouped-Query Attention**: Reduce KV heads from $h$ to 1 or $g$, shrinking the cache proportionally
  * **KV cache quantization**: Quantize cached keys/values to INT8 or INT4 with minimal quality loss
  * **Sliding window attention**: Only cache the last $w$ tokens (used in Mistral)
  * **Token eviction**: Selectively drop low-attention KV entries based on attention score statistics

===== Benchmarks =====

Typical throughput comparisons (Llama 2 70B, A100 80GB):

^ Configuration ^ Tokens/sec ^ Relative ^
| HuggingFace naive | ~30 | 1x |
| vLLM FP16 | ~120 | 4x |
| vLLM + AWQ INT4 | ~250 | 8x |
| vLLM + AWQ + speculative | ~400 | 13x |
| TGI FP16 | ~100 | 3.3x |

Note: benchmarks vary significantly by hardware, model, sequence length, and batch size. 
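The acceptance rule that makes speculative decoding exact fits in a few lines. This is a hand-written illustration of the rejection-sampling step from Leviathan et al. for a single draft token, not any framework's API; ''p_target'' and ''q_draft'' are assumed to be full probability vectors over the vocabulary:

<code python>
import numpy as np

def speculative_accept(p_target, q_draft, draft_token, rng):
    """Accept or reject one draft token so that the returned token is
    distributed exactly according to the target distribution p_target."""
    accept_prob = min(1.0, p_target[draft_token] / q_draft[draft_token])
    if rng.random() < accept_prob:
        return draft_token, True
    # On rejection, resample from the residual max(0, p - q), renormalized.
    # This correction is what guarantees the target distribution overall.
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p_target), p=residual), False
</code>

Serving frameworks run this verification batched over all $k$ draft positions; the acceptance rate $\alpha$ (and hence the speedup) depends on how closely the draft model tracks the target.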
===== Code Example =====

<code python>
# Serving a quantized model with vLLM
from vllm import LLM, SamplingParams

# Load a GPTQ-quantized model; vLLM applies PagedAttention and
# continuous batching automatically
llm = LLM(
    model="TheBloke/Llama-2-70B-Chat-GPTQ",
    quantization="gptq",
    tensor_parallel_size=2,       # shard across 2 GPUs
    max_model_len=4096,
    gpu_memory_utilization=0.90,  # fraction of GPU memory the engine may use
                                  # (weights + KV cache)
)

# Configure sampling
params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

# Batch inference; continuous batching is handled internally
prompts = [
    "Explain quantum computing in simple terms.",
    "Write a Python function to sort a linked list.",
    "What are the key differences between TCP and UDP?",
]
outputs = llm.generate(prompts, params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Output: {output.outputs[0].text[:100]}...")
    print(f"Tokens generated: {len(output.outputs[0].token_ids)}")
</code>

===== References =====

  * [[https://arxiv.org/abs/2309.06180|Kwon et al. - Efficient Memory Management for Large Language Model Serving with PagedAttention (2023)]]
  * [[https://arxiv.org/abs/2211.17192|Leviathan et al. - Fast Inference from Transformers via Speculative Decoding (2023)]]
  * [[https://arxiv.org/abs/2210.17323|Frantar et al. - GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (2023)]]
  * [[https://arxiv.org/abs/2306.00978|Lin et al. - AWQ: Activation-aware Weight Quantization (2023)]]
  * [[https://arxiv.org/abs/2205.14135|Dao et al. - FlashAttention (2022)]]

===== See Also =====

  * [[attention_mechanism|Attention Mechanism]]
  * [[transformer_architecture|Transformer Architecture]]
  * [[model_context_window|Model Context Window]]
  * [[tokenization|Tokenization]]