====== Model Context Window ======

The context window defines the maximum number of tokens a language model can process in a single forward pass. It determines how much text the model can "see" at once, encompassing the system prompt, conversation history, and generated output. Extending context windows from 512 tokens (BERT) to over 1 million tokens (Gemini) has been one of the most impactful advances in LLM capabilities, enabled primarily by innovations in positional encoding.

===== How Context Windows Work =====

In a Transformer, every token attends to every other token within the context window. The model has no memory beyond this window -- information outside it is invisible. The context budget must accommodate:

  * **System prompt**: instructions and persona definitions
  * **Input/history**: user messages, conversation history, retrieved documents
  * **Output**: the generated completion tokens

Total token usage: $\text{tokens}_{\text{system}} + \text{tokens}_{\text{input}} + \text{tokens}_{\text{output}} \leq \text{context\_window}$

The computational cost of attention scales as $O(n^2)$ in sequence length $n$, making long context windows expensive. Memory for the KV cache grows as $O(n \cdot L \cdot h \cdot d_k)$, where $L$ is the number of layers, $h$ the number of heads, and $d_k$ the head dimension.

===== Positional Encoding Methods =====

Since Transformers process all tokens in parallel, they need explicit position information. The choice of positional encoding determines how well models generalize to sequence lengths beyond training.

==== Sinusoidal (Original) ====

The original Transformer used fixed sinusoidal functions:

$$PE_{(pos,2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

These are added to token embeddings. While elegant, sinusoidal encodings do not extrapolate well beyond the trained sequence length.
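The formula above can be written out directly. The following is a minimal, dependency-free sketch (the function name ''sinusoidal_pe'' is ours, not from any library):

<code python>
import math

def sinusoidal_pe(seq_len, d_model, base=10000):
    """Build the fixed sinusoidal position-encoding table."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # i steps over even dims, so i/d_model matches the 2i/d exponent
            angle = pos / (base ** (i / d_model))
            pe[pos][i] = math.sin(angle)      # even dimension: sine
            pe[pos][i + 1] = math.cos(angle)  # odd dimension: cosine
    return pe
</code>

At position 0 every sine component is 0 and every cosine is 1; each dimension pair then rotates at a geometrically decreasing frequency, so nearby positions receive similar encodings while distant ones diverge.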
==== RoPE (Rotary Position Embedding) ====

RoPE encodes position by rotating query and key vectors in 2D subspaces:

$$q_m = q \cdot e^{im\theta}, \quad k_n = k \cdot e^{in\theta}$$

where $\theta_j = 10000^{-2j/d}$ defines the rotation frequency for dimension pair $j$. The key property is that the dot product between rotated queries and keys depends only on the relative position $m - n$:

$$q_m^T k_n = \text{Re}[q \cdot \bar{k} \cdot e^{i(m-n)\theta}]$$

In matrix form, each 2D subspace at position $m$ is rotated by

$$R_m = \begin{pmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{pmatrix}$$

RoPE is used by Llama, Mistral, Qwen, and most modern open-weight LLMs. Its advantages include capturing relative positions naturally, requiring no learnable parameters, and being amenable to context extension.

==== ALiBi (Attention with Linear Biases) ====

ALiBi adds a linear bias to attention scores based on the distance between tokens:

$$\text{Attention}_{ij} = q_i^T k_j - m \cdot |i - j|$$

where $m$ is a head-specific slope, set geometrically (e.g., $m_h = 2^{-8h/H}$ for head $h$ of $H$ total). ALiBi requires no positional embeddings at all and extrapolates better than sinusoidal encodings, with zero additional parameters. It is used in BLOOM and MPT.

==== YaRN (Yet another RoPE extensioN) ====

YaRN extends RoPE-based models to longer contexts by partitioning frequency dimensions:

  * **Low-frequency dimensions**: interpolated (scaled down) to fit longer sequences
  * **High-frequency dimensions**: left unchanged (they already capture short-range patterns)
  * **Temperature scaling**: adjusts attention logit magnitudes for the extended length

This approach requires only a small amount of fine-tuning (e.g., 400 steps) to extend a 4K-context model to 64K or 128K tokens, rather than full retraining.
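The ALiBi bias above fits in a few lines of plain Python. This is a minimal sketch (helper names are ours) that omits the softmax and causal mask:

<code python>
def alibi_slopes(n_heads):
    # Geometric head-specific slopes m_h = 2^(-8h/H) for h = 1..H
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_bias(slope, seq_len):
    # bias[i][j] = -slope * |i - j|: added to raw attention scores,
    # penalizing distant key positions linearly with distance
    return [[-slope * abs(i - j) for j in range(seq_len)]
            for i in range(seq_len)]
</code>

With $H = 8$ the slopes are 1/2, 1/4, ..., 1/256, so some heads focus locally (large slope) while others attend broadly (small slope). Because the bias depends only on $|i - j|$, the same matrix rule applies at any sequence length, which is what gives ALiBi its extrapolation behavior.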
===== Long Context Models =====

^ Model ^ Context Window ^ Position Encoding ^ Release ^
| GPT-4 Turbo | 128K tokens | Unknown (proprietary) | 2023 |
| Claude 3.5 Sonnet | 200K tokens | Proprietary | 2024 |
| Gemini 1.5 Pro | 1M tokens | Proprietary | 2024 |
| Gemini 2.0 | 2M tokens | Proprietary | 2025 |
| Llama 3.1 | 128K tokens | RoPE | 2024 |
| Mistral Large | 128K tokens | Sliding Window + RoPE | 2024 |
| Yi-Lightning | 200K tokens | RoPE (extended) | 2024 |

===== Lost-in-the-Middle =====

Research by Liu et al. (2023) demonstrated that LLMs perform significantly worse when relevant information is placed in the middle of the context window than at the beginning or end:

  * Models show a **U-shaped performance curve**: recall is best at the start and end of the context
  * Performance degrades as context length increases, even within the supported window
  * The effect is consistent across models (GPT-3.5, Claude, Llama)
  * Likely cause: attention patterns develop "sinks" at early positions and a recency bias at late positions

**Mitigation strategies**:

  - Place critical information at the beginning or end of prompts
  - Use retrieval-augmented generation to surface relevant passages
  - Fine-tune with varied information placement
  - Apply position interpolation techniques that improve mid-sequence attention

===== Needle-in-a-Haystack Evaluation =====

The needle-in-a-haystack test evaluates long-context retrieval by:

  - Embedding a specific fact (the "needle") at a controlled position within irrelevant text (the "haystack")
  - Varying both the total context length and the needle's relative position (0-100% depth)
  - Measuring whether the model can accurately retrieve the needle fact
  - Producing a 2D heatmap of accuracy over (context length, needle depth)

This evaluation reveals:

  * Models with robust positional encodings show uniformly high accuracy across all positions
  * Weaker models degrade at specific depths (especially the middle) and at longer contexts
  * Gemini 1.5 Pro and Claude 3 demonstrate near-perfect retrieval across their full context windows

===== Code Example =====

<code python>
import math
import torch

def apply_rope(x, seq_len, dim, base=10000):
    # Compute RoPE frequencies for each 2D dimension pair
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, freqs)  # [seq_len, dim/2]
    cos_vals = torch.cos(angles)
    sin_vals = torch.sin(angles)
    # Reshape the input into (real, imaginary) pairs
    x_pairs = x.view(*x.shape[:-1], -1, 2)
    x_real, x_imag = x_pairs[..., 0], x_pairs[..., 1]
    # Apply the 2D rotation to each pair
    out_real = x_real * cos_vals - x_imag * sin_vals
    out_imag = x_real * sin_vals + x_imag * cos_vals
    return torch.stack([out_real, out_imag], dim=-1).flatten(-2)

def yarn_scale_freqs(base_freq, original_len, target_len,
                     beta_fast=32, beta_slow=1):
    # YaRN ("NTK-by-parts"): interpolate low frequencies, preserve high ones.
    # A dimension's treatment depends on how many full rotations it completes
    # within the original context window.
    scale = target_len / original_len
    interpolated = base_freq / scale
    rotations = original_len * base_freq / (2 * math.pi)
    # Ramp from 0 (fully interpolated) to 1 (fully preserved)
    blend = ((rotations - beta_slow) / (beta_fast - beta_slow)).clamp(0, 1)
    return interpolated * (1 - blend) + base_freq * blend
</code>

===== References =====

  * [[https://arxiv.org/abs/2104.09864|Su et al. - RoFormer: Enhanced Transformer with Rotary Position Embedding (2021)]]
  * [[https://arxiv.org/abs/2108.12409|Press et al. - Train Short, Test Long: Attention with Linear Biases (ALiBi, 2022)]]
  * [[https://arxiv.org/abs/2309.00071|Peng et al. - YaRN: Efficient Context Window Extension of Large Language Models (2023)]]
  * [[https://arxiv.org/abs/2307.03172|Liu et al. - Lost in the Middle: How Language Models Use Long Contexts (2023)]]
  * [[https://arxiv.org/abs/2306.15595|Chen et al. - Extending Context Window of Large Language Models via Positional Interpolation (2023)]]

===== See Also =====

  * [[transformer_architecture|Transformer Architecture]]
  * [[attention_mechanism|Attention Mechanism]]
  * [[inference_optimization|Inference Optimization]]
  * [[tokenization|Tokenization]]