Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
The context window defines the maximum number of tokens a language model can process in a single forward pass. It determines how much text the model can “see” at once, encompassing the system prompt, conversation history, and generated output. Extending context windows from 512 tokens (BERT) to over 1 million tokens (Gemini) has been one of the most impactful advances in LLM capabilities, enabled primarily by innovations in positional encoding.
In a Transformer, every token attends to every other token within the context window. The model has no memory beyond this window – information outside it is invisible. The context budget must accommodate:
Total token usage: $\text{tokens}_{\text{system}} + \text{tokens}_{\text{input}} + \text{tokens}_{\text{output}} \leq \text{context\_window}$
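As a concrete illustration of the budget constraint (the token counts below are hypothetical), the check is simple arithmetic:

```python
def fits_context(system_tokens: int, input_tokens: int,
                 output_tokens: int, context_window: int) -> bool:
    """Check whether a request fits the model's context window."""
    return system_tokens + input_tokens + output_tokens <= context_window

# Hypothetical example: 500-token system prompt, 6,000-token input,
# 1,000 tokens reserved for output, against an 8,192-token window.
fits_context(500, 6000, 1000, 8192)  # True: 7,500 <= 8,192
```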
The computational cost of attention scales as $O(n^2)$ in sequence length $n$, making long context windows expensive. Memory for the KV cache grows as $O(n \cdot L \cdot h \cdot d_k)$ where $L$ is layers, $h$ is heads, and $d_k$ is head dimension.
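To see how the KV-cache term grows in practice, here is a rough back-of-the-envelope estimate; the 7B-class model shape below (32 layers, 32 heads, head dimension 128, fp16) is illustrative:

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV-cache size: K and V each store
    n_tokens * n_layers * n_heads * head_dim elements."""
    return 2 * n_tokens * n_layers * n_heads * head_dim * bytes_per_elem

# Illustrative 7B-class shape at a 128K-token context, fp16 (2 bytes/elem)
gb = kv_cache_bytes(128_000, 32, 32, 128, 2) / 1e9
# ~67 GB for a single 128K-token sequence
```

This is why long-context serving leans on techniques like grouped-query attention and cache quantization to shrink the per-token cost.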
Since Transformers process all tokens in parallel, they need explicit position information. The choice of positional encoding determines how well models generalize to sequence lengths beyond training.
The original Transformer used fixed sinusoidal functions:
$$PE_{(pos,2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$
These are added to token embeddings. While elegant, sinusoidal encodings do not extrapolate well beyond the trained sequence length.
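The formulas above translate directly into code; a minimal stdlib-only sketch:

```python
import math

def sinusoidal_pe(seq_len: int, d_model: int) -> list:
    """Sinusoidal positional encodings: sin on even dims, cos on odd dims."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):  # i runs over the 2i indices directly
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)      # PE_(pos, 2i)
            pe[pos][i + 1] = math.cos(angle)  # PE_(pos, 2i+1)
    return pe

pe = sinusoidal_pe(seq_len=4, d_model=8)
# Position 0 encodes as sin(0)=0 and cos(0)=1 in every dimension pair
```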
RoPE encodes position by rotating query and key vectors in 2D subspaces:
$$q_m = q \cdot e^{im\theta}, \quad k_n = k \cdot e^{in\theta}$$
where $\theta_j = 10000^{-2j/d}$ defines the rotation frequency for dimension pair $j$. The key property is that the dot product between rotated queries and keys depends only on relative position:
$$q_m^T k_n = \text{Re}[q \cdot \bar{k} \cdot e^{i(m-n)\theta}]$$
In matrix form, for each 2D subspace at position $m$:
$$R_m = \begin{pmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{pmatrix}$$
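The relative-position property can be verified numerically: rotating $q$ by $R_m$ and $k$ by $R_n$ yields a dot product that depends only on the offset $m - n$. A minimal single-subspace sketch (arbitrary example vectors):

```python
import math

def rotate(v, m, theta):
    """Apply the 2x2 RoPE rotation R_m to a 2D vector v."""
    c, s = math.cos(m * theta), math.sin(m * theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k, theta = (0.3, -1.2), (0.7, 0.5), 0.01
# Position pairs (5, 2) and (103, 100) share the same offset m - n = 3 ...
d1 = dot(rotate(q, 5, theta), rotate(k, 2, theta))
d2 = dot(rotate(q, 103, theta), rotate(k, 100, theta))
# ... so the attention scores agree to floating-point precision
assert abs(d1 - d2) < 1e-9
```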
RoPE is used by Llama, Mistral, Qwen, and most modern open-weight LLMs. Its advantages include capturing relative positions naturally, no learnable parameters, and amenability to context extension.
ALiBi adds a linear bias to attention scores based on distance between tokens:
$$\text{Attention}_{ij} = q_i^T k_j - m \cdot |i - j|$$
where $m$ is a head-specific slope (set geometrically, e.g., $m_h = 2^{-8h/H}$ for head $h$ of $H$ total). ALiBi requires no positional embeddings at all and extrapolates to longer sequences better than sinusoidal encodings, with zero additional parameters. It is used in BLOOM and MPT.
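The bias matrix and the geometric slope schedule are a few lines of code (stdlib-only sketch):

```python
def alibi_slopes(n_heads: int) -> list:
    """Geometric head slopes m_h = 2^(-8h/H) for h = 1..H."""
    return [2 ** (-8 * h / n_heads) for h in range(1, n_heads + 1)]

def alibi_bias(slope: float, seq_len: int) -> list:
    """Bias matrix -slope * |i - j|, added to q_i . k_j before softmax."""
    return [[-slope * abs(i - j) for j in range(seq_len)]
            for i in range(seq_len)]

slopes = alibi_slopes(8)          # [0.5, 0.25, ..., 2^-8]
bias = alibi_bias(slopes[0], 4)
# First row: 0, -0.5, -1.0, -1.5 (penalty grows linearly with distance)
```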
YaRN extends RoPE-based models to longer contexts by partitioning the frequency dimensions: high-frequency dimensions, which complete many rotations within the original context, are preserved to keep local positional detail; low-frequency dimensions are interpolated to span the longer range; and dimensions in between are blended smoothly. This approach requires only a small amount of fine-tuning (e.g., 400 steps) to extend a 4K-context model to 64K or 128K tokens, compared to full retraining.
| Model | Context Window | Position Encoding | Release |
|---|---|---|---|
| GPT-4 Turbo | 128K tokens | Unknown (proprietary) | 2023 |
| Claude 3.5 Sonnet | 200K tokens | Proprietary | 2024 |
| Gemini 1.5 Pro | 1M tokens | Proprietary | 2024 |
| Gemini 2.0 | 2M tokens | Proprietary | 2025 |
| Llama 3.1 | 128K tokens | RoPE | 2024 |
| Mistral Large | 128K tokens | Sliding Window + RoPE | 2024 |
| Yi-Lightning | 200K tokens | RoPE (extended) | 2024 |
Research by Liu et al. (2023) demonstrated that LLMs perform significantly worse when relevant information is placed in the middle of the context window compared to the beginning or end: accuracy follows a U-shaped curve, high when the relevant passage appears at the start or end of the prompt and dropping sharply at mid-context positions.

Mitigation strategies include placing the most important documents at the beginning or end of the prompt, reranking retrieved passages so the strongest matches avoid the middle, and trimming irrelevant context rather than padding the window.

The needle-in-a-haystack test evaluates long-context retrieval by inserting a short, unique fact (the "needle") at varying depths within long filler text (the "haystack") and prompting the model to recall it.

This evaluation reveals how retrieval accuracy varies with both total context length and needle depth; many models degrade at mid-context positions and near their advertised context limit.
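A minimal harness for the needle-in-a-haystack test can be sketched as follows; the `query_model` call in the comment is a hypothetical stand-in for the LLM under evaluation:

```python
def build_haystack(filler: str, needle: str,
                   total_chars: int, depth: float) -> str:
    """Embed `needle` at fractional `depth` (0.0 = start, 1.0 = end)
    of repeated filler text of roughly `total_chars` characters."""
    haystack = (filler * (total_chars // len(filler) + 1))[:total_chars]
    pos = int(depth * len(haystack))
    return haystack[:pos] + " " + needle + " " + haystack[pos:]

needle = "The magic number is 7481."
doc = build_haystack("Lorem ipsum dolor sit amet. ", needle,
                     total_chars=10_000, depth=0.5)
assert needle in doc

# Sweep depths and context sizes, scoring retrieval at each point:
# for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
#     prompt = build_haystack(filler, needle, n, depth) + "\nWhat is the magic number?"
#     answer = query_model(prompt)  # hypothetical LLM call, scored for recall
```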
```python
import torch

def apply_rope(x, seq_len, dim, base=10000):
    """Apply rotary position embeddings to x of shape [seq_len, dim]."""
    # RoPE frequencies: theta_j = base^(-2j/dim) for each dimension pair j
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, freqs)  # [seq_len, dim/2]
    cos_vals = torch.cos(angles)
    sin_vals = torch.sin(angles)
    # Treat consecutive dimension pairs as 2D points...
    x_pairs = x.view(*x.shape[:-1], -1, 2)
    x_real, x_imag = x_pairs[..., 0], x_pairs[..., 1]
    # ...and rotate each pair by its position-dependent angle
    out_real = x_real * cos_vals - x_imag * sin_vals
    out_imag = x_real * sin_vals + x_imag * cos_vals
    return torch.stack([out_real, out_imag], dim=-1).flatten(-2)

def yarn_scale_freqs(base_freqs, original_len, target_len,
                     beta_fast=32, beta_slow=1):
    """YaRN 'NTK-by-parts' frequency scaling for context extension."""
    scale = target_len / original_len
    # Rotations each dimension completes over the original context window
    rotations = original_len * base_freqs / (2 * torch.pi)
    # Ramp from 0 (>= beta_fast rotations: preserve the original frequency)
    # to 1 (<= beta_slow rotations: fully interpolate)
    blend = ((beta_fast - rotations) / (beta_fast - beta_slow)).clamp(0, 1)
    interpolated = base_freqs / scale  # position interpolation for low freqs
    return base_freqs * (1 - blend) + interpolated * blend
```