Model Context Window

The context window defines the maximum number of tokens a language model can process in a single forward pass. It determines how much text the model can “see” at once, encompassing the system prompt, conversation history, and generated output. Extending context windows from 512 tokens (BERT) to over 1 million tokens (Gemini) has been one of the most impactful advances in LLM capabilities, enabled primarily by innovations in positional encoding.

How Context Windows Work

In a Transformer, every token attends to every other token within the context window. The model has no memory beyond this window – information outside it is invisible. The context budget must accommodate:

Total token usage: $\text{tokens}_{\text{system}} + \text{tokens}_{\text{input}} + \text{tokens}_{\text{output}} \leq \text{context\_window}$

The computational cost of attention scales as $O(n^2)$ in sequence length $n$, making long context windows expensive. Memory for the KV cache grows as $O(n \cdot L \cdot h \cdot d_k)$ where $L$ is layers, $h$ is heads, and $d_k$ is head dimension.
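
As a concrete illustration, the KV-cache formula above converts directly into a memory estimate. This is a sketch assuming fp16 (2 bytes per value) and a Llama-2-7B-like shape (32 layers, 32 heads, head dimension 128); the helper name is ours:

```python
def kv_cache_bytes(seq_len, n_layers, n_heads, head_dim, bytes_per_elem=2):
    # K and V each hold seq_len * n_heads * head_dim values per layer (hence the 2x)
    return 2 * seq_len * n_layers * n_heads * head_dim * bytes_per_elem

# A Llama-2-7B-like configuration at a 4K context, fp16:
mem = kv_cache_bytes(seq_len=4096, n_layers=32, n_heads=32, head_dim=128)
print(f"{mem / 2**30:.1f} GiB")  # prints "2.0 GiB"
```

Doubling the context to 8K doubles this to 4 GiB, which is why long-context serving leans on tricks like grouped-query attention and cache quantization.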

Positional Encoding Methods

Since Transformers process all tokens in parallel, they need explicit position information. The choice of positional encoding determines how well models generalize to sequence lengths beyond training.

Sinusoidal (Original)

The original Transformer used fixed sinusoidal functions:

$$PE_{(pos,2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

These are added to token embeddings. While elegant, sinusoidal encodings do not extrapolate well beyond the trained sequence length.
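
A minimal sketch of the sinusoidal scheme above (the helper name is ours):

```python
import math
import torch

def sinusoidal_pe(seq_len, d_model, base=10000.0):
    # PE[pos, 2i] = sin(pos / base^(2i/d)), PE[pos, 2i+1] = cos(pos / base^(2i/d))
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    # Inverse frequencies base^(-2i/d), computed in log space for stability
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(base) / d_model))
    pe[:, 0::2] = torch.sin(position * div)
    pe[:, 1::2] = torch.cos(position * div)
    return pe

pe = sinusoidal_pe(seq_len=128, d_model=64)  # added to the token embeddings
```

At position 0 the even dimensions are all 0 and the odd dimensions all 1, so the first row encodes "start of sequence" exactly.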

RoPE (Rotary Position Embedding)

RoPE encodes position by rotating query and key vectors in 2D subspaces:

$$q_m = q \cdot e^{im\theta}, \quad k_n = k \cdot e^{in\theta}$$

where $\theta_j = 10000^{-2j/d}$ defines the rotation frequency for dimension pair $j$. The key property is that the dot product between rotated queries and keys depends only on relative position:

$$q_m^T k_n = \text{Re}[q \cdot \bar{k} \cdot e^{i(m-n)\theta}]$$

In matrix form, for each 2D subspace at position $m$:

$$R_m = \begin{pmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{pmatrix}$$

RoPE is used by Llama, Mistral, Qwen, and most modern open-weight LLMs. Its advantages include capturing relative positions naturally, no learnable parameters, and amenability to context extension.
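
The relative-position property can be checked numerically: rotating q to position m and k to position n gives the same dot product as rotating them to m + Δ and n + Δ. A sketch using PyTorch's complex-number view (the helper name is ours):

```python
import torch

def rope_rotate(x, pos, base=10000.0):
    # Treat consecutive dimension pairs as complex numbers; rotate pair j by pos * theta_j
    d = x.shape[-1]
    theta = base ** (-torch.arange(0, d, 2).float() / d)
    xc = torch.view_as_complex(x.reshape(-1, d // 2, 2))
    rot = torch.polar(torch.ones_like(theta), pos * theta)  # e^{i * pos * theta_j}
    return torch.view_as_real(xc * rot).flatten(-2)

torch.manual_seed(0)
q, k = torch.randn(1, 8), torch.randn(1, 8)
# Same relative offset m - n = 3, shifted by 100 absolute positions
score_a = (rope_rotate(q, 5.0) * rope_rotate(k, 2.0)).sum()
score_b = (rope_rotate(q, 105.0) * rope_rotate(k, 102.0)).sum()
assert torch.allclose(score_a, score_b, atol=1e-4)  # only m - n matters
```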

ALiBi (Attention with Linear Biases)

ALiBi adds a linear bias to attention scores based on distance between tokens:

$$\text{Attention}_{ij} = q_i^T k_j - m \cdot |i - j|$$

where $m$ is a head-specific slope (set geometrically, e.g., $m_h = 2^{-8h/H}$ for head $h$ of $H$ total). ALiBi requires no positional embeddings at all and extrapolates better than sinusoidal encodings, with zero additional parameters. It is used in BLOOM and MPT.
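
The bias matrix above is cheap to precompute once per sequence length; a sketch (the helper name is ours):

```python
import torch

def alibi_bias(seq_len, n_heads):
    # Head-specific slopes m_h = 2^(-8h/H), h = 1..H (a geometric sequence)
    slopes = torch.tensor([2.0 ** (-8.0 * h / n_heads) for h in range(1, n_heads + 1)])
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs().float()  # |i - j|
    # Added to q_i^T k_j before the softmax; shape [n_heads, seq_len, seq_len]
    return -slopes[:, None, None] * dist

bias = alibi_bias(seq_len=4, n_heads=8)
# The diagonal (i == j) is zero; penalties grow linearly with distance,
# with earlier heads (larger slopes) attending more locally.
```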

YaRN (Yet another RoPE extensioN)

YaRN extends RoPE-based models to longer contexts by partitioning the frequency dimensions: high-frequency dimensions, whose wavelengths fit many times inside the original context, are left untouched to preserve local detail; low-frequency dimensions are interpolated (divided by the scale factor); and dimensions in between are blended with a smooth ramp. YaRN additionally applies a mild temperature correction to the attention logits.

This approach requires only a small amount of fine-tuning (e.g., 400 steps) to extend a 4K-context model to 64K or 128K tokens, compared to full retraining.

Long Context Models

Model               Context Window   Position Encoding       Release
GPT-4 Turbo         128K tokens      Unknown (proprietary)   2023
Claude 3.5 Sonnet   200K tokens      Proprietary             2024
Gemini 1.5 Pro      1M tokens        Proprietary             2024
Gemini 2.0          2M tokens        Proprietary             2025
Llama 3.1           128K tokens      RoPE                    2024
Mistral Large       128K tokens      Sliding Window + RoPE   2024
Yi-Lightning        200K tokens      RoPE (extended)         2024

Lost-in-the-Middle

Research by Liu et al. (2023) demonstrated that LLMs perform significantly worse when relevant information is placed in the middle of the context window than at the beginning or end: accuracy as a function of the relevant passage's position traces a U-shaped curve, even for models tuned for long contexts.

Mitigation strategies:

  1. Place critical information at the beginning or end of prompts
  2. Use retrieval-augmented generation to surface relevant passages
  3. Fine-tune with varied information placement
  4. Apply position interpolation techniques that improve mid-sequence attention
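
Strategy 1 can be sketched as a simple reordering of retrieved passages, pushing the least relevant toward the middle (a hypothetical helper, not from any particular library):

```python
def reorder_for_position_bias(passages_by_relevance):
    # Hypothetical helper: alternately assign the most relevant passages to the
    # front and back of the prompt, so the least relevant end up in the middle.
    front, back = [], []
    for i, passage in enumerate(passages_by_relevance):
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]

order = reorder_for_position_bias(["p1", "p2", "p3", "p4", "p5"])
# The two most relevant passages ("p1", "p2") land first and last.
```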

Needle-in-a-Haystack Evaluation

The needle-in-a-haystack test evaluates long-context retrieval by:

  1. Embedding a specific fact (“needle”) at a controlled position within irrelevant text (“haystack”)
  2. Varying both the total context length and the needle's relative position (0-100% depth)
  3. Measuring whether the model can accurately retrieve the needle fact
  4. Producing a 2D heatmap of accuracy vs. (context length, needle depth)
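
The sweep above can be sketched as follows (hypothetical helpers; the actual model query and scoring step is omitted):

```python
def needle_positions(context_lengths, depths):
    # Build the (context length, needle depth) grid for the sweep
    return [(n, d) for n in context_lengths for d in depths]

def insert_needle(haystack_tokens, needle_tokens, depth):
    # Place the needle at a relative depth in [0, 1] within the filler text
    idx = int(len(haystack_tokens) * depth)
    return haystack_tokens[:idx] + needle_tokens + haystack_tokens[idx:]

grid = needle_positions([1000, 4000, 16000], [0.0, 0.25, 0.5, 0.75, 1.0])
ctx = insert_needle(["filler"] * 100, ["NEEDLE"], depth=0.5)
# For each (length, depth) cell, one would query the model for the needle fact
# and record accuracy, yielding the 2D heatmap described above.
```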

This evaluation reveals retrieval blind spots that aggregate benchmarks miss: models that score perfectly at short lengths often lose the needle at middle depths as the context grows toward its advertised limit.

Code Example

import torch
import math

def apply_rope(x, seq_len, dim, base=10000):
    # x: [seq_len, dim] with dim even; consecutive dimension pairs form the 2D subspaces
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))  # theta_j, [dim/2]
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, freqs)  # [seq_len, dim/2]

    # Per-position rotation angles m * theta_j for each 2D subspace
    cos_vals = torch.cos(angles)
    sin_vals = torch.sin(angles)

    # Treat each consecutive pair (x_{2j}, x_{2j+1}) as a point in the plane
    x_pairs = x.view(*x.shape[:-1], -1, 2)
    x_real, x_imag = x_pairs[..., 0], x_pairs[..., 1]

    # Rotate each pair by its position-dependent angle
    out_real = x_real * cos_vals - x_imag * sin_vals
    out_imag = x_real * sin_vals + x_imag * cos_vals

    return torch.stack([out_real, out_imag], dim=-1).flatten(-2)

def yarn_scale_freqs(base_freq, original_len, target_len, beta_fast=32, beta_slow=1):
    # YaRN ("NTK-by-parts"): interpolate low frequencies, preserve high ones.
    # base_freq: tensor of per-dimension RoPE frequencies theta_j
    scale = target_len / original_len
    interpolated = base_freq / scale

    # r = how many times each dimension's wavelength fits in the original context
    wavelength = 2 * math.pi / base_freq
    r = original_len / wavelength

    # Ramp gamma: 0 where r < beta_slow (fully interpolate),
    # 1 where r > beta_fast (keep the original frequency)
    gamma = ((r - beta_slow) / (beta_fast - beta_slow)).clamp(0, 1)
    return interpolated * (1 - gamma) + base_freq * gamma
