Model Context Window

The context window defines the maximum number of tokens a language model can process in a single forward pass. It determines how much text the model can “see” at once, encompassing the system prompt, conversation history, and generated output. Extending context windows from 512 tokens (BERT) to over 1 million tokens (Gemini) has been one of the most impactful advances in LLM capabilities, enabled primarily by innovations in positional encoding.

How Context Windows Work

In a Transformer, every token attends to every other token within the context window. The model has no memory beyond this window – information outside it is invisible. The context budget must accommodate:

Total token usage: $\text{tokens}_{\text{system}} + \text{tokens}_{\text{input}} + \text{tokens}_{\text{output}} \leq \text{context\_window}$

The computational cost of attention scales as $O(n^2)$ in sequence length $n$, making long context windows expensive. Memory for the KV cache grows as $O(n \cdot L \cdot h \cdot d_k)$ where $L$ is layers, $h$ is heads, and $d_k$ is head dimension.
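
As a concrete illustration, the KV-cache formula above converts directly into a memory estimate. This is a sketch assuming fp16 (2 bytes per value) and a Llama-2-7B-like shape (32 layers, 32 heads, head dimension 128); the helper name is ours:

```python
def kv_cache_bytes(seq_len, n_layers, n_heads, head_dim, bytes_per_elem=2):
    # K and V each hold seq_len * n_heads * head_dim values per layer (hence the 2x)
    return 2 * seq_len * n_layers * n_heads * head_dim * bytes_per_elem

# A Llama-2-7B-like configuration at a 4K context, fp16:
mem = kv_cache_bytes(seq_len=4096, n_layers=32, n_heads=32, head_dim=128)
print(f"{mem / 2**30:.1f} GiB")  # prints "2.0 GiB"
```

Doubling the context to 8K doubles this to 4 GiB, which is why long-context serving leans on tricks like grouped-query attention and cache quantization.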

Positional Encoding Methods

Since Transformers process all tokens in parallel, they need explicit position information. The choice of positional encoding determines how well models generalize to sequence lengths beyond training.

Sinusoidal (Original)

The original Transformer used fixed sinusoidal functions:

$$PE_{(pos,2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

These are added to token embeddings. While elegant, sinusoidal encodings do not extrapolate well beyond the trained sequence length.
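
A minimal sketch of the sinusoidal scheme above (the helper name is ours):

```python
import math
import torch

def sinusoidal_pe(seq_len, d_model, base=10000.0):
    # PE[pos, 2i] = sin(pos / base^(2i/d)), PE[pos, 2i+1] = cos(pos / base^(2i/d))
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    # Inverse frequencies base^(-2i/d), computed in log space for stability
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(base) / d_model))
    pe[:, 0::2] = torch.sin(position * div)
    pe[:, 1::2] = torch.cos(position * div)
    return pe

pe = sinusoidal_pe(seq_len=128, d_model=64)  # added to the token embeddings
```

At position 0 the even dimensions are all 0 and the odd dimensions all 1, so the first row encodes "start of sequence" exactly.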

RoPE (Rotary Position Embedding)

RoPE encodes position by rotating query and key vectors in 2D subspaces:

$$q_m = q \cdot e^{im\theta}, \quad k_n = k \cdot e^{in\theta}$$

where $\theta_j = 10000^{-2j/d}$ defines the rotation frequency for dimension pair $j$. The key property is that the dot product between rotated queries and keys depends only on relative position:

$$q_m^T k_n = \text{Re}[q \cdot \bar{k} \cdot e^{i(m-n)\theta}]$$

In matrix form, for each 2D subspace at position $m$:

$$R_m = \begin{pmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{pmatrix}$$

RoPE is used by Llama, Mistral, Qwen, and most modern open-weight LLMs. Its advantages include capturing relative positions naturally, no learnable parameters, and amenability to context extension.
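
The relative-position property can be checked numerically: rotating q to position m and k to position n gives the same dot product as rotating them to m + Δ and n + Δ. A sketch using PyTorch's complex-number view (the helper name is ours):

```python
import torch

def rope_rotate(x, pos, base=10000.0):
    # Treat consecutive dimension pairs as complex numbers; rotate pair j by pos * theta_j
    d = x.shape[-1]
    theta = base ** (-torch.arange(0, d, 2).float() / d)
    xc = torch.view_as_complex(x.reshape(-1, d // 2, 2))
    rot = torch.polar(torch.ones_like(theta), pos * theta)  # e^{i * pos * theta_j}
    return torch.view_as_real(xc * rot).flatten(-2)

torch.manual_seed(0)
q, k = torch.randn(1, 8), torch.randn(1, 8)
# Same relative offset m - n = 3, shifted by 100 absolute positions
score_a = (rope_rotate(q, 5.0) * rope_rotate(k, 2.0)).sum()
score_b = (rope_rotate(q, 105.0) * rope_rotate(k, 102.0)).sum()
assert torch.allclose(score_a, score_b, atol=1e-4)  # only m - n matters
```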

ALiBi (Attention with Linear Biases)

ALiBi adds a linear bias to attention scores based on distance between tokens:

$$\text{Attention}_{ij} = q_i^T k_j - m \cdot |i - j|$$

where $m$ is a head-specific slope (set geometrically, e.g., $m_h = 2^{-8h/H}$ for head $h$ of $H$ total). ALiBi requires no positional embeddings at all and extrapolates better than sinusoidal encodings, with zero additional parameters. It is used in BLOOM and MPT.
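
The bias matrix above is cheap to precompute once per sequence length; a sketch (the helper name is ours):

```python
import torch

def alibi_bias(seq_len, n_heads):
    # Head-specific slopes m_h = 2^(-8h/H), h = 1..H (a geometric sequence)
    slopes = torch.tensor([2.0 ** (-8.0 * h / n_heads) for h in range(1, n_heads + 1)])
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs().float()  # |i - j|
    # Added to q_i^T k_j before the softmax; shape [n_heads, seq_len, seq_len]
    return -slopes[:, None, None] * dist

bias = alibi_bias(seq_len=4, n_heads=8)
# The diagonal (i == j) is zero; penalties grow linearly with distance,
# with earlier heads (larger slopes) attending more locally.
```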

YaRN (Yet another RoPE extensioN)

YaRN extends RoPE-based models to longer contexts by partitioning the frequency dimensions: high-frequency dimensions, whose wavelengths fit many times inside the original context, are left untouched to preserve local detail; low-frequency dimensions are interpolated (divided by the scale factor); and dimensions in between are blended with a smooth ramp. YaRN additionally applies a mild temperature correction to the attention logits.

This approach requires only a small amount of fine-tuning (e.g., 400 steps) to extend a 4K-context model to 64K or 128K tokens, compared to full retraining.

Long Context Models

Model               Context Window   Position Encoding       Release
GPT-4 Turbo         128K tokens      Unknown (proprietary)   2023
Claude 3.5 Sonnet   200K tokens      Proprietary             2024
Gemini 1.5 Pro      1M tokens        Proprietary             2024
Gemini 2.0          2M tokens        Proprietary             2025
Llama 3.1           128K tokens      RoPE                    2024
Mistral Large       128K tokens      Sliding Window + RoPE   2024
Yi-Lightning        200K tokens      RoPE (extended)         2024

Lost-in-the-Middle

Research by Liu et al. (2023) demonstrated that LLMs perform significantly worse when relevant information is placed in the middle of the context window than at the beginning or end: accuracy as a function of the relevant passage's position traces a U-shaped curve, even for models tuned for long contexts.

Mitigation strategies:

  1. Place critical information at the beginning or end of prompts
  2. Use retrieval-augmented generation to surface relevant passages
  3. Fine-tune with varied information placement
  4. Apply position interpolation techniques that improve mid-sequence attention
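
Strategy 1 can be sketched as a simple reordering of retrieved passages, pushing the least relevant toward the middle (a hypothetical helper, not from any particular library):

```python
def reorder_for_position_bias(passages_by_relevance):
    # Hypothetical helper: alternately assign the most relevant passages to the
    # front and back of the prompt, so the least relevant end up in the middle.
    front, back = [], []
    for i, passage in enumerate(passages_by_relevance):
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]

order = reorder_for_position_bias(["p1", "p2", "p3", "p4", "p5"])
# The two most relevant passages ("p1", "p2") land first and last.
```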

Needle-in-a-Haystack Evaluation

The needle-in-a-haystack test evaluates long-context retrieval by:

  1. Embedding a specific fact (“needle”) at a controlled position within irrelevant text (“haystack”)
  2. Varying both the total context length and the needle's relative position (0-100% depth)
  3. Measuring whether the model can accurately retrieve the needle fact
  4. Producing a 2D heatmap of accuracy vs. (context length, needle depth)
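
The sweep above can be sketched as follows (hypothetical helpers; the actual model query and scoring step is omitted):

```python
def needle_positions(context_lengths, depths):
    # Build the (context length, needle depth) grid for the sweep
    return [(n, d) for n in context_lengths for d in depths]

def insert_needle(haystack_tokens, needle_tokens, depth):
    # Place the needle at a relative depth in [0, 1] within the filler text
    idx = int(len(haystack_tokens) * depth)
    return haystack_tokens[:idx] + needle_tokens + haystack_tokens[idx:]

grid = needle_positions([1000, 4000, 16000], [0.0, 0.25, 0.5, 0.75, 1.0])
ctx = insert_needle(["filler"] * 100, ["NEEDLE"], depth=0.5)
# For each (length, depth) cell, one would query the model for the needle fact
# and record accuracy, yielding the 2D heatmap described above.
```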

This evaluation reveals retrieval blind spots that aggregate benchmarks miss: models that score perfectly at short lengths often lose the needle at middle depths as the context grows toward its advertised limit.

Code Example

import torch
import math

def apply_rope(x, seq_len, dim, base=10000):
    # x: [seq_len, dim] with dim even; consecutive dimension pairs form the 2D subspaces
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))  # theta_j, [dim/2]
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, freqs)  # [seq_len, dim/2]

    # Per-position rotation angles m * theta_j for each 2D subspace
    cos_vals = torch.cos(angles)
    sin_vals = torch.sin(angles)

    # Treat each consecutive pair (x_{2j}, x_{2j+1}) as a point in the plane
    x_pairs = x.view(*x.shape[:-1], -1, 2)
    x_real, x_imag = x_pairs[..., 0], x_pairs[..., 1]

    # Rotate each pair by its position-dependent angle
    out_real = x_real * cos_vals - x_imag * sin_vals
    out_imag = x_real * sin_vals + x_imag * cos_vals

    return torch.stack([out_real, out_imag], dim=-1).flatten(-2)

def yarn_scale_freqs(base_freq, original_len, target_len, beta_fast=32, beta_slow=1):
    # YaRN ("NTK-by-parts"): interpolate low frequencies, preserve high ones.
    # base_freq: tensor of per-dimension RoPE frequencies theta_j
    scale = target_len / original_len
    interpolated = base_freq / scale

    # r = how many times each dimension's wavelength fits in the original context
    wavelength = 2 * math.pi / base_freq
    r = original_len / wavelength

    # Ramp gamma: 0 where r < beta_slow (fully interpolate),
    # 1 where r > beta_fast (keep the original frequency)
    gamma = ((r - beta_slow) / (beta_fast - beta_slow)).clamp(0, 1)
    return interpolated * (1 - gamma) + base_freq * gamma
