A context window is the maximum number of tokens a large language model (LLM) can process in a single request, encompassing all input and output combined. It functions as the model's working memory: everything the model can “see” and reason about at once.
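Because input and output share one budget, a long prompt directly limits how much the model can generate. A minimal sketch of that arithmetic, using a hypothetical `fits_in_window` helper and illustrative token counts:

```python
def fits_in_window(input_tokens: int, max_output_tokens: int,
                   context_window: int = 128_000) -> bool:
    """The window is a single budget: prompt and completion share it."""
    return input_tokens + max_output_tokens <= context_window

# A 120,000-token prompt leaves room for at most 8,000 output tokens
# in a 128k window; one token more and the request no longer fits.
print(fits_in_window(120_000, 8_000))   # True
print(fits_in_window(120_000, 8_001))   # False
```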
Context windows are a direct consequence of the transformer architecture that underlies modern LLMs. The self-attention mechanism computes relationships between every pair of tokens, producing an O(n²) computational cost — doubling the sequence length quadruples the compute and memory required.
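The quadratic scaling follows from counting pairs: every token attends to every token, so the attention score matrix has n × n entries. A tiny sketch of that growth:

```python
def attention_pairs(n: int) -> int:
    """Self-attention relates every token to every other: n * n query-key pairs."""
    return n * n

# The score matrix grows quadratically with sequence length.
for n in (1_024, 2_048, 4_096):
    print(n, attention_pairs(n))

# Doubling the sequence length quadruples the pair count.
print(attention_pairs(2_048) // attention_pairs(1_024))  # 4
```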
Each token passes through attention layers where queries, keys, and values are computed via dot products. The results are softmax-weighted and aggregated to produce contextual representations. A KV cache stores key-value pairs from prior tokens to avoid redundant computation during generation, but this cache grows linearly with sequence length, consuming substantial GPU memory.
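The attention step described above can be sketched in plain Python, with toy dimensions and stand-in embeddings standing in for learned projections (real implementations use batched GPU tensor operations):

```python
import math

def attend(q, keys, values):
    """Scaled dot-product attention: one query against all cached key/value pairs."""
    d = len(q)
    # Dot product of the query with every cached key, scaled by sqrt(d).
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    # Softmax over all cached positions (max-subtracted for numerical stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of values -> the token's contextual representation.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

# The KV cache: keys and values from prior tokens, reused at each generation
# step instead of being recomputed. It grows by one entry per token.
k_cache, v_cache = [], []
d = 4
for step in range(3):
    token_vec = [float(step + i) for i in range(d)]   # stand-in embedding
    k_cache.append(token_vec)
    v_cache.append(token_vec)
    out = attend(token_vec, k_cache, v_cache)

print(len(k_cache))   # 3 cached entries after 3 tokens: linear growth
print(len(out))       # 4: one contextual vector, same width as the embeddings
```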
Tokens beyond the window limit are simply truncated — the model has no access to them.
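Truncation itself is trivial to express. A sketch with a hypothetical `truncate_to_window` helper and integer stand-ins for token IDs:

```python
def truncate_to_window(token_ids: list[int], window: int) -> list[int]:
    """Keep only the most recent `window` tokens; the rest are invisible to the model."""
    return token_ids[-window:]

history = list(range(600))            # 600 tokens of stand-in IDs
visible = truncate_to_window(history, 512)
print(len(visible), visible[0])       # 512 88 -- the first 88 tokens are gone
```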
Early transformer models operated with short context windows (512 tokens for BERT, 1,024 for GPT-2), constrained by fixed-length learned positional embeddings and the quadratic scaling of attention.
Architectural and systems advances, including rotary position embeddings (RoPE), grouped-query attention, and memory-efficient kernels such as flash attention, enabled rapid expansion:
| Model | Context Window |
|---|---|
| GPT-4 (original) | 8,192 tokens |
| GPT-4o | 128,000 tokens |
| Llama 3.1 | 128,000 tokens |
| Claude 3.5 Sonnet | 200,000 tokens |
| Claude Sonnet 4 | 1,000,000 tokens |
| Gemini 1.5 Flash | 1,000,000 tokens |
| Gemini 1.5 Pro | 2,000,000 tokens |
The context window is shared across several categories of content:

- the system prompt and instructions
- conversation history from earlier turns
- documents, code, or other retrieved material supplied as input
- tool definitions and tool-call results
- the model's own generated output
All of these compete for the same fixed budget. Exceeding the limit forces truncation, typically dropping the earliest content.
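A naive trimming policy over this shared budget might look like the sketch below; the `trim_messages` helper, message list, and per-message token counts are all illustrative. Note that this simple drop-the-earliest policy will eventually discard even the system prompt, which production systems usually pin in place:

```python
def trim_messages(messages, budget, count_tokens):
    """Drop the earliest messages until the conversation fits the token budget."""
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > budget:
        kept.pop(0)   # earliest content is dropped first
    return kept

# Hard-coded counts stand in for a real tokenizer here.
msgs = ["system prompt", "old turn", "recent turn", "newest turn"]
counts = {"system prompt": 50, "old turn": 400, "recent turn": 300, "newest turn": 200}

kept = trim_messages(msgs, budget=600, count_tokens=counts.get)
print(kept)   # ['recent turn', 'newest turn'] -- the system prompt was lost too
```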
Research has shown that LLMs perform best on information placed at the beginning or end of the context window, with measurable degradation for content buried in the middle. This “lost-in-the-middle” effect means that simply having a large window does not guarantee the model will use all of its contents effectively.
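One common mitigation in retrieval pipelines is to reorder documents so the highest-ranked ones sit at the edges of the prompt and the lowest-ranked ones fall in the middle. A sketch, with a hypothetical `edge_order` helper taking documents already sorted by relevance:

```python
def edge_order(docs_by_relevance):
    """Alternate documents between the front and the back of the prompt,
    so the least relevant ones end up buried in the middle."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

# d1 is most relevant, d5 least: d1 and d2 land at the edges, d5 in the middle.
print(edge_order(["d1", "d2", "d3", "d4", "d5"]))
# ['d1', 'd3', 'd5', 'd4', 'd2']
```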
The context window determines the fundamental capabilities of an LLM application. Small windows suit focused, low-cost tasks. Large windows enable processing entire codebases, legal documents, or book-length texts in a single pass — but at higher computational cost and with attention-dilution risks. Choosing the right window size is a core architectural decision for any AI system.