A context window is the maximum number of tokens a large language model (LLM) can process in a single request, encompassing all input and output combined. It functions as the model's working memory: everything the model can “see” and reason about at once.
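Because input and output share one budget, a long prompt directly limits how much the model can generate. A minimal sketch of that arithmetic, using a hypothetical `fits_in_window` helper and illustrative token counts:

```python
def fits_in_window(input_tokens: int, max_output_tokens: int,
                   context_window: int = 128_000) -> bool:
    """The window is a single budget: prompt and completion share it."""
    return input_tokens + max_output_tokens <= context_window

# A 120,000-token prompt leaves room for at most 8,000 output tokens
# in a 128k window; one token more and the request no longer fits.
print(fits_in_window(120_000, 8_000))   # True
print(fits_in_window(120_000, 8_001))   # False
```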
Context windows are a direct consequence of the transformer architecture that underlies modern LLMs. The self-attention mechanism computes relationships between every pair of tokens, producing an O(n²) computational cost — doubling the sequence length quadruples the compute and memory required.
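The quadratic scaling follows from counting pairs: every token attends to every token, so the attention score matrix has n × n entries. A tiny sketch of that growth:

```python
def attention_pairs(n: int) -> int:
    """Self-attention relates every token to every other: n * n query-key pairs."""
    return n * n

# The score matrix grows quadratically with sequence length.
for n in (1_024, 2_048, 4_096):
    print(n, attention_pairs(n))

# Doubling the sequence length quadruples the pair count.
print(attention_pairs(2_048) // attention_pairs(1_024))  # 4
```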
Each token passes through attention layers where queries, keys, and values are computed via dot products. The results are softmax-weighted and aggregated to produce contextual representations. A KV cache stores key-value pairs from prior tokens to avoid redundant computation during generation, but this cache grows linearly with sequence length, consuming substantial GPU memory.
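The attention step described above can be sketched in plain Python, with toy dimensions and stand-in embeddings standing in for learned projections (real implementations use batched GPU tensor operations):

```python
import math

def attend(q, keys, values):
    """Scaled dot-product attention: one query against all cached key/value pairs."""
    d = len(q)
    # Dot product of the query with every cached key, scaled by sqrt(d).
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    # Softmax over all cached positions (max-subtracted for numerical stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of values -> the token's contextual representation.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

# The KV cache: keys and values from prior tokens, reused at each generation
# step instead of being recomputed. It grows by one entry per token.
k_cache, v_cache = [], []
d = 4
for step in range(3):
    token_vec = [float(step + i) for i in range(d)]   # stand-in embedding
    k_cache.append(token_vec)
    v_cache.append(token_vec)
    out = attend(token_vec, k_cache, v_cache)

print(len(k_cache))   # 3 cached entries after 3 tokens: linear growth
print(len(out))       # 4: one contextual vector, same width as the embeddings
```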
Tokens beyond the window limit are simply truncated — the model has no access to them.
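Truncation itself is trivial to express. A sketch with a hypothetical `truncate_to_window` helper and integer stand-ins for token IDs:

```python
def truncate_to_window(token_ids: list[int], window: int) -> list[int]:
    """Keep only the most recent `window` tokens; the rest are invisible to the model."""
    return token_ids[-window:]

history = list(range(600))            # 600 tokens of stand-in IDs
visible = truncate_to_window(history, 512)
print(len(visible), visible[0])       # 512 88 -- the first 88 tokens are gone
```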
Early transformer models operated with short context windows (512 tokens for BERT, 1,024 for GPT-2), constrained by fixed-length learned positional embeddings and the quadratic scaling of attention.
Architectural and systems advances, including rotary position embeddings (RoPE), grouped-query attention, and memory-efficient kernels such as flash attention, enabled rapid expansion:
| Model | Context Window |
|---|---|
| GPT-4 (original) | 8,192 tokens |
| GPT-4o | 128,000 tokens |
| Llama 3.1 | 128,000 tokens |
| Claude 3.5 Sonnet | 200,000 tokens |
| Claude Sonnet 4 | 1,000,000 tokens |
| Gemini 1.5 Flash | 1,000,000 tokens |
| Gemini 1.5 Pro | 2,000,000 tokens |
The context window is shared across several categories of content:

- the system prompt and instructions
- conversation history from earlier turns
- documents, code, or other retrieved material supplied as input
- tool definitions and tool-call results
- the model's own generated output
All of these compete for the same fixed budget. Exceeding the limit forces truncation, typically dropping the earliest content.
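A naive trimming policy over this shared budget might look like the sketch below; the `trim_messages` helper, message list, and per-message token counts are all illustrative. Note that this simple drop-the-earliest policy will eventually discard even the system prompt, which production systems usually pin in place:

```python
def trim_messages(messages, budget, count_tokens):
    """Drop the earliest messages until the conversation fits the token budget."""
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > budget:
        kept.pop(0)   # earliest content is dropped first
    return kept

# Hard-coded counts stand in for a real tokenizer here.
msgs = ["system prompt", "old turn", "recent turn", "newest turn"]
counts = {"system prompt": 50, "old turn": 400, "recent turn": 300, "newest turn": 200}

kept = trim_messages(msgs, budget=600, count_tokens=counts.get)
print(kept)   # ['recent turn', 'newest turn'] -- the system prompt was lost too
```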
Research has shown that LLMs perform best on information placed at the beginning or end of the context window, with measurable degradation for content buried in the middle. This “lost-in-the-middle” effect means that simply having a large window does not guarantee the model will use all of its contents effectively.
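One common mitigation in retrieval pipelines is to reorder documents so the highest-ranked ones sit at the edges of the prompt and the lowest-ranked ones fall in the middle. A sketch, with a hypothetical `edge_order` helper taking documents already sorted by relevance:

```python
def edge_order(docs_by_relevance):
    """Alternate documents between the front and the back of the prompt,
    so the least relevant ones end up buried in the middle."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

# d1 is most relevant, d5 least: d1 and d2 land at the edges, d5 in the middle.
print(edge_order(["d1", "d2", "d3", "d4", "d5"]))
# ['d1', 'd3', 'd5', 'd4', 'd2']
```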
The context window determines the fundamental capabilities of an LLM application. Small windows suit focused, low-cost tasks. Large windows enable processing entire codebases, legal documents, or book-length texts in a single pass — but at higher computational cost and with attention-dilution risks. Choosing the right window size is a core architectural decision for any AI system.