====== What Is an LLM Context Window ======

A **context window** is the maximum number of tokens a large language model (LLM) can process in a single request, encompassing all input and output combined. It functions as the model's working memory: everything the model can "see" and reason about at once. ((Source: [[https://redis.io/blog/llm-context-windows/|Redis - LLM Context Windows]]))

===== How It Works =====

Context windows are a direct consequence of the **transformer architecture** that underlies modern LLMs. The self-attention mechanism computes relationships between every pair of tokens, producing an O(n²) computational cost: doubling the sequence length quadruples the compute and memory required. ((Source: [[https://www.datacamp.com/blog/context-window|DataCamp - Context Window]]))

Each token passes through attention layers where queries, keys, and values are compared via dot products. The results are softmax-weighted and aggregated to produce contextual representations. A **KV cache** stores key-value pairs from prior tokens to avoid redundant computation during generation, but this cache grows linearly with sequence length, consuming substantial GPU memory. ((Source: [[https://redis.io/blog/llm-context-windows/|Redis - LLM Context Windows]])) Tokens beyond the window limit are simply truncated; the model has no access to them.

===== Evolution of Context Window Sizes =====

Early transformer models operated with small context windows (512 tokens for BERT, 1,024 for GPT-2), constrained by standard positional encodings and the quadratic scaling of attention.
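The quadratic growth of attention compute and the linear growth of the KV cache can be sketched numerically. This is an illustrative sketch only: the model dimension, layer count, head sizes, and fp16 precision below are assumptions, not any specific model's configuration.

```python
# Illustrative sketch of context-window scaling. All dimensions
# (d_model, layers, KV heads, head_dim, fp16) are assumptions.

def attention_flops(seq_len: int, d_model: int = 4096) -> int:
    """Rough FLOPs for one self-attention layer: the QK^T score matrix
    and the softmax-weighted aggregation each cost O(n^2 * d)."""
    return 2 * seq_len * seq_len * d_model

def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """The KV cache grows linearly: one key and one value vector per
    token, per layer, per KV head (fp16 = 2 bytes per element)."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

for n in (4_096, 8_192, 16_384):
    print(f"{n:>6} tokens: {attention_flops(n):.2e} FLOPs/layer, "
          f"{kv_cache_bytes(n) / 2**30:.1f} GiB KV cache")
```

Doubling the sequence length quadruples the per-layer attention FLOPs while the cache merely doubles, matching the quadratic-versus-linear growth described above.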
((Source: [[https://redis.io/blog/llm-context-windows/|Redis - LLM Context Windows]])) Architectural advances, including rotary position embeddings (RoPE), grouped-query attention, flash attention, and speculative decoding, enabled rapid expansion:

^ Model ^ Context Window ^
| GPT-4 (original) | 8,192 tokens |
| GPT-4o | 128,000 tokens |
| Llama 3.1 | 128,000 tokens |
| Claude 3.5 Sonnet | 200,000 tokens |
| Claude Sonnet 4 | 1,000,000 tokens |
| Gemini 1.5 Flash | 1,000,000 tokens |
| Gemini 2.5 Pro | 2,000,000 tokens |

===== What Fills the Context Window =====

The context window is shared across several categories of content:

  * **[[instructional_context|Instructional Context]]** — System prompts, persona definitions, and behavioral rules
  * **[[background_context|Background Context]]** — Retrieved documents, knowledge base entries, and supplementary data
  * **[[operational_context|Operational Context]]** — The user's current query and immediate task inputs
  * **[[historical_context|Historical Context]]** — Prior conversation turns providing continuity
  * **Output tokens** — Space reserved for the model's generated response

All of these compete for the same fixed budget. Exceeding the limit forces truncation, typically dropping the earliest content. ((Source: [[https://nebius.com/blog/posts/context-window-in-ai|Nebius - Context Window in AI]]))

===== The Lost-in-the-Middle Problem =====

Research has shown that LLMs perform best on information placed at the **beginning** or **end** of the context window, with measurable degradation for content buried in the middle. This "lost-in-the-middle" effect means that simply having a large window does not guarantee the model will use all of its contents effectively. ((Source: [[https://redis.io/blog/llm-context-windows/|Redis - LLM Context Windows]]))

===== Why It Matters =====

The context window determines the fundamental capabilities of an LLM application. Small windows suit focused, low-cost tasks.
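The fixed-budget competition described above can be sketched as a simple truncation routine. This is a minimal sketch: the segment names and token counts are invented, and the drop-oldest-first policy mirrors the "typically dropping the earliest content" behavior. Production systems usually pin the system prompt rather than letting it be dropped.

```python
# Minimal sketch of fitting content into a fixed token budget.
# Segment names and token counts are invented; real systems usually
# pin the system prompt instead of dropping it first.

def fit_to_window(segments, window, reserved_output):
    """Drop the earliest segments until input + reserved output fits."""
    budget = window - reserved_output
    kept = list(segments)
    while kept and sum(tokens for _, tokens in kept) > budget:
        kept.pop(0)  # truncate the oldest content first
    return kept

segments = [
    ("system prompt", 400),    # instructional context
    ("retrieved docs", 3000),  # background context
    ("earlier turns", 2500),   # historical context
    ("current query", 300),    # operational context
]
print(fit_to_window(segments, window=4096, reserved_output=512))
# -> [('earlier turns', 2500), ('current query', 300)]
```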
Large windows enable processing entire codebases, legal documents, or book-length texts in a single pass, but at higher computational cost and with attention-dilution risks. Choosing the right window size is a core architectural decision for any AI system. ((Source: [[https://www.mckinsey.com/featured-insights/mckinsey-explainers/what-is-a-context-window|McKinsey - What Is a Context Window]]))

===== See Also =====

  * [[operational_context|Operational Context]]
  * [[background_context|Background Context]]
  * [[instructional_context|Instructional Context]]
  * [[historical_context|Historical Context]]
  * [[million_token_context_window|Value of 1-Million-Token Context Windows]]

===== References =====