Context window management encompasses the strategies and techniques for making effective use of a large language model's finite token limit during agent operation. Since LLMs can only process a fixed number of tokens at once, agents must decide carefully what information to include in each prompt, balancing task instructions, conversation history, retrieved knowledge, and tool outputs. Effective context management is critical for agent reliability: poor context curation leads to missed information, hallucination, and degraded performance.
As agents operate over multiple steps in their agent loops, context accumulates rapidly: each tool call produces observations, each reasoning step adds to the conversation history, and each new piece of retrieved information competes for limited token space. Research from JetBrains (2025) demonstrated that even as context windows grow larger, models often struggle to make good use of all the information they are given – making efficient management more important than simply having a larger window.
Observation masking uses a rolling window approach to manage context by keeping an agent's reasoning and actions intact while replacing older observations with placeholders once they exceed a fixed window size. This approach is fast and inexpensive, hiding old tool outputs while preserving the agent's recent work.
The critical parameter is window size, which must be tuned for each agent architecture. Research shows that different agents track conversation history differently – for example, SWE-agent skips failed retry turns while OpenHands includes all turns, requiring larger masking windows to maintain performance.
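A minimal sketch of such a rolling masking window, assuming messages are dicts whose tool outputs carry the role `tool` (the placeholder string and `window_size` default are illustrative, not from any particular agent framework):

```python
# Hypothetical sketch of observation masking: tool observations that fall
# outside the rolling window are replaced by a placeholder, while the
# agent's reasoning and action messages are kept intact.
PLACEHOLDER = "[observation omitted to save context]"

def mask_old_observations(messages, window_size=3):
    # Positions of observation (tool-output) messages, oldest first
    obs_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    # Everything before the last `window_size` observations gets masked
    to_mask = set(obs_indices[:-window_size]) if window_size else set(obs_indices)
    return [
        {**m, "content": PLACEHOLDER} if i in to_mask else m
        for i, m in enumerate(messages)
    ]
```

Because masking is a pure list transformation with no LLM call, it adds essentially no latency, which is why it works well as a first-pass strategy.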
The following example demonstrates a sliding window that keeps recent messages and summarizes older ones to stay within token limits:
```python
# Sliding window with LLM summarization for context management
from openai import OpenAI

client = OpenAI()

def count_tokens_approx(messages):
    # Rough heuristic: ~4 characters per token
    return sum(len(m["content"]) for m in messages if isinstance(m.get("content"), str)) // 4

def summarize_messages(messages):
    text = "\n".join(
        f'{m["role"]}: {m["content"]}'
        for m in messages
        if isinstance(m.get("content"), str)
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Summarize this conversation concisely:\n{text}"}],
    )
    return resp.choices[0].message.content

def manage_context(messages, system_msg, max_tokens=3000, keep_recent=4):
    if count_tokens_approx(messages) <= max_tokens:
        return messages
    old = messages[:-keep_recent]
    recent = messages[-keep_recent:]
    summary = summarize_messages(old)
    return [system_msg, {"role": "system", "content": f"Prior context summary: {summary}"}] + recent

# Usage in an agent loop
system = {"role": "system", "content": "You are a helpful assistant."}
messages = [system]
for user_input in ["Hello", "Tell me about Python", "Now explain decorators"]:
    messages.append({"role": "user", "content": user_input})
    messages = manage_context(messages, system)
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    messages.append({"role": "assistant", "content": resp.choices[0].message.content})
```
A separate summarizer LLM compresses older interactions – observations, actions, and reasoning – into concise summaries while leaving the most recent turns unaltered. This preserves important context from earlier in the conversation without consuming full token budgets.
The effectiveness of summarization is measured by its compression ratio, defined as:
$$\rho = \frac{|\text{tokens}_{\text{original}}|}{|\text{tokens}_{\text{summary}}|}$$
where $\rho > 1$ indicates compression. Typical LLM summarizers achieve $\rho \in [4, 10]$ for conversational history. The information retention rate measures how much task-relevant content survives compression:
$$\eta = \frac{\text{task-relevant facts in summary}}{\text{task-relevant facts in original}}$$
The goal is to maximize $\rho$ while keeping $\eta$ close to 1. Research comparing observation masking and LLM summarization found that both approaches matched in cost savings and problem-solving ability after proper hyperparameter tuning, though they required different configurations depending on the underlying agent architecture.
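A small worked example of these two metrics (all counts below are invented purely for illustration):

```python
# Illustrative computation of compression ratio (rho) and information
# retention rate (eta); every number here is made up for the example.
original_tokens = 4000      # tokens before summarization
summary_tokens = 500        # tokens after summarization
facts_original = 20         # task-relevant facts in the original
facts_summary = 18          # task-relevant facts surviving in the summary

rho = original_tokens / summary_tokens   # compression ratio
eta = facts_summary / facts_original     # information retention rate

print(rho)  # 8.0 -> within the typical [4, 10] range
print(eta)  # 0.9 -> close to the ideal of 1
```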
The most effective strategy combines observation masking as a first line of defense with selective LLM summarization for information that falls outside the masking window. This leverages masking's speed and efficiency while using summarization to preserve critical context that would otherwise be lost.
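One possible combination, sketched with a stand-in `summarize` callable in place of a real summarizer LLM (the boundary sizes and placeholder text are illustrative assumptions):

```python
# Hypothetical hybrid strategy: mask stale observations inside the recent
# window first, then summarize anything that falls entirely outside it.
def hybrid_manage(messages, summarize, mask_window=3, summary_boundary=10):
    head, tail = messages[:-summary_boundary], messages[-summary_boundary:]
    # First line of defense: mask older tool observations within the tail
    obs = [i for i, m in enumerate(tail) if m["role"] == "tool"]
    for i in obs[:-mask_window]:
        tail[i] = {**tail[i], "content": "[observation omitted]"}
    if not head:
        return tail
    # Second line: compress what the masking window no longer covers
    summary = summarize(head)
    return [{"role": "system", "content": f"Prior context summary: {summary}"}] + tail

# Usage with a stub summarizer (a real agent would call an LLM here)
stub = lambda msgs: f"{len(msgs)} earlier messages elided"
```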
Rather than maintaining all context in the window, RAG systems dynamically retrieve relevant information from external stores when needed.
RAG complements window management by moving context storage from the finite window to an effectively unlimited external store.
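A toy sketch of the retrieval idea; a production system would use embeddings and a vector database, but word-overlap scoring keeps the example self-contained:

```python
# Toy sketch of retrieval-based context management: documents live in an
# external store, and only the top-scoring ones enter the prompt.
# NOTE: word overlap stands in for embedding similarity here.
import re

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def score(query, doc):
    # Number of distinct words the query and document share
    return len(tokens(query) & tokens(doc))

def retrieve(store, query, k=2):
    # Return the k highest-scoring documents for this query
    return sorted(store, key=lambda doc: score(query, doc), reverse=True)[:k]

def build_prompt(store, question, k=2):
    # Only retrieved documents, not the whole store, consume window tokens
    context = "\n".join(retrieve(store, question, k))
    return f"Context:\n{context}\n\nQuestion: {question}"

store = [
    "Observation masking hides old tool outputs behind placeholders.",
    "LLM summarization compresses older turns into short summaries.",
    "RAG retrieves relevant documents from an external store on demand.",
]
```

The key point is that the store can grow without bound while the prompt stays small: the window only ever holds the `k` documents most relevant to the current step.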
Multi-level context structures organize information by importance and recency.
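A hypothetical sketch of such a tiered structure, with a pinned core tier, a bounded working set, and an archive of evicted turns (the tier names and sizes are illustrative assumptions):

```python
# Hypothetical multi-level context: pinned core instructions, a small
# working set of recent turns, and an archive of evicted older turns.
from collections import deque

class TieredContext:
    def __init__(self, working_size=5, archive_size=20):
        self.core = []                              # always included (instructions, goals)
        self.working = deque(maxlen=working_size)   # most recent turns
        self.archive = deque(maxlen=archive_size)   # evicted turns, retrieval candidates

    def add(self, message):
        # New messages enter the working set; overflow falls into the archive
        if len(self.working) == self.working.maxlen:
            self.archive.append(self.working[0])
        self.working.append(message)

    def assemble(self):
        # Prompt = pinned core context + the freshest working-set turns
        return list(self.core) + list(self.working)
```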
The landscape of available context windows has expanded dramatically:
| Model | Context Window | Key Characteristics |
|---|---|---|
| Google Gemini 2.5 Pro | 2M tokens | Largest production window; native multimodal; >99% retrieval accuracy; context caching for cost optimization |
| Anthropic Claude (Opus/Sonnet) | 200K standard; 1M beta | Consistent performance with <5% accuracy degradation across the full window |
| Magic LTM-2-Mini | 100M tokens | 1,000x efficiency over traditional attention; specialized for software development |
| OpenAI GPT-4 Turbo | 128K tokens | Reliable, but shows slowdown and inconsistencies near maximum capacity |
| Meta Llama 3.1 | 128K tokens | Open-source flexibility; variable performance depending on infrastructure |
| Cohere Command-R+ | 128K tokens | Optimized for retrieval tasks with a specialized architecture for context coherence |
Despite these large windows, effective context management remains essential because: (1) token costs scale linearly with context size, (2) models show degraded attention to information in the middle of long contexts (the “lost in the middle” problem), and (3) not all information is equally relevant to the current step.
The Model Context Protocol (MCP) has become the standard for connecting agents to external context sources. Rather than managing all context within a single window, MCP lets agents draw context from different tools and data sources through a standardized protocol. By early 2026, MCP had been adopted by OpenAI, Google, and the Linux Foundation, with 97 million monthly SDK downloads and more than 5,800 available servers providing contextual data to agents.
See tool-using agents and modular architectures for how MCP integrates with agent tool use.
Rather than forcing a single agent to manage an enormous context, multi-agent architectures distribute context across specialized agents, each operating within a manageable window.
This approach has shown practical results: Fountain's multi-agent recruitment system achieved 50% faster screening and 2x candidate conversions by distributing context across specialized agents.
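A hypothetical sketch of this context distribution, with an orchestrator routing tasks to specialists that each keep a small private history (the class and method names are illustrative, not from any cited system):

```python
# Hypothetical sketch of distributing context across specialized agents:
# an orchestrator routes each task to the agent whose small private
# context holds the relevant history, instead of one giant shared window.
class SpecializedAgent:
    def __init__(self, name, max_messages=10):
        self.name = name
        self.history = []              # this agent's private, bounded context
        self.max_messages = max_messages

    def handle(self, task):
        self.history.append({"role": "user", "content": task})
        self.history = self.history[-self.max_messages:]   # keep the window small
        return f"{self.name} handled: {task}"

class Orchestrator:
    def __init__(self, agents):
        self.agents = agents           # e.g. {"screening": ..., "scheduling": ...}

    def route(self, topic, task):
        # Only the relevant specialist's window grows; the rest stay untouched
        return self.agents[topic].handle(task)
```

Because each specialist only ever sees tasks for its own topic, no single context needs to hold the full conversation history of the whole system.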
When designing context management for production agents, combine the techniques above (observation masking, summarization, retrieval, and multi-agent context distribution) according to the agent's architecture and workload.