Short-term memory, often referred to as working memory in the context of AI agents, is a limited-capacity system that holds and manipulates information currently being processed for a task. In language model-based agents, short-term memory is typically implemented through the context window, which retains recent conversation history, retrieved information, and intermediate reasoning steps. This memory type is essential for maintaining coherence during multi-step tasks, but its finite capacity creates fundamental challenges around information prioritization and context management.
Baddeley and Hitch's working memory model (1974, extended by Baddeley in 2000) proposes a multi-component system: a central executive for attentional control, a phonological loop for verbal information, a visuospatial sketchpad for spatial data, and an episodic buffer for integrating information across subsystems. George Miller's classic “magical number seven” (1956) established that human short-term memory holds roughly seven, plus or minus two, chunks of information.
These cognitive constraints directly parallel LLM agent design. The context window functions as the agent's working memory buffer, the attention mechanism serves as the central executive directing focus, and chain-of-thought reasoning acts as an internal rehearsal loop that maintains information in an active state. Research from the University of Chicago (2025) reveals that AI networks exhibit both “active” and “silent” working memory modes, paralleling biological findings about persistent neural activity versus synaptic plasticity-based storage.
Modern LLM agents implement short-term memory through several mechanisms:
Context Windows. The primary working memory of an LLM agent is its context window. As of 2025, context sizes have expanded dramatically: Gemini 1.5 Pro supports 1 million tokens (with 2M in preview), Claude models support 200K tokens (with a 1M-token window available for Claude Sonnet 4), and GPT-4o supports 128K tokens. These expanded windows allow agents to hold substantially more information “in mind” during a single task, but they remain finite and introduce latency and cost at scale.
KV Caches. During autoregressive generation, key-value (KV) caches store the attention computations from all prior tokens, functioning as an optimized short-term memory that avoids recomputing past context. For long contexts (1M+ tokens), KV cache management becomes critical. Techniques include sparse KV compression, quantization (reducing precision of cached values), and eviction strategies like H2O (Heavy Hitter Oracle) that discard low-importance keys while retaining high-attention tokens.
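The eviction idea behind heavy-hitter strategies can be sketched in a few lines: track each cached token's accumulated attention mass and, when the cache exceeds its budget, keep the most recent tokens plus the highest-scoring older ones and drop the rest. This is a toy illustration of the principle, not the H2O algorithm itself; the function name, scores, and window size are invented for the example.

```python
def evict_heavy_hitters(cache, budget, recent_window=4):
    """Toy heavy-hitter eviction: retain recent tokens plus the
    highest-scoring older tokens, dropping low-importance entries.

    cache: list of (token_id, accumulated_attention_score), oldest first.
    """
    if len(cache) <= budget:
        return cache
    recent = cache[-recent_window:]          # always retain the newest tokens
    older = cache[:-recent_window]
    # Among older tokens, keep the heaviest hitters up to the budget
    keep = sorted(older, key=lambda kv: kv[1], reverse=True)[: budget - recent_window]
    keep_ids = {t for t, _ in keep}
    # Preserve original (positional) order for the survivors
    return [kv for kv in older if kv[0] in keep_ids] + recent

# Cache of 8 tokens with accumulated attention mass; budget of 5 entries
cache = [(0, 0.9), (1, 0.1), (2, 0.05), (3, 0.7),
         (4, 0.02), (5, 0.3), (6, 0.2), (7, 0.6)]
pruned = evict_heavy_hitters(cache, budget=5, recent_window=4)
print([t for t, _ in pruned])  # → [0, 4, 5, 6, 7]
```

Token 0 survives as a heavy hitter despite being oldest, while tokens 1-3 are evicted; the last four tokens are kept unconditionally.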
Chain-of-Thought as Working Memory. Chain-of-thought (CoT) prompting (Wei et al., 2022) externalizes the reasoning process into the context window, effectively using generated tokens as a scratchpad. This simulates active working memory manipulation: the agent “thinks out loud,” maintaining intermediate results in the token stream. Extended thinking modes in Claude and reasoning models like o1 use this principle, allocating substantial token budgets for internal deliberation.
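A minimal sketch of the idea: the prompt invites the model to emit its reasoning into the token stream, and only the final answer line is surfaced afterward. The prompt wording and the `Answer:` convention here are invented for illustration.

```python
def build_cot_prompt(question: str) -> str:
    """Wrap a question so the model writes reasoning tokens before answering."""
    return (
        f"Question: {question}\n"
        "Work through the problem step by step, then give the final line as\n"
        "'Answer: <result>'.\n"
    )

def extract_answer(completion: str) -> str:
    """The reasoning tokens acted as a scratchpad; only the answer is kept."""
    for line in reversed(completion.splitlines()):
        if line.startswith("Answer:"):
            return line[len("Answer:"):].strip()
    return completion.strip()

# Simulated model output: intermediate results live in the token stream
completion = "17 x 3 = 51\n51 + 9 = 60\nAnswer: 60"
print(extract_answer(completion))  # → 60
```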
Scratchpads. Some agent frameworks provide explicit scratchpad buffers where the agent can write and read intermediate calculations, plans, or observations. These function as dedicated working memory registers, separate from the conversation history, allowing the agent to organize information without polluting the user-facing output.
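A scratchpad register of this kind can be sketched as a small keyed buffer the agent writes to and reads from, kept apart from the user-facing transcript. The class and method names below are illustrative, not taken from any particular framework.

```python
class Scratchpad:
    """A keyed working-memory buffer, separate from conversation history."""

    def __init__(self):
        self._slots: dict[str, str] = {}

    def write(self, key: str, value: str):
        self._slots[key] = value          # overwrite: a register holds one value

    def read(self, key: str, default: str = "") -> str:
        return self._slots.get(key, default)

    def render(self) -> str:
        """Serialize the scratchpad for inclusion in the agent's next prompt."""
        return "\n".join(f"{k}: {v}" for k, v in self._slots.items())

pad = Scratchpad()
pad.write("plan", "1) fetch prices 2) compare 3) recommend")
pad.write("prices", "AAPL=210, MSFT=430")
pad.write("plan", "1) compare 2) recommend")  # revise the plan in place
print(pad.render())
```

Because the scratchpad is revised in place rather than appended to the transcript, intermediate state stays organized without polluting the conversation.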
As tasks grow complex, agents must manage their limited working memory strategically:
Summarization and Compression. When context fills up, agents can summarize earlier portions of the conversation or retrieved documents, replacing detailed content with compressed representations. This trades detail for capacity, analogous to human chunking.
Sliding Windows. Some systems implement rolling context windows that drop the oldest tokens as new ones arrive, maintaining a fixed-size recent history. This works well for conversational agents but can lose critical early context.
Hierarchical Context. Advanced agents use hierarchical memory to offload less-critical information to long-term memory (vector stores, databases) while keeping only the most relevant items in the active context window, retrieving as needed.
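One way to sketch the two-tier idea: keep a small active window and spill everything else to a searchable long-term store, pulling items back on demand. The keyword-overlap retrieval below is a stand-in for real embedding similarity, and all names are invented for the example.

```python
class HierarchicalMemory:
    """Two-tier memory: a small active window plus an offloaded long-term store."""

    def __init__(self, active_limit: int = 3):
        self.active_limit = active_limit
        self.active: list[str] = []     # what stays in the context window
        self.long_term: list[str] = []  # offloaded items (vector-store stand-in)

    def add(self, item: str):
        self.active.append(item)
        while len(self.active) > self.active_limit:
            self.long_term.append(self.active.pop(0))   # offload the oldest item

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        """Keyword-overlap scoring as a placeholder for embedding similarity."""
        q = set(query.lower().split())
        scored = sorted(
            self.long_term,
            key=lambda item: len(q & set(item.lower().split())),
            reverse=True,
        )
        return scored[:k]

mem = HierarchicalMemory(active_limit=2)
for note in ["user prefers metric units", "project deadline is Friday",
             "API key stored in vault", "user is based in Berlin"]:
    mem.add(note)
print("active:", mem.active)
print("recalled:", mem.retrieve("what units does the user prefer", k=1))
```

Only the two newest notes remain in the active window, yet the earlier preference is still recoverable when a relevant query arrives.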
In-Context Learning (ICL). Few-shot examples placed in the context window allow agents to adapt to new tasks without parameter updates. With 1M+ token windows, agents can include extensive examples, documentation, or even entire codebases as working context, effectively using ICL as a flexible form of task adaptation.
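Few-shot adaptation reduces to prompt assembly: demonstrations are placed ahead of the new input, bounded by the available token budget. This sketch uses a rough 4-characters-per-token estimate and an invented `Input:`/`Output:` format; it drops the oldest examples first if the demonstrations would overflow the budget.

```python
def build_few_shot_prompt(instructions, examples, new_input, max_tokens=4096):
    """Assemble an in-context-learning prompt within a token budget.

    examples: list of (input, output) demonstration pairs, oldest first.
    """
    def est(s):
        return len(s) // 4 + 1          # rough ~4 chars/token heuristic

    budget = max_tokens - est(instructions) - est(new_input)
    kept = []
    # Prefer the most recent examples; keep as many as fit
    for ex_in, ex_out in reversed(examples):
        demo = f"Input: {ex_in}\nOutput: {ex_out}\n"
        if est(demo) > budget:
            break
        kept.insert(0, demo)            # restore original ordering
        budget -= est(demo)
    return instructions + "\n\n" + "".join(kept) + f"Input: {new_input}\nOutput:"

examples = [("cat", "CAT"), ("dog", "DOG"), ("mouse", "MOUSE")]
prompt = build_few_shot_prompt("Uppercase the input.", examples, "owl", max_tokens=60)
print(prompt)
```

With a generous budget all three demonstrations fit; shrink `max_tokens` and the builder silently degrades to a zero-shot prompt.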
The sliding-window and summarization strategies described above can be combined in a compact, self-contained sketch:

```python
from dataclasses import dataclass

@dataclass
class Message:
    role: str
    content: str
    token_estimate: int = 0

    def __post_init__(self):
        # Rough estimate: ~4 chars per token
        self.token_estimate = len(self.content) // 4 + 1

class SlidingWindowContext:
    """Manages conversation context within a token budget using a sliding window."""

    def __init__(self, max_tokens: int = 4096, system_prompt: str = ""):
        self.max_tokens = max_tokens
        self.system_message = Message(role="system", content=system_prompt)
        self.messages: list[Message] = []
        self.summary: str = ""

    def add(self, role: str, content: str):
        self.messages.append(Message(role=role, content=content))
        self._trim_to_budget()

    def _trim_to_budget(self):
        """Remove oldest messages (keeping system prompt) to stay within token limit."""
        while self._total_tokens() > self.max_tokens and len(self.messages) > 2:
            removed = self.messages.pop(0)
            # Accumulate a running summary of evicted messages
            self.summary += f"[{removed.role}: {removed.content[:80]}...]\n"

    def _total_tokens(self) -> int:
        total = self.system_message.token_estimate
        if self.summary:
            total += len(self.summary) // 4 + 1
        total += sum(m.token_estimate for m in self.messages)
        return total

    def get_context(self) -> list[dict]:
        """Build the context window for the LLM call."""
        context = [{"role": "system", "content": self.system_message.content}]
        if self.summary:
            context.append({"role": "system",
                            "content": f"Earlier context: {self.summary}"})
        context.extend({"role": m.role, "content": m.content} for m in self.messages)
        return context

ctx = SlidingWindowContext(max_tokens=100, system_prompt="You are a helpful assistant.")
for i in range(10):
    ctx.add("user", f"Message number {i}: tell me about topic {i} in detail please.")
    ctx.add("assistant", f"Here is information about topic {i} with extended details.")

print(f"Messages in window: {len(ctx.messages)}")
print(f"Evicted summary length: {len(ctx.summary)} chars")
for msg in ctx.get_context():
    print(f"  [{msg['role']}] {msg['content'][:80]}")
```
Despite dramatic increases in context window size, fundamental limitations remain:
Quadratic Attention Cost. Standard transformer attention scales as O(n^2) in sequence length, making very long contexts computationally expensive. Architectures like Mamba (Gu and Dao, 2023) and RWKV explore linear-time alternatives, though they may trade off in-context learning capability.
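The scaling difference is easy to see with back-of-the-envelope arithmetic: growing the context from 128K to 1M tokens multiplies the number of query-key pairs in causal attention by roughly 61x, while a linear-time model's cost grows under 8x. These are attention-pair counts, not full FLOP estimates.

```python
def attention_pairs(n):
    """Query-key pairs in causal self-attention over n tokens."""
    return n * (n + 1) // 2   # each token attends to itself and all predecessors

short, long = 128_000, 1_000_000
quadratic_ratio = attention_pairs(long) / attention_pairs(short)
linear_ratio = long / short
print(f"quadratic cost grows ~{quadratic_ratio:.0f}x, linear cost ~{linear_ratio:.1f}x")
```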
Lost-in-the-Middle. Liu et al. (2023) demonstrated that LLMs struggle to attend to information in the middle of long contexts, performing best with information at the beginning or end. This “lost in the middle” phenomenon means that larger windows do not guarantee better utilization.
Effective Context vs. Nominal Context. Having a 1M token context window does not mean all 1M tokens are equally useful. Research on effective context length shows that retrieval accuracy degrades well before the nominal limit. Techniques like retrieval-augmented generation (RAG) help by placing the most relevant information at optimal positions in the context.
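A simple mitigation sketch, given the lost-in-the-middle finding: after retrieval, reorder passages so the highest-scoring ones sit at the start and end of the context, pushing the weakest toward the middle. The scoring inputs here are made up for illustration.

```python
def edge_ordering(passages):
    """Place the highest-scoring passages at the edges of the context.

    passages: list of (text, relevance_score). Rank 1 goes first, rank 2
    last, rank 3 second, rank 4 second-to-last, and so on, so weak
    passages land in the poorly attended middle.
    """
    ranked = sorted(passages, key=lambda p: p[1], reverse=True)
    front, back = [], []
    for i, (text, _) in enumerate(ranked):
        (front if i % 2 == 0 else back).append(text)
    return front + back[::-1]

docs = [("weak", 0.1), ("best", 0.9), ("ok", 0.5), ("good", 0.7)]
print(edge_ordering(docs))  # → ['best', 'ok', 'weak', 'good']
```

The two strongest passages end up at the first and last positions, where retrieval accuracy is empirically highest.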
Benchmarks like RULER, LOCOMO, and LongBench (2024-2025) specifically test agents' ability to recall and reason over long contexts, driving improvements in both architecture and memory management strategies.