Hierarchical memory and context management refers to the multi-layered systems that AI agents use to store, organize, and retrieve information across different timescales and levels of abstraction. Drawing inspiration from human cognitive architecture, these systems typically distinguish between working memory for immediate context, episodic memory for past interactions, semantic memory for factual knowledge, and procedural memory for learned skills. Effective memory management is critical for agents that must maintain coherence across long conversations, learn from experience, and operate within the finite context windows of current language models.
Hierarchical architectures divide agent memory into tiers based on abstraction level, access latency, and retention duration:
Tier 1: Sensory/Reactive Memory. The lowest layer captures raw input and handles reflexive responses. In the Hierarchical Cognitive Agent architecture (2025), this is the reactive layer performing sensor-to-actuator processing. Information here is transient, lasting only milliseconds to seconds. See Sensory Memory.
Tier 2: Working Memory (Core Context). The LLM's context window holds actively manipulated information: current instructions, recent conversation turns, retrieved facts, and reasoning traces. This is the agent's “RAM,” always available to the model during generation. Capacity ranges from 128K tokens (GPT-4o) to 1M+ tokens (Gemini, Claude). See Short-Term Memory.
Tier 3: Recall/Episodic Buffer. Searchable conversation history and recent interaction logs, functioning as a “disk cache.” The agent can query this tier for recent events not currently in the context window. Letta implements this as recall memory with full-text and semantic search over past messages.
Tier 4: Long-Term/Archival Storage. Persistent storage in vector databases, knowledge graphs, or structured stores. This tier has effectively unlimited capacity but higher retrieval latency. It stores consolidated facts, historical interactions, and domain knowledge. See Long-Term Memory.
Information flows between tiers through promotion (important working memory items are archived), eviction (less relevant items are removed from active context), and retrieval (archived information is loaded back into working memory on demand).
MemGPT (Packer et al., 2023) introduced the OS-inspired approach to agent memory. The LLM acts as a processor operating on a two-tier virtual memory: core memory (always in context) and archival memory (external storage). The agent uses function calls to read, write, search, and page memory in and out of context, managing its own memory like an operating system manages virtual memory. This was the first system to give agents autonomous control over their memory management.
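The OS analogy can be made concrete with a toy runtime: the sketch below is illustrative, not the actual MemGPT implementation. The tool names mirror MemGPT's published tool set (core_memory_append, core_memory_replace, archival_memory_insert, archival_memory_search), but the `OSMemory` class, the dispatcher, and the substring search are stand-ins invented for this example.

```python
class OSMemory:
    """Toy two-tier store in the spirit of MemGPT's virtual memory."""

    def __init__(self):
        self.core = {"persona": "helpful assistant", "human": ""}  # always in context
        self.archival: list[str] = []  # external storage, paged in on demand

    # Tools the model can call; names mirror MemGPT's tool set.
    def core_memory_replace(self, block: str, old: str, new: str):
        self.core[block] = self.core[block].replace(old, new)

    def core_memory_append(self, block: str, text: str):
        self.core[block] = (self.core[block] + " " + text).strip()

    def archival_memory_insert(self, text: str):
        self.archival.append(text)

    def archival_memory_search(self, query: str) -> list[str]:
        # Real systems use embedding search; substring match keeps this self-contained.
        return [m for m in self.archival if query.lower() in m.lower()]


def dispatch(mem: OSMemory, call: dict):
    """Route a model-emitted function call to the memory tool it names."""
    return getattr(mem, call["name"])(**call["args"])


mem = OSMemory()
dispatch(mem, {"name": "core_memory_append",
               "args": {"block": "human", "text": "User's name is Alice."}})
dispatch(mem, {"name": "archival_memory_insert",
               "args": {"text": "2023-10: Alice asked about vector databases."}})
print(mem.core["human"])
print(mem.archival_memory_search("vector"))
```

The key design point is that the model never touches storage directly: it only emits tool calls, and the runtime performs the reads and writes, exactly as a process requests pages from an OS rather than addressing the disk itself.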
Letta (evolved from MemGPT) implements a production-ready three-tier hierarchy: core memory (persistent key-value blocks always in context), recall memory (searchable conversation history), and archival memory (long-term vector storage). Agents self-edit core memory blocks, search recall for recent history, and archive/retrieve from long-term storage. Letta adds cloud sync for cross-session persistence and multi-agent memory sharing.
H-MEM (Sun et al., 2025) introduces a four-level semantic hierarchy: Domain, Category, Trace, and Episode. Each level represents a different abstraction: episodes are raw interactions, traces are sequential event summaries, categories group related traces, and domains represent high-level topics. Pointer-based routing enables sublinear scaling to millions of memories, making H-MEM suitable for production agents with extensive histories.
G-Memory (Zhang et al., 2025) provides a three-layer graph architecture for multi-agent systems: interaction graphs (raw agent exchanges), query graphs (structured information needs), and insight graphs (generalized knowledge). This enables collaborative memory across agent teams, where insights discovered by one agent are accessible to others.
SHIMI (Helmi, April 2025) uses semantic tree traversal from abstract concepts down to specific entities, providing interpretable hierarchical retrieval. The tree structure allows agents to navigate from broad topics to precise facts, supporting both exploratory and targeted memory access.
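Top-down descent of this kind can be sketched without a real embedding model. The `Node` and `traverse` names below are illustrative (not SHIMI's API), and word overlap stands in for semantic similarity; at each level the agent follows the child whose label best matches the query until it reaches leaf-level facts.

```python
def overlap(a: str, b: str) -> int:
    """Crude similarity stand-in: count of shared lowercase words."""
    return len(set(a.lower().split()) & set(b.lower().split()))


class Node:
    def __init__(self, label: str, children=None, facts=None):
        self.label = label
        self.children = children or []  # more specific concepts
        self.facts = facts or []        # leaf-level stored memories


def traverse(root: Node, query: str) -> list[str]:
    """Descend from abstract concepts to a leaf, following best-matching children."""
    node = root
    while node.children:
        node = max(node.children, key=lambda c: overlap(c.label, query))
    return node.facts


tree = Node("root", children=[
    Node("programming languages", children=[
        Node("python language", facts=["Python uses indentation for blocks."]),
        Node("rust language", facts=["Rust enforces memory safety at compile time."]),
    ]),
    Node("user preferences", children=[
        Node("ui preferences", facts=["User prefers dark mode."]),
    ]),
])

print(traverse(tree, "python language"))
```

Targeted access follows a single branch as above; exploratory access would instead enumerate a node's children to survey what the agent knows under a broad topic.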
Consolidation compresses detailed lower-tier memories into higher-level abstractions. Episodic interactions are summarized into semantic facts; frequently accessed patterns are promoted to core memory. H-MEM uses entropy-regularized gating to determine which episodes warrant trace creation, while MemGPT agents use explicit tool calls to update their core memory blocks with consolidated insights.
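A minimal sketch of threshold-triggered consolidation, assuming a stub summarizer in place of an LLM call. The `Consolidator` class and its threshold rule are hypothetical, not any framework's API; a production agent would generate the semantic fact with a summarization prompt.

```python
from collections import defaultdict


class Consolidator:
    """Compress raw episodes into one semantic fact once enough accumulate."""

    def __init__(self, threshold: int = 3):
        self.episodes = defaultdict(list)   # topic -> raw interactions
        self.semantic: dict[str, str] = {}  # topic -> consolidated fact
        self.threshold = threshold

    def record(self, topic: str, episode: str):
        self.episodes[topic].append(episode)
        if len(self.episodes[topic]) >= self.threshold:
            self._consolidate(topic)

    def _consolidate(self, topic: str):
        raw = self.episodes.pop(topic)  # episodic detail is dropped from this tier
        # Stand-in for an LLM summarization call over the raw episodes.
        self.semantic[topic] = f"{topic}: consolidated from {len(raw)} episodes"


c = Consolidator(threshold=3)
for msg in ["asked about dark mode", "enabled dark theme", "praised dark UI"]:
    c.record("ui_preference", msg)
print(c.semantic["ui_preference"])
```

Note the trade-off the code makes explicit: consolidation frees lower-tier capacity by discarding episodic detail, so anything the summary omits is lost unless the raw episodes were first copied to archival storage.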
Eviction removes information from active tiers when capacity limits are reached. Strategies include: recency-based eviction (remove oldest items), importance-weighted eviction (remove least-attended items using attention scores), reinforcement-learned retention (train a policy to decide what to keep), and hard/soft thresholds that trigger eviction at capacity boundaries.
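Combining two of the listed strategies, a recency-decayed importance score picks the victim when capacity is reached. The scoring formula and `half_life` parameter below are illustrative choices, not a published algorithm; a real system might use attention weights or a learned policy instead of a fixed importance field.

```python
import math
import time


def evict_one(entries: dict[str, dict], half_life: float = 3600.0) -> str:
    """Remove and return the key with the lowest importance * recency score."""
    now = time.time()

    def score(e: dict) -> float:
        age = now - e["last_access"]
        # Exponential recency decay: importance halves every `half_life` seconds.
        return e["importance"] * math.exp(-age * math.log(2) / half_life)

    victim = min(entries, key=lambda k: score(entries[k]))
    del entries[victim]
    return victim


now = time.time()
tier = {
    "greeting":   {"importance": 0.2, "last_access": now - 7200},  # old and trivial
    "user_goal":  {"importance": 0.9, "last_access": now - 7200},  # old but vital
    "last_reply": {"importance": 0.3, "last_access": now - 10},    # fresh
}
victim = evict_one(tier)
print(victim)  # the old, low-importance entry goes first
```

Pure recency-based eviction would have removed `user_goal` just as readily as `greeting`; weighting by importance is what keeps old-but-vital items resident.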
Promotion elevates important information to higher tiers or more persistent storage. In Letta, an agent might promote a user preference from recall memory to a core memory block, ensuring it is always available without retrieval. H-MEM uses pointer encoding and semantic alignment to promote traces to categories.
Retrieval Fusion combines results from multiple memory tiers when answering a query. An agent might check core memory for known facts, search recall for recent context, and query archival storage for historical knowledge, then fuse the results in the context window for coherent reasoning.
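A hedged sketch of tier-weighted fusion: the `fuse` function and the tier weights are assumptions for illustration, and substring matching stands in for semantic relevance scoring. Hits from more trusted tiers rank first, and duplicates keep the most trusted copy.

```python
def fuse(query: str, tiers: dict[str, list[str]],
         weights: dict[str, float]) -> list[str]:
    """Query every tier, then merge hits into one ranked context bundle."""
    scored = []
    for name, items in tiers.items():
        for text in items:
            if query.lower() in text.lower():
                scored.append((weights[name], text))
    # Highest-weight tier first; dedupe keeps the most trusted copy.
    seen, fused = set(), []
    for _, text in sorted(scored, key=lambda p: -p[0]):
        if text not in seen:
            seen.add(text)
            fused.append(text)
    return fused


tiers = {
    "core":     ["User name: Alice", "Alice prefers concise answers"],
    "recall":   ["Yesterday Alice asked about Rust lifetimes"],
    "archival": ["2024-01: Alice migrated a service to Rust"],
}
weights = {"core": 3.0, "recall": 2.0, "archival": 1.0}
context = fuse("alice", tiers, weights)
print(context)
```

The fused list is what gets placed into the context window; ordering by tier weight means that if the budget runs out, archival hits are dropped before core facts.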
Within the working memory tier, agents use several strategies to maximize the utility of limited context:
Structured Memory Blocks. Letta uses named memory blocks (human, persona, system) in core memory that occupy fixed positions in the context, providing consistent, always-available information without retrieval overhead.
Dynamic Retrieval. Rather than maintaining all knowledge in context, agents retrieve relevant memories on demand using RAG techniques, keeping the context window lean and focused.
Priority-Based Allocation. Critical information (system instructions, user preferences, current task state) gets priority placement in context, while lower-priority items are offloaded to recall or archival tiers.
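Priority-based allocation reduces to packing items into a fixed token budget, highest priority first. The `allocate` helper below is hypothetical, and word counts approximate token counts for the sake of a self-contained example; a real agent would use the model's tokenizer.

```python
def allocate(items: list[dict], budget: int) -> tuple[list[str], list[str]]:
    """Pack items into the budget by descending priority; offload the rest."""
    in_context, offloaded, used = [], [], 0
    for item in sorted(items, key=lambda i: -i["priority"]):
        cost = len(item["text"].split())  # crude stand-in for a tokenizer
        if used + cost <= budget:
            in_context.append(item["text"])
            used += cost
        else:
            offloaded.append(item["text"])  # destined for recall/archival tiers
    return in_context, offloaded


items = [
    {"priority": 3, "text": "System: you are a coding assistant"},
    {"priority": 2, "text": "User prefers Python and type hints"},
    {"priority": 1, "text": "Older discussion about CI pipelines and deployment"},
]
kept, evicted = allocate(items, budget=13)
print(kept)     # system prompt and preferences fit
print(evicted)  # low-priority history is offloaded
```

Offloaded items are not lost: they move to the recall or archival tiers, from which dynamic retrieval can bring them back when a query makes them relevant again.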
For agents to truly learn over time, hierarchical memory must persist across sessions and accumulate knowledge:
Session-Spanning State. Frameworks like Letta, Zep, and Mem0 maintain agent state between interactions, so an agent remembers prior conversations without re-prompting. Google's Memory Bank (2025) automates this for the Agent Development Kit.
Knowledge Evolution. As agents accumulate experience, their memory structures evolve. Zep's Graphiti tracks entity and fact changes over time via temporal knowledge graphs. Mem0 updates atomic facts when new information contradicts or supplements existing knowledge.
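The supersede-or-supplement update rule can be sketched with an atomic fact store. The `FactStore` class below is illustrative, not Mem0's or Zep's actual API: a new observation either adds a fact or replaces a contradicted value, while the superseded value is retained as history (temporal knowledge graphs such as Zep's Graphiti track comparable validity intervals per edge).

```python
class FactStore:
    """Atomic facts keyed by (entity, attribute), with supersession history."""

    def __init__(self):
        self.facts: dict[tuple, str] = {}  # (entity, attribute) -> current value
        self.history: list[tuple] = []     # (entity, attribute, old, new)

    def observe(self, entity: str, attribute: str, value: str):
        key = (entity, attribute)
        old = self.facts.get(key)
        if old is not None and old != value:
            # Contradiction: the new observation supersedes the old value,
            # but the old value survives in the history log.
            self.history.append((entity, attribute, old, value))
        self.facts[key] = value


store = FactStore()
store.observe("alice", "editor", "vim")
store.observe("alice", "editor", "vscode")    # preference changed: supersede
store.observe("alice", "language", "python")  # new fact: supplement
print(store.facts[("alice", "editor")])       # current value only
print(store.history)                          # superseded values remain auditable
```

Keeping the history rather than overwriting in place is what lets an agent answer both "what does Alice use now?" and "what did Alice use before?".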
Multi-Agent Memory. G-Memory and shared archival stores in Letta enable teams of agents to share knowledge. One agent's discoveries become available to others, supporting collaborative problem-solving and reducing redundant work.
```python
import time
from collections import OrderedDict


class TieredMemory:
    """Three-tier memory system with automatic promotion based on access frequency."""

    def __init__(self, working_capacity: int = 5, recall_capacity: int = 20):
        self.working: OrderedDict[str, dict] = OrderedDict()  # Tier 1: always in context
        self.recall: OrderedDict[str, dict] = OrderedDict()   # Tier 2: searchable buffer
        self.archival: dict[str, dict] = {}                   # Tier 3: long-term store
        self.working_capacity = working_capacity
        self.recall_capacity = recall_capacity

    def store(self, key: str, value: str, tier: str = "recall") -> None:
        """Store a memory at the specified tier."""
        entry = {"value": value, "access_count": 0, "created": time.time()}
        if tier == "working":
            self._add_to_working(key, entry)
        elif tier == "recall":
            self._add_to_recall(key, entry)
        else:
            self.archival[key] = entry

    def _add_to_working(self, key: str, entry: dict) -> None:
        """Add to working memory, demoting the oldest entry to recall at capacity."""
        if len(self.working) >= self.working_capacity:
            evicted_key, evicted = self.working.popitem(last=False)
            self._add_to_recall(evicted_key, evicted)  # demote to recall
        self.working[key] = entry

    def _add_to_recall(self, key: str, entry: dict) -> None:
        """Add to recall, demoting the oldest entry to archival at capacity."""
        if len(self.recall) >= self.recall_capacity:
            evicted_key, evicted = self.recall.popitem(last=False)
            self.archival[evicted_key] = evicted  # demote to archival
        self.recall[key] = entry

    def retrieve(self, key: str) -> str | None:
        """Retrieve a memory, searching tiers top-down; promote on frequent access."""
        for tier_name, tier in [("working", self.working),
                                ("recall", self.recall),
                                ("archival", self.archival)]:
            if key in tier:
                entry = tier[key]
                entry["access_count"] += 1
                if entry["access_count"] >= 3 and tier_name != "working":
                    self._promote(key, tier_name)
                return entry["value"]
        return None

    def _promote(self, key: str, from_tier: str) -> None:
        """Promote a frequently accessed memory one tier upward."""
        if from_tier == "archival":
            self._add_to_recall(key, self.archival.pop(key))
        elif from_tier == "recall":
            self._add_to_working(key, self.recall.pop(key))

    def status(self) -> dict:
        return {t: list(s.keys()) for t, s in [("working", self.working),
                                               ("recall", self.recall),
                                               ("archival", self.archival)]}


mem = TieredMemory(working_capacity=3)
mem.store("user_name", "Alice", tier="working")
mem.store("preference", "dark mode", tier="recall")
mem.store("old_session", "discussed ML pipelines", tier="archival")

# Frequent access triggers promotion from archival -> recall -> working
for _ in range(3):
    mem.retrieve("old_session")
print("After promoting old_session:", mem.status())

for _ in range(3):
    mem.retrieve("old_session")
print("After second promotion:", mem.status())
```