Cognitive Memory Architectures

CogMem is a cognitively inspired, memory-augmented LLM architecture that supports sustained iterative reasoning through structured, persistent memory. Introduced by Zhang et al. (2025), CogMem addresses the fundamental limitation that LLMs excel at single-turn reasoning but lose accuracy and coherence over extended multi-turn interactions due to reasoning bias, task drift, hallucination, overconfidence, and memory decay.

```mermaid
graph TD
    A[User Input] --> B[Focus of Attention]
    B --> C[Direct Access Memory]
    C --> D[Long-Term Memory]
    D -.->|Retrieve patterns| C
    C -.->|Retrieve notes| B
    B --> E[LLM Reasoning]
    E --> F[Response]
    E -->|Store insights| C
    C -->|Consolidate| D
```

The Multi-Turn Reasoning Problem

Real-world applications demand iterative, multi-turn reasoning in which the model must maintain coherence across dozens or hundreds of exchanges. Current approaches typically append the full conversational history to the context, so the prompt grows without bound while reasoning quality degrades.

TurnBench evaluations highlight six recurring failure modes: reasoning bias, task drift, hallucination, overconfidence, memory decay, and error propagation.

Three-Layer Memory Architecture

CogMem implements a hierarchical memory system inspired by Baddeley's model of human working memory:

Layer 1: Long-Term Memory (LTM). Consolidates cross-session reasoning strategies, reusable problem-solving patterns, and distilled insights accumulated across multiple interactions. LTM entries are stored in a vectorized database enabling high-speed semantic retrieval. This layer evolves over time as new information is integrated.

Layer 2: Direct Access Memory (DA). Functions as session-level working memory, maintaining concise notes of intermediate conclusions, sub-goals, and ongoing plans. Rather than storing full conversational histories, DA preserves structured summaries of key information. At session start, DA is populated with relevant memories retrieved from LTM.

Layer 3: Focus of Attention (FoA). Dynamically reconstructs a minimal, task-relevant reasoning context for each turn by selectively integrating current DA notes, retrieved LTM entries, summarized dialogue history, and new user input into a compact prompt.
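The three layers above can be sketched as simple data structures. This is a minimal illustration under stated assumptions, not the paper's implementation: the class and method names (`LongTermMemory.semantic_search`, `DirectAccessMemory.retrieve`) are hypothetical, and the dot-product scoring stands in for a real vectorized database.

```python
# Illustrative sketch of CogMem's LTM and DA layers; names and
# interfaces are assumptions, not the paper's implementation.
from dataclasses import dataclass


@dataclass
class LTMEntry:
    text: str
    embedding: list  # vector used for semantic retrieval


class LongTermMemory:
    """Cross-session store; retrieval by embedding similarity."""

    def __init__(self):
        self.entries = []

    def add(self, text, embedding):
        self.entries.append(LTMEntry(text, embedding))

    def semantic_search(self, query_embedding, top_k=3):
        # Dot product as a stand-in for a vector-database similarity query
        def score(e):
            return sum(a * b for a, b in zip(e.embedding, query_embedding))
        ranked = sorted(self.entries, key=score, reverse=True)
        return [e.text for e in ranked[:top_k]]


class DirectAccessMemory:
    """Session-level working memory holding structured notes,
    not full conversational histories."""

    def __init__(self):
        self.notes = []

    def add_note(self, note):
        self.notes.append(note)

    def retrieve(self, top_k=5):
        return self.notes[-top_k:]  # most recent structured notes
```

At session start, the DA would be seeded by calling `semantic_search` on the LTM and copying the hits into `notes`, matching the population step described above.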

Mapping to Baddeley's Working Memory

CogMem's architecture directly parallels the influential cognitive psychology model:

| CogMem Component | Baddeley's Model | Function |
|---|---|---|
| Focus of Attention | Central Executive | Attentional control, selective focus |
| Direct Access Memory | Phonological Loop / Visuospatial Sketchpad | Short-term session storage |
| Long-Term Memory | Declarative Memory | Persistent knowledge and strategies |

This grounding in cognitive science ensures the architecture captures how humans maintain coherent reasoning across extended interactions while managing limited attentional resources.

Focus of Attention Mechanism

The FoA operates as a dynamic gating function that prevents unbounded context growth. At each turn:

  1. Retrieves task-relevant notes from DA based on current reasoning state
  2. Queries LTM semantically for applicable prior reasoning patterns
  3. Incorporates a summarized (not full) dialogue history
  4. Integrates the new user input
  5. Synthesizes these elements into a compact, interpretable prompt
```python
# Simplified CogMem Focus of Attention reconstruction
def compose_prompt(*sections):
    # Join the retrieved elements into one compact prompt string
    return "\n\n".join(str(s) for s in sections)

class FocusOfAttention:
    def __init__(self, ltm, da_memory):
        self.ltm = ltm        # Long-term vectorized store
        self.da = da_memory   # Session-level working memory

    def reconstruct_context(self, current_input, reasoning_state):
        # Step 1: Retrieve relevant session notes
        da_notes = self.da.retrieve(reasoning_state, top_k=5)

        # Step 2: Query long-term reasoning patterns
        ltm_patterns = self.ltm.semantic_search(current_input, top_k=3)

        # Step 3: Summarize dialogue history
        history_summary = self.da.get_compressed_history()

        # Step 4: Synthesize compact context
        return compose_prompt(
            ltm_patterns,     # Cross-session strategies
            da_notes,         # Current session state
            history_summary,  # Compressed history
            current_input     # New user turn
        )
```

Addressing Reasoning Failure Modes

CogMem's layered design targets each failure mode through specific mechanisms:

Reasoning Bias and Task Drift. LTM accumulates cross-session reasoning strategies, allowing the model to recognize and correct systematic biases by referencing prior successful patterns. DA's structured notes keep the current task goal explicitly represented.

Hallucination and Overconfidence. DA creates explicit checkpoints where conclusions must be justified. FoA enforces grounding in retrieved memories rather than relying solely on parametric knowledge.

Memory Decay. Prevented through three mechanisms: DA's session-level preservation of intermediate conclusions, LTM's persistent storage of distilled reasoning traces, and FoA's selective retrieval ensuring nothing is lost through context truncation.

Error Propagation. A dual-agent design, in which a reasoning agent and a memory agent collaborate continuously, enables error recovery by restoring coherence across turns rather than propagating early mistakes.
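A rough sketch of how such a dual-agent loop might recover from an error follows. The agent split, the method names, and the rollback rule (revert to the last justified checkpoint when an unjustified contradiction appears) are illustrative assumptions, not the paper's design:

```python
# Hypothetical dual-agent error-recovery loop; the checkpoint and
# rollback policy shown here are assumptions for illustration.
class MemoryAgent:
    def __init__(self):
        self.notes = []

    def checkpoint(self, conclusion, justification):
        # Store only justified conclusions, giving later turns a
        # trusted record to fall back on when errors are detected
        self.notes.append({"conclusion": conclusion, "why": justification})

    def last_consistent_state(self):
        return self.notes[-1] if self.notes else None


def reasoning_turn(user_input, memory_agent, llm_answer):
    """One turn: accept a justified answer, or roll back an
    unjustified contradiction to the last consistent checkpoint."""
    state = memory_agent.last_consistent_state()
    contradicts = state and llm_answer["conclusion"] != state["conclusion"]
    if contradicts and not llm_answer.get("why"):
        # Restore coherence from memory instead of propagating the error
        return state
    memory_agent.checkpoint(llm_answer["conclusion"], llm_answer.get("why", ""))
    return llm_answer
```

The key point matches the prose above: the memory agent's record, not the latest model output, is the source of truth across turns.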

Mathematical Formulation

The context at turn $t$ is bounded rather than growing linearly:

$$C_t = \text{FoA}(\text{DA}_t, \text{LTM}, h_t, x_t) \quad \text{where} \quad |C_t| \leq K \quad \forall t$$

compared to the standard approach, where $|C_t| = \sum_{i=1}^{t} \left( |x_i| + |r_i| \right)$ grows without bound.
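The growth difference can be illustrated numerically. The per-exchange token count and the budget $K$ below are made-up values, chosen only to show the linear-versus-bounded contrast:

```python
# Illustrative context-size comparison; token counts are invented
# and budget_K plays the role of K in the bound |C_t| <= K.
def full_history_context(turns, tokens_per_exchange=200):
    # |C_t| = sum of all prior inputs and responses: linear growth
    return [t * tokens_per_exchange for t in range(1, turns + 1)]

def foa_context(turns, budget_K=600):
    # FoA re-synthesizes a compact prompt, so |C_t| <= K for every t
    return [min(t * 200, budget_K) for t in range(1, turns + 1)]
```

After 10 turns, `full_history_context(10)` ends at 2000 tokens and keeps growing, while `foa_context(10)` plateaus at the 600-token budget.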

TurnBench Evaluation Results

Experiments were conducted on TurnBench-MS using Gemini 2.5 Flash as the reasoning agent.

The results demonstrate that structured memory improves reasoning accuracy while keeping computation bounded: a rare case where better performance comes with lower cost.

Significance

CogMem transforms LLM behavior from reactive, context-dependent reasoning into adaptive, self-consistent systems capable of sustained multi-turn inference. By drawing on decades of cognitive psychology research, it provides a principled architecture for managing the fundamental tension between comprehensive memory and bounded computation.

The three-layer design offers a general template for any system that must reason coherently over extended interactions – from customer support to scientific research assistants to autonomous coding agents.
