====== Cognitive Memory Architectures ======

**CogMem** is a cognitively inspired, memory-augmented LLM architecture that supports sustained iterative reasoning through structured, persistent memory. Introduced by Zhang et al. (2025), CogMem addresses the fundamental limitation that LLMs excel at single-turn reasoning but lose accuracy and coherence over extended multi-turn interactions due to reasoning bias, task drift, hallucination, overconfidence, and memory decay.

<code>
graph TD
    A[User Input] --> B[Focus of Attention]
    B --> C[Direct Access Memory]
    C --> D[Long-Term Memory]
    D -.->|Retrieve patterns| C
    C -.->|Retrieve notes| B
    B --> E[LLM Reasoning]
    E --> F[Response]
    E -->|Store insights| C
    C -->|Consolidate| D
</code>

===== The Multi-Turn Reasoning Problem =====

Real-world applications demand iterative, multi-turn reasoning where the model must maintain coherence across dozens or hundreds of exchanges. Current approaches typically append full conversational histories to the context, causing:

  * **Unbounded context growth** -- token count increases linearly with conversation length
  * **Higher computational costs** -- attention scales quadratically with context length
  * **Degraded reasoning efficiency** -- relevant information is buried among irrelevant history
  * **Cascading failures** -- early errors propagate and amplify through subsequent turns

TurnBench evaluations highlight six recurring failure modes: reasoning bias, task drift, hallucination, overconfidence, memory decay, and error propagation.

===== Three-Layer Memory Architecture =====

CogMem implements a hierarchical memory system inspired by Baddeley's model of human working memory:

**Layer 1: Long-Term Memory (LTM).** Consolidates cross-session reasoning strategies, reusable problem-solving patterns, and distilled insights accumulated across multiple interactions. LTM entries are stored in a vectorized database enabling high-speed semantic retrieval. This layer evolves over time as new information is integrated.

**Layer 2: Direct Access Memory (DA).** Functions as session-level working memory, maintaining concise notes of intermediate conclusions, sub-goals, and ongoing plans. Rather than storing full conversational histories, DA preserves structured summaries of key information. At session start, DA is populated with relevant memories retrieved from LTM.

**Layer 3: Focus of Attention (FoA).** Dynamically reconstructs a minimal, task-relevant reasoning context for each turn by selectively integrating current DA notes, retrieved LTM entries, summarized dialogue history, and new user input into a compact prompt.
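The paper describes these layers at the architectural level; the sketch below is one possible rendering of LTM and DA in plain Python, not CogMem's actual implementation. The class names ''LongTermMemory'' and ''DirectAccessMemory'' are hypothetical, and a bag-of-words cosine similarity stands in for a real embedding model and vector database. The method names deliberately match the calls (''semantic_search'', ''retrieve'', ''get_compressed_history'') used in the Focus of Attention code below.

<code python>
# Minimal sketch of CogMem's two storage layers (illustrative, not the paper's code).
# A bag-of-words cosine similarity stands in for an embedding model + vector DB.
import math
from collections import Counter


def _similarity(a: str, b: str) -> float:
    """Cosine similarity over word counts -- a stand-in for semantic embeddings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values())) \
         * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


class LongTermMemory:
    """Cross-session store of distilled reasoning patterns and strategies."""

    def __init__(self):
        self.entries = []

    def add(self, pattern: str) -> None:
        """Consolidate a distilled insight for reuse in later sessions."""
        self.entries.append(pattern)

    def semantic_search(self, query: str, top_k: int = 3) -> list:
        """Return the top_k entries most similar to the query."""
        ranked = sorted(self.entries, key=lambda e: _similarity(query, e), reverse=True)
        return ranked[:top_k]


class DirectAccessMemory:
    """Session-level working memory: structured notes, not full transcripts."""

    def __init__(self, ltm: LongTermMemory, session_goal: str):
        # At session start, DA is seeded with relevant memories retrieved from LTM.
        self.notes = ltm.semantic_search(session_goal)
        self.history_summary = ""

    def store(self, note: str) -> None:
        """Record an intermediate conclusion, sub-goal, or plan update."""
        self.notes.append(note)

    def retrieve(self, reasoning_state: str, top_k: int = 5) -> list:
        """Return the top_k notes most relevant to the current reasoning state."""
        ranked = sorted(self.notes, key=lambda n: _similarity(reasoning_state, n), reverse=True)
        return ranked[:top_k]

    def get_compressed_history(self) -> str:
        """Return the running summary kept in place of the raw dialogue."""
        return self.history_summary
</code>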
===== Mapping to Baddeley's Working Memory =====

CogMem's architecture directly parallels the influential cognitive psychology model:

^ CogMem Component ^ Baddeley's Model ^ Function ^
| Focus of Attention | Central Executive | Attentional control, selective focus |
| Direct Access Memory | Phonological Loop / Visuospatial Sketchpad | Short-term session storage |
| Long-Term Memory | Declarative Memory | Persistent knowledge and strategies |

This grounding in cognitive science keeps the architecture close to how humans maintain coherent reasoning across extended interactions while managing limited attentional resources.

===== Focus of Attention Mechanism =====

The FoA operates as a dynamic gating function that prevents unbounded context growth. At each turn, it:

  - Retrieves task-relevant notes from DA based on the current reasoning state
  - Queries LTM semantically for applicable prior reasoning patterns
  - Incorporates a summarized (not full) dialogue history
  - Integrates the new user input
  - Synthesizes these elements into a compact, interpretable prompt

In simplified form (building on the memory-layer sketch above; ''compose_prompt'' is a minimal placeholder for CogMem's prompt assembly):

<code python>
# Simplified CogMem Focus of Attention reconstruction

def compose_prompt(ltm_patterns, da_notes, history_summary, current_input):
    """Assemble the retrieved pieces into one compact prompt (illustrative)."""
    return "\n\n".join([
        "Known strategies:\n" + "\n".join(ltm_patterns),
        "Session notes:\n" + "\n".join(da_notes),
        "Dialogue so far (summary):\n" + history_summary,
        "User:\n" + current_input,
    ])


class FocusOfAttention:
    def __init__(self, ltm, da_memory):
        self.ltm = ltm       # Long-term vectorized store
        self.da = da_memory  # Session-level working memory

    def reconstruct_context(self, current_input, reasoning_state):
        # Step 1: Retrieve relevant session notes
        da_notes = self.da.retrieve(reasoning_state, top_k=5)
        # Step 2: Query long-term reasoning patterns
        ltm_patterns = self.ltm.semantic_search(current_input, top_k=3)
        # Step 3: Fetch the summarized dialogue history
        history_summary = self.da.get_compressed_history()
        # Step 4: Synthesize a compact context
        return compose_prompt(
            ltm_patterns,     # Cross-session strategies
            da_notes,         # Current session state
            history_summary,  # Compressed history
            current_input,    # New user turn
        )
</code>

===== Addressing Reasoning Failure Modes =====

CogMem's layered design targets each failure mode through specific mechanisms:

**Reasoning Bias and Task Drift.** LTM accumulates cross-session reasoning strategies, allowing the model to recognize and correct systematic biases by referencing prior successful patterns. DA's structured notes keep the current task goal explicitly represented.

**Hallucination and Overconfidence.** DA creates explicit checkpoints where conclusions must be justified, and the FoA enforces grounding in retrieved memories rather than relying solely on parametric knowledge.

**Memory Decay.** Prevented through three mechanisms: DA's session-level preservation of intermediate conclusions, LTM's persistent storage of distilled reasoning traces, and the FoA's selective retrieval, which ensures nothing is lost through context truncation.

**Error Propagation.** A dual-agent design -- a reasoning agent and a memory agent collaborating continuously -- enables error recovery by restoring coherence across turns rather than propagating early mistakes.

===== Mathematical Formulation =====

The context at turn $t$ is bounded rather than growing linearly:

$$C_t = \text{FoA}(\text{DA}_t, \text{LTM}, h_t, x_t) \quad \text{where} \quad |C_t| \leq K \quad \forall t$$

Here $h_t$ is the summarized dialogue history and $x_t$ the new user input at turn $t$. In the standard append-everything approach, by contrast, $|C_t| = \sum_{i=1}^{t} \left( |x_i| + |r_i| \right)$ grows without bound, where $x_i$ and $r_i$ are the input and response at turn $i$.
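Concretely, the bound falls out of the retrieval caps: however many turns have elapsed, the composed prompt draws on at most three LTM patterns, five DA notes, and one running summary. A usage sketch of the classes above, with all inputs invented for illustration:

<code python>
# Illustrative session using the sketches above (all values are made up).
ltm = LongTermMemory()
ltm.add("For deduction puzzles, track confirmed and eliminated candidates separately.")
ltm.add("Re-state the current sub-goal before proposing an answer.")

da = DirectAccessMemory(ltm, session_goal="solve a multi-turn deduction puzzle")
da.store("Turn 3 conclusion: the code contains no repeated digits.")
da.history_summary = "Three turns so far; two candidate hypotheses eliminated."

foa = FocusOfAttention(ltm, da)
prompt = foa.reconstruct_context(
    current_input="Given the verifier feedback, what should the next guess be?",
    reasoning_state="eliminating candidate hypotheses",
)

# The prompt stays compact regardless of how long the dialogue has run,
# unlike a full transcript, whose size grows with every exchange.
print(prompt)
</code>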
===== TurnBench Evaluation Results =====

Experiments on TurnBench-MS using Gemini 2.5 Flash as the reasoning agent:

  * **Baseline (no memory):** limited accuracy, establishing the lower bound
  * **With DA memory:** total accuracy reaches **0.84** in classic mode
  * **With DA + LTM:** total accuracy reaches **0.93** in classic mode, with **perfect scores** on easy and medium difficulty and **0.80** on hard tasks
  * **Token efficiency:** less than half the tokens of the baseline after 15 turns, with the efficiency gap widening as dialogues extend

The results demonstrate that structured memory improves reasoning accuracy while keeping computation bounded -- a rare case where better performance comes at lower cost.

===== Significance =====

CogMem transforms LLMs from reactive, context-dependent reasoners into adaptive, self-consistent systems capable of sustained multi-turn inference. By drawing on decades of cognitive psychology research, it provides a principled architecture for managing the fundamental tension between comprehensive memory and bounded computation.

The three-layer design offers a general template for any system that must reason coherently over extended interactions -- from customer support to scientific research assistants to autonomous coding agents.

===== References =====

  * [[https://arxiv.org/abs/2512.14118|Zhang et al. (2025). CogMem: A Cognitive Memory Architecture for Sustained Multi-Turn Reasoning in Large Language Models. arXiv:2512.14118]]
  * [[https://en.wikipedia.org/wiki/Baddeley%27s_model_of_working_memory|Baddeley, A. (2000). The Episodic Buffer: A New Component of Working Memory.]]

===== See Also =====

  * [[memory_augmented_llms|Memory-Augmented LLMs]]
  * [[multi_turn_reasoning|Multi-Turn Reasoning]]
  * [[context_management|Context Window Management]]
  * [[cognitive_architectures|Cognitive Architectures]]