Conversation History Management

Conversation history management addresses a fundamental challenge in LLM applications: large language models are stateless, so without explicit history management they have no knowledge of previous interactions or of earlier turns in the current conversation. 1) Effective history management requires balancing three competing demands: maintaining conversational context, controlling token costs, and respecting model context window limits. 2)

The Context Window Challenge

LLMs operate within a defined context window — the maximum number of tokens they can process at once. As conversations grow longer, accumulated chat history consumes increasingly large portions of this fixed budget, leaving less room for new user input and limiting the model's ability to generate responses. 3)

Modern context windows vary significantly across models: GPT-4 Turbo supports 128K tokens, Claude supports up to 200K tokens, and Gemini 1.5 Pro extends to 1M tokens. Despite these expansions, sending the complete conversation history with every request remains economically inefficient at scale, since most LLM pricing is token-based and the full history is re-billed on each call.

Sliding Window Approaches

The simplest strategy involves sending only the last N messages to the LLM, assuming recent context contains everything needed for coherent responses. 4)
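A minimal sketch of this strategy, assuming the common chat format of role/content message dicts:

```python
# Sliding-window history: keep only the last N messages.
# A "message" here is a dict with "role" and "content" keys,
# mirroring the chat format used by most LLM APIs.

def sliding_window(history, max_messages=10):
    """Return the most recent max_messages entries of the chat history."""
    return history[-max_messages:]

history = [{"role": "user", "content": f"message {i}"} for i in range(25)]
window = sliding_window(history, max_messages=10)
print(len(window))           # 10
print(window[0]["content"])  # prints "message 15"
```

Everything before the window is simply dropped, which is what makes this approach cheap and also what makes it lossy.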

A more sophisticated variant uses token-count-based truncation rather than message count, calculating total tokens in chat history and truncating when approaching the model's context window limit. 5)
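The token-based variant can be sketched as follows. The estimate_tokens helper is a crude stand-in (roughly four characters per token); a real implementation would use the model's own tokenizer.

```python
# Token-budget truncation: drop the oldest messages until the history
# fits within a budget derived from the model's context window.

def estimate_tokens(text):
    # Rough heuristic: ~4 characters per token. Replace with the
    # model's actual tokenizer in production.
    return max(1, len(text) // 4)

def truncate_to_budget(history, token_budget):
    kept = []
    total = 0
    # Walk newest-to-oldest so the most recent messages survive.
    for msg in reversed(history):
        cost = estimate_tokens(msg["content"])
        if total + cost > token_budget:
            break
        kept.append(msg)
        total += cost
    kept.reverse()
    return kept

history = [{"role": "user", "content": "x" * 40} for _ in range(10)]
recent = truncate_to_budget(history, token_budget=35)
print(len(recent))  # 3 (each message costs ~10 tokens)
```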

While computationally efficient, sliding window approaches suffer from significant context loss when earlier conversation segments contain critical information.

Summarization Strategies

Contextual summarization periodically compresses older conversation segments while keeping recent messages intact. A typical implementation preserves the most recent messages (for example, the last 10) verbatim and summarizes everything older once the history grows past a threshold such as 20 messages. 6)
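A sketch of this pattern, with summarize() as a placeholder assumption; in practice it would be an LLM call that condenses the older transcript:

```python
# Contextual summarization: once history grows past a threshold,
# replace older messages with a single summary message and keep the
# most recent ones verbatim.

def summarize(messages):
    # Placeholder: a real system would ask an LLM for a summary.
    return "Summary of %d earlier messages." % len(messages)

def compress_history(history, keep_recent=10, threshold=20):
    if len(history) <= threshold:
        return history  # still short enough to send as-is
    older, recent = history[:-keep_recent], history[-keep_recent:]
    summary_msg = {"role": "system", "content": summarize(older)}
    return [summary_msg] + recent

history = [{"role": "user", "content": f"m{i}"} for i in range(25)]
compressed = compress_history(history)
print(len(compressed))  # 11: one summary message plus the last 10
```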

Microsoft's approach involves compressing older messages into summaries while maintaining the system message, which is critical for ensuring the LLM responds appropriately. When users ask follow-up questions that depend on earlier context, summarization-based approaches retain necessary context that truncation-based methods lose. 7)

Recursive summarization automatically compresses conversation history every time it exceeds a defined threshold of messages or tokens. While this retains context from the whole conversation, some details inevitably disappear in compression. 8)
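The recursive variant can be sketched as a small memory class, again with summarize() as a placeholder LLM call; the key difference from one-shot compression is that each new summary folds in the previous one:

```python
# Recursive summarization: whenever the message buffer crosses a
# threshold, fold it — together with any prior summary — into a new
# running summary, then clear the buffer.

def summarize(previous_summary, messages):
    # Placeholder: a real system would prompt an LLM to extend
    # previous_summary with the new messages.
    return f"{previous_summary} + {len(messages)} msgs"

class RecursiveSummaryMemory:
    def __init__(self, threshold=20):
        self.threshold = threshold
        self.summary = ""
        self.buffer = []

    def add(self, message):
        self.buffer.append(message)
        if len(self.buffer) >= self.threshold:
            # Compress the buffer into the running summary.
            self.summary = summarize(self.summary, self.buffer)
            self.buffer = []

    def context(self):
        # What would be sent to the LLM: summary plus recent messages.
        return {"summary": self.summary, "recent": list(self.buffer)}

memory = RecursiveSummaryMemory(threshold=20)
for i in range(45):
    memory.add({"role": "user", "content": f"m{i}"})
ctx = memory.context()
print(len(ctx["recent"]))  # 5: two compressions happened at 20 and 40
```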

Hybrid summarize-and-buffer approaches summarize conversation up to the Nth message while passing in the last N messages verbatim, providing both historical context and complete recent context. 9)

Memory Architectures

Modern LLM applications employ hierarchical memory systems:

RAG-Based Memory

Specific context retrieval uses vector databases to store long conversations, then retrieves the most relevant pieces based on information in recent messages. This combines full-history retention with efficient token usage by retrieving only contextually relevant segments. 12)
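A toy sketch of the retrieval idea, under loud assumptions: embed() is a bag-of-words stand-in for a real embedding model, and the in-memory list stands in for a vector database.

```python
import math
from collections import Counter

# RAG-style memory: store each message as a vector and retrieve the
# most relevant past messages for the current query.

def embed(text):
    # Toy embedding: word-count vector. A real system would call an
    # embedding model here.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class VectorMemory:
    def __init__(self):
        self.store = []  # list of (vector, message) pairs

    def add(self, message):
        self.store.append((embed(message), message))

    def retrieve(self, query, k=2):
        query_vec = embed(query)
        ranked = sorted(self.store,
                        key=lambda item: cosine(item[0], query_vec),
                        reverse=True)
        return [msg for _, msg in ranked[:k]]

memory = VectorMemory()
memory.add("My dog is called Rex")
memory.add("The weather is sunny today")
top = memory.retrieve("what is my dog's name", k=1)
```

Only the retrieved segments are placed in the prompt, so token usage stays bounded even though the full history is retained in the store.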

Production implementations track token usage and context window continuously, triggering auto-compaction when usage crosses model thresholds (typically around 90% of the context window). Compaction rewrites history into: initial context + recent user messages (capped) + a handoff summary. 13)
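The compaction flow described above might be sketched as follows; estimate_tokens and make_handoff_summary are placeholder assumptions, and the 90% threshold and recent-message cap are configurable.

```python
# Auto-compaction: monitor token usage and, when it crosses a fraction
# of the context window (here 90%), rewrite the history as
# initial context + a handoff summary + capped recent messages.

def estimate_tokens(messages):
    # Rough heuristic: ~4 characters per token across all messages.
    return sum(len(m["content"]) // 4 for m in messages)

def make_handoff_summary(messages):
    # Placeholder for an LLM-generated summary of the dropped middle.
    return {"role": "system",
            "content": f"Handoff summary of {len(messages)} messages."}

def maybe_compact(history, context_window, threshold=0.9, recent_cap=5):
    if estimate_tokens(history) < context_window * threshold:
        return history  # still under budget; nothing to do
    initial = history[:1]            # e.g. the system message
    middle = history[1:-recent_cap]  # gets summarized away
    recent = history[-recent_cap:]   # kept verbatim
    return initial + [make_handoff_summary(middle)] + recent

history = ([{"role": "system", "content": "s" * 40}]
           + [{"role": "user", "content": "m" * 40} for _ in range(20)])
compacted = maybe_compact(history, context_window=200)
print(len(compacted))  # 7: initial + summary + 5 recent messages
```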

Tools and Frameworks

Production Best Practices

See Also

References