Conversation History Management

Conversation history management addresses a fundamental challenge in LLM applications: large language models are stateless, so without explicit history management they have no knowledge of previous interactions or of earlier turns in the current conversation. 1) Effective history management requires balancing three competing demands: maintaining conversational context, controlling token costs, and respecting model context window limits. 2)

The Context Window Challenge

LLMs operate within a defined context window — the maximum number of tokens they can process at once. As conversations grow longer, accumulated chat history consumes increasingly large portions of this fixed budget, leaving less room for new user input and limiting the model's ability to generate responses. 3)

Modern context windows vary significantly across models: GPT-4 Turbo supports 128K tokens, Claude supports up to 200K tokens, and Gemini 1.5 Pro extends to 1M tokens. Despite these expansions, sending the complete conversation history with every request remains economically inefficient at scale, since most LLM pricing is token-based and the full history is re-billed on each call.

Sliding Window Approaches

The simplest strategy involves sending only the last N messages to the LLM, assuming recent context contains everything needed for coherent responses. 4)
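A minimal sketch of this strategy, assuming the common chat format of role/content message dicts:

```python
# Sliding-window history: keep only the last N messages.
# A "message" here is a dict with "role" and "content" keys,
# mirroring the chat format used by most LLM APIs.

def sliding_window(history, max_messages=10):
    """Return the most recent max_messages entries of the chat history."""
    return history[-max_messages:]

history = [{"role": "user", "content": f"message {i}"} for i in range(25)]
window = sliding_window(history, max_messages=10)
print(len(window))           # 10
print(window[0]["content"])  # prints "message 15"
```

Everything before the window is simply dropped, which is what makes this approach cheap and also what makes it lossy.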

A more sophisticated variant uses token-count-based truncation rather than message count, calculating total tokens in chat history and truncating when approaching the model's context window limit. 5)
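The token-based variant can be sketched as follows. The estimate_tokens helper is a crude stand-in (roughly four characters per token); a real implementation would use the model's own tokenizer.

```python
# Token-budget truncation: drop the oldest messages until the history
# fits within a budget derived from the model's context window.

def estimate_tokens(text):
    # Rough heuristic: ~4 characters per token. Replace with the
    # model's actual tokenizer in production.
    return max(1, len(text) // 4)

def truncate_to_budget(history, token_budget):
    kept = []
    total = 0
    # Walk newest-to-oldest so the most recent messages survive.
    for msg in reversed(history):
        cost = estimate_tokens(msg["content"])
        if total + cost > token_budget:
            break
        kept.append(msg)
        total += cost
    kept.reverse()
    return kept

history = [{"role": "user", "content": "x" * 40} for _ in range(10)]
recent = truncate_to_budget(history, token_budget=35)
print(len(recent))  # 3 (each message costs ~10 tokens)
```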

While computationally efficient, sliding window approaches suffer from significant context loss when earlier conversation segments contain critical information.

Summarization Strategies

Contextual summarization periodically compresses older conversation segments while keeping recent messages intact. A typical implementation preserves the most recent messages (for example, the last 10) verbatim and summarizes everything older once the history grows past a threshold such as 20 messages. 6)
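A sketch of this pattern, with summarize() as a placeholder assumption; in practice it would be an LLM call that condenses the older transcript:

```python
# Contextual summarization: once history grows past a threshold,
# replace older messages with a single summary message and keep the
# most recent ones verbatim.

def summarize(messages):
    # Placeholder: a real system would ask an LLM for a summary.
    return "Summary of %d earlier messages." % len(messages)

def compress_history(history, keep_recent=10, threshold=20):
    if len(history) <= threshold:
        return history  # still short enough to send as-is
    older, recent = history[:-keep_recent], history[-keep_recent:]
    summary_msg = {"role": "system", "content": summarize(older)}
    return [summary_msg] + recent

history = [{"role": "user", "content": f"m{i}"} for i in range(25)]
compressed = compress_history(history)
print(len(compressed))  # 11: one summary message plus the last 10
```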

Microsoft's approach involves compressing older messages into summaries while maintaining the system message, which is critical for ensuring the LLM responds appropriately. When users ask follow-up questions that depend on earlier context, summarization-based approaches retain necessary context that truncation-based methods lose. 7)

Recursive summarization automatically compresses conversation history every time it exceeds a defined threshold of messages or tokens. While this retains context from the whole conversation, some details inevitably disappear in compression. 8)
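The recursive variant can be sketched as a small memory class, again with summarize() as a placeholder LLM call; the key difference from one-shot compression is that each new summary folds in the previous one:

```python
# Recursive summarization: whenever the message buffer crosses a
# threshold, fold it — together with any prior summary — into a new
# running summary, then clear the buffer.

def summarize(previous_summary, messages):
    # Placeholder: a real system would prompt an LLM to extend
    # previous_summary with the new messages.
    return f"{previous_summary} + {len(messages)} msgs"

class RecursiveSummaryMemory:
    def __init__(self, threshold=20):
        self.threshold = threshold
        self.summary = ""
        self.buffer = []

    def add(self, message):
        self.buffer.append(message)
        if len(self.buffer) >= self.threshold:
            # Compress the buffer into the running summary.
            self.summary = summarize(self.summary, self.buffer)
            self.buffer = []

    def context(self):
        # What would be sent to the LLM: summary plus recent messages.
        return {"summary": self.summary, "recent": list(self.buffer)}

memory = RecursiveSummaryMemory(threshold=20)
for i in range(45):
    memory.add({"role": "user", "content": f"m{i}"})
ctx = memory.context()
print(len(ctx["recent"]))  # 5: two compressions happened at 20 and 40
```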

Hybrid summarize-and-buffer approaches summarize conversation up to the Nth message while passing in the last N messages verbatim, providing both historical context and complete recent context. 9)

Memory Architectures

Modern LLM applications employ hierarchical memory systems:

RAG-Based Memory

Specific context retrieval uses vector databases to store long conversations, then retrieves the most relevant pieces based on information in recent messages. This combines full-history retention with efficient token usage by retrieving only contextually relevant segments. 12)
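A toy sketch of the retrieval idea, under loud assumptions: embed() is a bag-of-words stand-in for a real embedding model, and the in-memory list stands in for a vector database.

```python
import math
from collections import Counter

# RAG-style memory: store each message as a vector and retrieve the
# most relevant past messages for the current query.

def embed(text):
    # Toy embedding: word-count vector. A real system would call an
    # embedding model here.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class VectorMemory:
    def __init__(self):
        self.store = []  # list of (vector, message) pairs

    def add(self, message):
        self.store.append((embed(message), message))

    def retrieve(self, query, k=2):
        query_vec = embed(query)
        ranked = sorted(self.store,
                        key=lambda item: cosine(item[0], query_vec),
                        reverse=True)
        return [msg for _, msg in ranked[:k]]

memory = VectorMemory()
memory.add("My dog is called Rex")
memory.add("The weather is sunny today")
top = memory.retrieve("what is my dog's name", k=1)
```

Only the retrieved segments are placed in the prompt, so token usage stays bounded even though the full history is retained in the store.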

Production implementations track token usage and context window continuously, triggering auto-compaction when usage crosses model thresholds (typically around 90% of the context window). Compaction rewrites history into: initial context + recent user messages (capped) + a handoff summary. 13)
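The compaction flow described above might be sketched as follows; estimate_tokens and make_handoff_summary are placeholder assumptions, and the 90% threshold and recent-message cap are configurable.

```python
# Auto-compaction: monitor token usage and, when it crosses a fraction
# of the context window (here 90%), rewrite the history as
# initial context + a handoff summary + capped recent messages.

def estimate_tokens(messages):
    # Rough heuristic: ~4 characters per token across all messages.
    return sum(len(m["content"]) // 4 for m in messages)

def make_handoff_summary(messages):
    # Placeholder for an LLM-generated summary of the dropped middle.
    return {"role": "system",
            "content": f"Handoff summary of {len(messages)} messages."}

def maybe_compact(history, context_window, threshold=0.9, recent_cap=5):
    if estimate_tokens(history) < context_window * threshold:
        return history  # still under budget; nothing to do
    initial = history[:1]            # e.g. the system message
    middle = history[1:-recent_cap]  # gets summarized away
    recent = history[-recent_cap:]   # kept verbatim
    return initial + [make_handoff_summary(middle)] + recent

history = ([{"role": "system", "content": "s" * 40}]
           + [{"role": "user", "content": "m" * 40} for _ in range(20)])
compacted = maybe_compact(history, context_window=200)
print(len(compacted))  # 7: initial + summary + 5 recent messages
```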

Tools and Frameworks

Production Best Practices

See Also

References