====== Conversation History Management ======

Conversation history management addresses a fundamental challenge in LLM applications: large language models are stateless and have no inherent knowledge of previous interactions or earlier parts of current conversations. ((source [[https://docs.redisvl.com/en/latest/user_guide/07_message_history.html|RedisVL: Message History]])) Effective history management requires balancing three competing demands: maintaining conversational context, controlling token costs, and respecting model context window limits. ((source [[https://community.openai.com/t/best-practices-for-cost-efficient-high-quality-context-management-in-long-ai-chats/1373996|OpenAI Community: Context Management Best Practices]]))

===== The Context Window Challenge =====

LLMs operate within a defined **context window** — the maximum number of tokens they can process at once. As conversations grow longer, accumulated chat history consumes increasingly large portions of this fixed budget, leaving less room for new user input and limiting the model's ability to generate responses. ((source [[https://devblogs.microsoft.com/agent-framework/managing-chat-history-for-large-language-models-llms/|Microsoft: Managing Chat History for LLMs]]))

Modern context windows vary significantly across models: GPT-4 Turbo supports 128K tokens, Claude supports up to 200K tokens, and Gemini 1.5 Pro extends to 1M tokens. Despite these expansions, storing complete conversation history remains economically inefficient at scale since most LLM pricing is token-based.

===== Sliding Window Approaches =====

The simplest strategy involves sending only the **last N messages** to the LLM, assuming recent context contains everything needed for coherent responses.
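Both windowing variants can be sketched in a few lines. The rough four-characters-per-token estimate and the ''{"role", "content"}'' message shape below are illustrative assumptions, not any particular SDK's API:

```python
# Sketch of sliding-window history management. The ~4-characters-per-token
# estimate and the {"role", "content"} message shape are illustrative
# assumptions, not a specific library's API.

def estimate_tokens(text: str) -> int:
    """Rough token estimate: roughly one token per four characters."""
    return max(1, len(text) // 4)

def trim_by_message_count(history: list[dict], max_messages: int) -> list[dict]:
    """Last-N-messages window: keep only the most recent messages."""
    return history[-max_messages:]

def trim_by_token_budget(history: list[dict], max_tokens: int) -> list[dict]:
    """Token-count-based truncation: drop the oldest messages until the
    remaining history fits within the token budget."""
    kept, total = [], 0
    for msg in reversed(history):          # walk from newest to oldest
        cost = estimate_tokens(msg["content"])
        if total + cost > max_tokens:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))            # restore chronological order
```

In production, an exact tokenizer for the target model would replace the character-based estimate.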
((source [[https://mem0.ai/blog/llm-chat-history-summarization-guide-2025|Mem0: Chat History Summarization Guide]]))

A more sophisticated variant uses **token-count-based truncation** rather than message count, calculating total tokens in chat history and truncating when approaching the model's context window limit. ((source [[https://devblogs.microsoft.com/agent-framework/managing-chat-history-for-large-language-models-llms/|Microsoft: Managing Chat History for LLMs]])) While computationally efficient, sliding window approaches suffer from significant context loss when earlier conversation segments contain critical information.

===== Summarization Strategies =====

**Contextual summarization** periodically compresses older conversation segments while keeping recent messages intact. A typical implementation summarizes everything older than 20 messages while preserving the last 10 messages verbatim. ((source [[https://mem0.ai/blog/llm-chat-history-summarization-guide-2025|Mem0: Chat History Summarization Guide]])) Microsoft's approach involves compressing older messages into summaries while maintaining the system message, which is critical for ensuring the LLM responds appropriately. When users ask follow-up questions that depend on earlier context, summarization-based approaches retain necessary context that truncation-based methods lose. ((source [[https://devblogs.microsoft.com/agent-framework/managing-chat-history-for-large-language-models-llms/|Microsoft: Managing Chat History for LLMs]]))

**Recursive summarization** automatically compresses conversation history every time it exceeds a defined threshold of messages or tokens. While this retains context from the whole conversation, some details inevitably disappear in compression.
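A minimal sketch of the contextual-summarization pattern, with the system message always preserved. Here ''summarize_fn'' stands in for a real LLM summarization call, and the message shape is an illustrative assumption:

```python
# Sketch of contextual summarization: everything older than the most recent
# messages is compressed into one summary entry, while the system message is
# always preserved. `summarize_fn` stands in for a real LLM summarization call.

def compact_history(history: list[dict], keep_recent: int, summarize_fn) -> list[dict]:
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    if len(rest) <= keep_recent:
        return history                     # nothing old enough to compress
    older, recent = rest[:-keep_recent], rest[-keep_recent:]
    summary = {
        "role": "system",
        "content": "Summary of earlier conversation: " + summarize_fn(older),
    }
    return system + [summary] + recent
```

Calling this whenever the history crosses a message or token threshold yields the recursive variant described above.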
((source [[https://vellum.ai/blog/how-should-i-manage-memory-for-my-llm-chatbot|Vellum: Managing Memory for LLM Chatbots]]))

**Hybrid summarize-and-buffer** approaches summarize the conversation up to a cutoff point while passing in the most recent N messages verbatim, providing both historical context and complete recent context. ((source [[https://vellum.ai/blog/how-should-i-manage-memory-for-my-llm-chatbot|Vellum: Managing Memory for LLM Chatbots]]))

===== Memory Architectures =====

Modern LLM applications employ hierarchical memory systems:

  * **Immediate working memory**: Handles the current session, typically preserved in full detail for maintaining conversational flow. ((source [[https://mem0.ai/blog/llm-chat-history-summarization-guide-2025|Mem0: Chat History Summarization Guide]]))
  * **Episodic memory**: Stores important past interactions such as key decisions, user preferences, or critical information from earlier conversations.
  * **Semantic memory**: Extracts general knowledge patterns accumulated over time, such as frequently asked questions and common user intents.
  * **Vectorized memory**: Stores past interactions as embeddings in a vector database. When historical context is needed, the system searches for semantically similar past conversations and injects only the most relevant snippets. ((source [[https://mem0.ai/blog/llm-chat-history-summarization-guide-2025|Mem0: Chat History Summarization Guide]]))

===== RAG-Based Memory =====

**Specific context retrieval** uses vector databases to store long conversations, then retrieves the most relevant pieces based on information in recent messages. This combines full-history retention with efficient token usage by retrieving only contextually relevant segments.
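The retrieval idea can be sketched with a toy in-memory store and cosine similarity standing in for a real vector database, and ''embed_fn'' as a placeholder for an embedding model:

```python
import math

# Toy sketch of vectorized memory: past messages are stored as embeddings,
# and only the most semantically similar ones are retrieved for injection
# into the prompt. `embed_fn` is a placeholder for a real embedding model.

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class VectorMemory:
    """In-memory stand-in for a vector database."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.items: list[tuple[list[float], str]] = []

    def add(self, text: str) -> None:
        self.items.append((self.embed_fn(text), text))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        """Return the k stored texts most similar to the query."""
        qv = self.embed_fn(query)
        ranked = sorted(self.items,
                        key=lambda item: cosine_similarity(item[0], qv),
                        reverse=True)
        return [text for _, text in ranked[:k]]
```

A production system would swap in a real embedding model and an approximate-nearest-neighbor index, but the store-then-retrieve-relevant-snippets flow is the same.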
((source [[https://vellum.ai/blog/how-should-i-manage-memory-for-my-llm-chatbot|Vellum: Managing Memory for LLM Chatbots]]))

Production implementations continuously track token usage against the context window, triggering auto-compaction when usage crosses model thresholds (typically around 90% of the context window). Compaction rewrites history into: initial context + recent user messages (capped) + a handoff summary. ((source [[https://community.openai.com/t/best-practices-for-cost-efficient-high-quality-context-management-in-long-ai-chats/1373996|OpenAI Community: Context Management Best Practices]]))

===== Tools and Frameworks =====

  * **LangChain**: Provides memory management capabilities for multi-turn interactions using message arrays and the AIMessage class. ((source [[https://codesignal.com/learn/courses/langchain-chat-essentials-in-javascript-1/lessons/managing-conversation-history-with-langchain-in-javascript-1|CodeSignal: LangChain Conversation History]]))
  * **Mem0**: Implements production-grade memory management with contextual summarization, vectorized memory, multi-level hierarchies, importance scoring, decay mechanisms, and conflict resolution. ((source [[https://mem0.ai/blog/llm-chat-history-summarization-guide-2025|Mem0: Chat History Summarization Guide]]))
  * **RedisVL**: Structures, stores, and retrieves conversation history using Redis, appending previous history to each subsequent LLM call. ((source [[https://docs.redisvl.com/en/latest/user_guide/07_message_history.html|RedisVL: Message History]]))
  * **Microsoft Semantic Kernel** (v1.35.0+): Built-in chat history reducers supporting truncation by message count, token count, summarization, or custom implementations. ((source [[https://devblogs.microsoft.com/agent-framework/managing-chat-history-for-large-language-models-llms/|Microsoft: Managing Chat History for LLMs]]))
  * **MemGPT**: Implements virtual context management inspired by operating system memory hierarchies, paging context in and out of the LLM's limited window.

===== Production Best Practices =====

  * **Aggressively bound noisy content**: Truncate tool outputs based on token or byte limits; budget function output to prevent excessive context consumption. ((source [[https://community.openai.com/t/best-practices-for-cost-efficient-high-quality-context-management-in-long-ai-chats/1373996|OpenAI Community: Context Management Best Practices]]))
  * **Preserve system messages**: The system message must be maintained across all history reduction techniques, as removing it degrades response quality. ((source [[https://devblogs.microsoft.com/agent-framework/managing-chat-history-for-large-language-models-llms/|Microsoft: Managing Chat History for LLMs]]))
  * **Avoid incomplete function-calling sequences**: When truncating history, ensure function-calling message sequences remain complete. ((source [[https://devblogs.microsoft.com/agent-framework/managing-chat-history-for-large-language-models-llms/|Microsoft: Managing Chat History for LLMs]]))
  * **Continuous token tracking**: Monitor token usage in real time, using server-reported counts where available and local estimates as a fallback.
  * **Importance-weighted preservation**: Different conversation elements merit different preservation priorities; critical information should persist longer than routine exchanges. ((source [[https://mem0.ai/blog/llm-chat-history-summarization-guide-2025|Mem0: Chat History Summarization Guide]]))
  * **UI-level hiding**: When using summarization, hide summary messages from the user interface to avoid displaying technical metadata.

===== See Also =====

  * [[contextual_priming]]
  * [[vector_embeddings]]
  * [[human_in_the_loop]]

===== References =====