Building an AI chatbot involves selecting the right architecture, choosing an LLM, managing conversations across turns, and deploying to production. This guide walks through each step with practical patterns and tool recommendations.
Before writing code, clarify the chatbot's purpose and constraints. Map out conversation flows, fallback paths, and escalation triggers, and define KPIs such as resolution rate, response latency, and user satisfaction.
Two dominant patterns exist for LLM-powered chatbots:
RAG combines an LLM with external knowledge retrieval to reduce hallucinations and ground responses in source documents. The flow is: embed the user's query, retrieve the most relevant document chunks from a vector store, inject those chunks into the prompt, and have the LLM generate an answer grounded in the retrieved context.
RAG is ideal for knowledge-intensive bots where accuracy matters more than creativity.
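The retrieve-then-generate flow above can be sketched without any external services. This is a toy illustration only: the "embedding" is a bag-of-words count and retrieval is cosine similarity over it, where a real system would use a sentence-embedding model and a vector database. `build_prompt` shows the injection step; the resulting prompt would be sent to whichever chat-completion API you choose.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": word counts. A real system would use a
    # sentence-embedding model instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank documents by similarity to the query and keep the top k.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    # Inject the retrieved chunks so the LLM answers from them, not from
    # its parametric memory.
    context = "\n".join(f"- {d}" for d in docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Swapping in real embeddings and a vector store changes only `embed` and `retrieve`; the grounding pattern stays the same.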
Agents maintain state across turns and can invoke tools (APIs, databases, calculators) autonomously. They use function calling to decide when to search, query, or act. This pattern suits booking assistants, sales bots, and multi-step workflows.
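The dispatch half of that loop can be sketched in a few lines. The tool names (`calculator`, `lookup_booking`) and the JSON shape the model emits are illustrative assumptions; real function-calling APIs return a structured tool-call object, but the routing logic looks much the same.

```python
import json

# Hypothetical tool registry: the model picks a tool by name and
# supplies JSON arguments for it.
TOOLS = {
    "calculator": lambda args: str(eval(args["expression"], {"__builtins__": {}})),
    "lookup_booking": lambda args: f"Booking {args['booking_id']}: confirmed",
}

def run_agent_turn(model_output: str) -> str:
    """Dispatch one model turn. We assume the model emits JSON like
    {"tool": "calculator", "args": {"expression": "2+2"}} when it wants a
    tool, or {"answer": "..."} when it can reply directly."""
    decision = json.loads(model_output)
    if decision.get("tool") in TOOLS:
        # In a full agent loop, this result is appended to the conversation
        # and the model is called again to produce the final answer.
        return TOOLS[decision["tool"]](decision["args"])
    return decision.get("answer", "")
```

Note that `eval` here is confined to a toy sandbox for the sketch; production tool handlers should validate arguments before acting.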
Choose based on accuracy, cost, latency, and data privacy requirements:
| LLM | Type | Context Window | Strengths | Approximate Cost |
|---|---|---|---|---|
| GPT-4o | Proprietary | 128K tokens | High accuracy, multimodal, easy API | $5-15/M input tokens |
| Claude 3.5/4 | Proprietary | 200K+ tokens | Strong reasoning, safety-focused | $3-15/M input tokens |
| Llama 3 | Open-source | 128K tokens | Customizable, self-hostable, fine-tunable | Free (hardware costs only) |
| Mistral | Open-source | 32-128K tokens | Fast inference, strong multilingual | Free (hardware costs only) |
Practical advice: prototype with a proprietary model (GPT-4o or Claude) for fast iteration, then evaluate open-source alternatives for production cost savings.
Every LLM has a maximum context window. For long conversations, implement a sliding window strategy: keep the system prompt and the most recent messages, and drop or summarize older turns so each request stays within the model's limit.
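A minimal sketch of that trimming step, assuming the system prompt is the first message and that you pass in your tokenizer's counting function (a rough proxy for English text is `len(text) // 4`):

```python
def trim_history(messages: list[dict], max_tokens: int, count_tokens) -> list[dict]:
    """Sliding-window trim: always keep the system prompt plus the most
    recent turns that fit within max_tokens. Messages are dicts with a
    'content' key, as in typical chat-completion APIs."""
    system, rest = messages[0], messages[1:]
    budget = max_tokens - count_tokens(system["content"])
    kept = []
    for msg in reversed(rest):          # walk from newest to oldest
        cost = count_tokens(msg["content"])
        if cost > budget:
            break                       # oldest turns fall out of the window
        kept.append(msg)
        budget -= cost
    return [system] + list(reversed(kept))
```

A common refinement is to replace the dropped turns with a single model-generated summary message instead of discarding them outright.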
Short-term memory lives in the message array for the current session. Long-term memory uses a vector database or a key-value store such as Redis to persist user preferences and past interactions across sessions.
Assign unique session IDs per conversation. Store session state server-side. Implement concurrency controls to prevent race conditions when multiple messages arrive simultaneously.
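A single-process sketch of those three points, using a per-session lock as the concurrency control. Across multiple server instances you would back this with a shared store such as Redis and its own locking primitive; the class and method names here are illustrative.

```python
import threading
import uuid
from collections import defaultdict

class SessionStore:
    """Server-side session state keyed by a unique session ID, with a
    per-session lock so concurrent messages for the same conversation
    are applied one at a time."""

    def __init__(self):
        self._sessions: dict[str, list] = {}
        self._locks: dict[str, threading.Lock] = defaultdict(threading.Lock)

    def create_session(self) -> str:
        session_id = str(uuid.uuid4())   # unique ID per conversation
        self._sessions[session_id] = []
        return session_id

    def append_message(self, session_id: str, message: dict) -> None:
        with self._locks[session_id]:    # serialize writes per session
            self._sessions[session_id].append(message)

    def history(self, session_id: str) -> list:
        with self._locks[session_id]:
            return list(self._sessions[session_id])
```

Locking per session rather than globally keeps unrelated conversations from blocking each other.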
| Framework | Best For | Language | Key Features |
|---|---|---|---|
| LangChain | RAG pipelines, agents | Python/JS | Modular chains, memory, tool integration |
| LlamaIndex | Data ingestion and retrieval | Python | Index construction, query engines |
| Vercel AI SDK | Frontend streaming | TypeScript | React/Next.js hooks, multi-provider support |
| Botpress | Full-stack chatbots | Visual/JS | Drag-and-drop flows, autonomous nodes |
For a Python-first RAG chatbot, LangChain plus a vector store is the most common stack. For a JavaScript frontend with streaming, the Vercel AI SDK provides the smoothest developer experience.
Deploy via managed services (AWS Bedrock, Google Vertex AI, Azure OpenAI) for automatic scaling and minimal infrastructure management. Cost scales with token usage.
Run open-source models on GPU instances (RTX 4090, A100) using Ollama, vLLM, or TGI inside Docker containers. Higher upfront cost but better privacy and lower per-token cost at scale.
Route simple queries to a small self-hosted model and complex queries to a proprietary API. This optimizes cost while maintaining quality.
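A hedged sketch of such a router. The length threshold and keyword list are illustrative assumptions, as are the backend names; production routers often use a small classifier model rather than heuristics, but the shape is the same: classify the query, then pick a backend.

```python
def route_query(query: str) -> str:
    """Heuristic hybrid-deployment router: short, simple queries go to a
    cheap self-hosted model; anything long or reasoning-heavy goes to a
    stronger proprietary API."""
    complex_markers = ("explain", "compare", "why", "step by step", "analyze")
    text = query.lower()
    if len(text.split()) <= 12 and not any(m in text for m in complex_markers):
        return "self-hosted-small"      # e.g. a local Llama 3 8B
    return "proprietary-api"            # e.g. GPT-4o or Claude
```

Logging each routing decision alongside user-satisfaction signals lets you tune the thresholds against the KPIs defined earlier.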