====== How to Build an AI Assistant ======

An AI assistant goes beyond a simple chatbot by incorporating memory, tool use, personality, and multi-step reasoning. This guide covers the architecture, core components, and production considerations for building a capable assistant from scratch.

===== Architecture =====

A production AI assistant has six layers:

  - **User Interface** -- chat widget, API endpoint, or voice interface
  - **Prompt Processing** -- input validation, normalization, and safety checks
  - **LLM Reasoning Engine** -- the core model that understands intent and generates responses
  - **Tool Registry** -- available functions the LLM can invoke (APIs, databases, calculators)
  - **Tool Execution Engine** -- safely invokes tools and manages outputs
  - **Response Synthesizer** -- formats the final response for the user

The key insight is separating the LLM's reasoning from deterministic tool execution: the LLM decides what to do; external tools do the actual work reliably. ((Source: [[https://www.nivalabs.ai/blogs/llm-function-calling-and-tool-use-in-python-building-intelligent-ai-assistants|NivaLabs - LLM Function Calling and Tool Use]]))

===== Choosing an LLM Backbone =====

Evaluate models on:

  * **Context length** -- how much conversation history and tool output can fit
  * **Function calling support** -- native tool-use capability
  * **Reasoning quality** -- ability to decompose complex tasks
  * **Latency** -- response time for interactive use
  * **Cost** -- per-token pricing or self-hosting expense

^ Model ^ Context ^ Tool Use ^ Best For ^
| GPT-4o | 128K | Excellent | General-purpose, rapid prototyping |
| Claude 3.5/4 | 200K+ | Excellent | Long-context tasks, nuanced reasoning |
| Llama 3.1 70B | 128K | Good | Self-hosted production |
| Qwen 3 32B | 128K | Strong | Multilingual, cost-efficient self-hosting |

A practical approach is to prototype with a proprietary model, then evaluate whether a self-hosted open model meets your quality bar.
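The reasoning/execution split at the heart of this architecture can be sketched in a few lines. This is a toy illustration, not a real integration: ''fake_llm_decide'' stands in for the LLM Reasoning Engine, and the ''calculator'' tool is a hypothetical registry entry.

```python
# Minimal sketch of the layered flow: Prompt Processing -> LLM decision ->
# Tool Execution -> Response Synthesis. The LLM stub and tool are hypothetical.

TOOLS = {  # Tool Registry: name -> callable, invoked by the execution engine
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def fake_llm_decide(user_msg: str) -> dict:
    """Stand-in for the LLM Reasoning Engine: pick a tool or answer directly."""
    if any(ch.isdigit() for ch in user_msg):
        return {"tool": "calculator", "args": user_msg}
    return {"answer": "I can help with that."}

def handle(user_msg: str) -> str:
    user_msg = user_msg.strip()               # Prompt Processing: normalize input
    decision = fake_llm_decide(user_msg)      # the LLM decides what to do
    if "tool" in decision:                    # a deterministic tool does the work
        result = TOOLS[decision["tool"]](decision["args"])
        return f"The result is {result}"      # Response Synthesizer
    return decision["answer"]
```

In a real assistant the stub is replaced by a model call with function-calling enabled, but the shape of the loop stays the same: the model only ever chooses; the registry executes.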
((Source: [[https://www.glean.com/blog/llm-choice-2026|Glean - LLM Choice]]))

===== Memory Systems =====

=== Short-Term Memory ===

The conversation message array serves as short-term memory. For long conversations, implement a sliding window:

  * Keep the system prompt and the most recent N messages in full
  * Summarize older messages into a condensed context block
  * Use token counting to stay within the model's context limit

=== Long-Term Memory ===

Use a vector database (Pinecone, Weaviate, Qdrant, ChromaDB) to store and retrieve:

  * User preferences and profile data
  * Past conversation summaries
  * Domain knowledge documents (RAG)

The retrieval pipeline: embed the current query, search the vector store for similar entries, and inject the retrieved context into the prompt. This gives the assistant memory that persists across sessions. ((Source: [[https://www.acalvio.com/blog/active-defense/building-an-llm-powered-cybersecurity-ai-assistant/|Acalvio - Building an LLM-Powered AI Assistant]]))

=== Memory Architecture Pattern ===

Combine both: buffer the recent conversation in memory, and persist summaries and key facts to the vector store after each session. On new sessions, retrieve relevant long-term memories and prepend them to the conversation.

===== Tool Use and Function Calling =====

Tools transform an assistant from a text generator into an action-taker.
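As a concrete sketch of what a registry entry and execution engine can look like, the snippet below pairs a JSON-Schema parameter description (the style most function-calling APIs expect) with a callable. The ''get_weather'' tool and its hand-rolled validation are hypothetical; production code would use the provider's tool format and a schema-validation library.

```python
# One registry entry plus a minimal execution engine. The tool name,
# implementation, and validation strategy are illustrative assumptions.

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # a real tool would call a weather API

REGISTRY = {
    "get_weather": {
        "description": "Get the current weather for a city",
        "parameters": {  # JSON Schema the LLM sees when deciding to call
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
        "func": get_weather,
    }
}

def execute(name: str, args: dict) -> str:
    """Execution engine: check required parameters, then invoke the tool."""
    entry = REGISTRY[name]
    missing = [p for p in entry["parameters"]["required"] if p not in args]
    if missing:
        return f"error: missing parameters {missing}"  # fed back to the LLM
    return entry["func"](**args)
```

Returning validation errors as strings, rather than raising, lets the LLM see the failure and retry with corrected arguments.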
Implement a tool registry:

  * Each tool has a name, a description, and a JSON Schema for its parameters
  * The LLM decides when to call a tool based on user intent
  * The execution engine validates parameters, runs the tool, and returns results
  * Results are fed back to the LLM for further reasoning

Common tools for assistants:

  * Web search and URL fetching
  * Database queries
  * Calendar and email APIs
  * File read/write operations
  * Calculations and data transformations

((Source: [[https://www.nivalabs.ai/blogs/llm-function-calling-and-tool-use-in-python-building-intelligent-ai-assistants|NivaLabs - LLM Function Calling and Tool Use]]))

===== Personality and System Prompt Design =====

The system prompt defines the assistant's behavior, boundaries, and character:

  * **Role definition** -- ''You are a customer support specialist for Acme Corp''
  * **Behavioral guidelines** -- tone, formality level, response length
  * **Knowledge boundaries** -- which topics to address and which to decline
  * **Safety constraints** -- prohibited actions, harm-reduction rules
  * **Output format** -- structured responses, markdown, specific templates

Harden the system prompt against injection attacks:

  * Use clear delimiters between system instructions and user input
  * Test with adversarial prompts (jailbreak attempts, role-steering)
  * Layer guardrails on top of prompt-level defenses

((Source: [[https://splx.ai/blog/system-prompt-hardening-the-backbone-of-automated-ai-security|SPLX - System Prompt Hardening]]))

===== Frameworks =====

^ Framework ^ Best For ^ Key Strengths ^
| LangGraph | Complex stateful workflows | Explicit state management, conditional branching, human-in-the-loop |
| CrewAI | Multi-agent collaboration | Role-based agents, task delegation |
| AutoGen | Conversational multi-agent | Message-passing architecture, flexible |
| Semantic Kernel | Enterprise / Microsoft stack | Azure integration, plugin architecture |

For a single-assistant system, LangGraph provides the most control over the
execution flow. For multi-agent setups where specialists collaborate, CrewAI or AutoGen are better fits.

===== Deployment =====

=== Containerization ===

Package the assistant in Docker for consistent deployment across environments. Include the application code, dependencies, and configuration -- but not the model weights (pull those at runtime or mount them from a volume).

=== Production Checklist ===

  * **Rate limiting** -- per-user and per-minute quotas to prevent abuse
  * **Error handling** -- exponential backoff for API failures, fallback to simpler models
  * **Cost monitoring** -- log token counts per request, set budget alerts
  * **Caching** -- cache system prompts and frequent RAG results
  * **Model routing** -- use a small, fast model for simple queries and a large model for complex ones
  * **Audit logging** -- immutable logs of all interactions for compliance
  * **Graceful degradation** -- continue functioning with reduced capability when components fail

((Source: [[https://www.sandgarden.com/learn/llm-inference|Sandgarden - LLM Inference]]))

===== See Also =====

  * [[how_to_build_a_chatbot|How to Build a Chatbot]]
  * [[how_to_use_function_calling|How to Use Function Calling]]
  * [[how_to_create_an_agent|How to Create an Agent]]
  * [[how_to_implement_guardrails|How to Implement Guardrails]]

===== References =====