====== How to Build a Voice Agent ======

A voice agent enables real-time spoken conversation with an AI. It combines speech-to-text (STT), an LLM for reasoning, and text-to-speech (TTS) into a pipeline that processes audio streams with sub-second latency. This guide covers the architecture, component selection, and production deployment.

===== Pipeline Architecture =====

The standard voice agent pipeline has three stages:

Audio In → **STT** → Text → **LLM** → Text → **TTS** → Audio Out

Each stage runs as a streaming component, processing data incrementally rather than waiting for complete inputs:

  * **STT** receives audio chunks (100-250ms segments) and emits partial transcripts
  * **LLM** receives the transcript and streams token-by-token responses
  * **TTS** converts text chunks to audio as they arrive, without waiting for the full response

=== Architecture Patterns ===

^ Pattern ^ Description ^ Latency ^ Complexity ^
| Sequential | STT completes → LLM completes → TTS completes | 1-2 seconds | Low |
| Streaming | All three stages stream concurrently | 300-800ms | Medium |
| Unified Multimodal | Single model processes audio directly (e.g., GPT-4o voice) | 200-500ms | Low (but limited control) |

The streaming pattern is the production standard. An orchestration layer manages the data flow between components, handles interruptions, and maintains conversation state. ((Source: [[https://www.assemblyai.com/blog/the-voice-ai-stack-for-building-agents|AssemblyAI - Voice AI Stack]]))

===== Real-Time Streaming =====

=== WebSocket ===

Full-duplex streaming over a single long-lived TCP connection, established via an HTTP upgrade handshake. The client sends audio chunks and receives audio/text responses on the same connection. Ideal for web and cloud-based applications.

=== WebRTC ===

Peer-to-peer audio/video with built-in NAT traversal, echo cancellation, and noise suppression. Better for telephony integration (SIP trunking) and scenarios requiring the lowest possible latency.

Both protocols support the streaming pipeline.
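As a concrete illustration of the chunked audio streaming described above, the sketch below slices raw PCM audio into fixed-duration frames ready to send as binary WebSocket messages. The 16 kHz sample rate and 200 ms frame size are illustrative assumptions, not requirements of any provider, and the actual network send is omitted:

```python
# Slice raw 16-bit mono PCM audio into fixed-duration frames for streaming.
# Assumptions: 16 kHz sample rate, 200 ms frames -- illustrative choices,
# not requirements of any particular STT provider.

SAMPLE_RATE = 16_000          # samples per second
BYTES_PER_SAMPLE = 2          # 16-bit PCM
FRAME_MS = 200                # one chunk of audio per WebSocket message

FRAME_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * FRAME_MS // 1000  # 6400 bytes

def frames(pcm: bytes):
    """Yield successive FRAME_BYTES-sized chunks; the last may be shorter."""
    for offset in range(0, len(pcm), FRAME_BYTES):
        yield pcm[offset:offset + FRAME_BYTES]

if __name__ == "__main__":
    one_second = bytes(SAMPLE_RATE * BYTES_PER_SAMPLE)  # 1 s of silence
    chunks = list(frames(one_second))
    print(len(chunks), len(chunks[0]))  # → 5 6400
```

In a real agent each frame would be sent as soon as it is captured from the microphone, so the STT engine can begin emitting partial transcripts while the user is still speaking.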
WebSocket is simpler to implement; WebRTC provides better audio quality and telephony compatibility. ((Source: [[https://www.assemblyai.com/blog/the-voice-ai-stack-for-building-agents|AssemblyAI - Voice AI Stack]]))

===== Speech-to-Text Engines =====

^ Engine ^ Latency ^ Accuracy ^ Strengths ^ Best For ^
| Deepgram | <300ms | High (contextual) | Real-time streaming, noise-robust, custom vocabulary | Production phone/web agents |
| OpenAI Whisper | 500ms+ | Excellent offline | Open-source, multilingual, self-hostable | Cost-sensitive, batch processing |
| AssemblyAI | <500ms | High | Endpointing, speaker diarization, real-time | Multi-speaker conversations |

**Deepgram** leads for production voice agents due to its streaming latency and accuracy. **Whisper** is best for self-hosted or offline scenarios where latency is less critical. ((Source: [[https://deepgram.com/learn/what-is-a-voice-ai-agent-2026|Deepgram - Voice AI Agents]]))

===== Text-to-Speech Engines =====

^ Engine ^ Latency ^ Voice Quality ^ Strengths ^ Best For ^
| ElevenLabs | <500ms | Ultra-realistic | Emotional prosody, voice cloning | Expressive, personality-driven agents |
| PlayHT | <300ms | Natural | Fast cloning, multilingual | Scalable streaming applications |
| Cartesia | <200ms | High-fidelity | Ultra-low latency | Real-time interruption handling |

**Cartesia** excels when latency is the top priority. **ElevenLabs** produces the most human-like voices for agents where personality and expression matter.
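Whichever engine is chosen, streaming pipelines typically split the LLM's token stream at sentence boundaries so synthesis can begin before the full response exists. Below is a minimal sketch of that chunking; the ''. ! ?'' boundary rule is a deliberate simplification (production systems use smarter segmentation):

```python
import re

# Accumulate streamed LLM text and emit complete sentences as soon as they
# close, so TTS can start synthesizing before the full response is generated.
# The "[.!?] + whitespace" boundary rule is a deliberate simplification.

def sentence_chunks(token_stream):
    buffer = ""
    for token in token_stream:
        buffer += token
        # Split off every complete sentence currently in the buffer.
        while True:
            match = re.search(r"[.!?]\s", buffer)
            if not match:
                break
            yield buffer[:match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():           # flush whatever remains at end of stream
        yield buffer.strip()

if __name__ == "__main__":
    tokens = ["Hi", " there", ". How", " can I", " help", " you", " today", "?"]
    print(list(sentence_chunks(tokens)))
    # → ['Hi there.', 'How can I help you today?']
```

Each yielded sentence would be handed to the TTS engine immediately, which is what makes the sub-second time-to-first-audio figures in the table above achievable.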
((Source: [[https://www.assemblyai.com/blog/the-voice-ai-stack-for-building-agents|AssemblyAI - Voice AI Stack]]))

===== Voice Agent Platforms =====

Platforms provide pre-built orchestration so you do not need to wire together STT, LLM, and TTS manually:

^ Platform ^ Type ^ Key Features ^ Customization ^ Best For ^
| Vapi | Framework | Orchestration layer, provider-agnostic, telephony | High | Custom voice agents with full control |
| LiveKit | Open-source | WebRTC agents, real-time streaming, scalable | Very high | Self-hosted, maximum flexibility |
| Retell AI | End-to-end | Unified API, interruption handling | Medium | Rapid prototyping |
| Bland AI | End-to-end | Conversational focus, easy deployment | Medium | Quick phone agent deployment |

**Vapi** and **LiveKit** are for teams that want control over each pipeline component. **Retell** and **Bland** are for shipping fast with less engineering effort. ((Source: [[https://vellum.ai/blog/ai-voice-agent-platforms-guide|Vellum - Voice Agent Platforms Guide]]))

===== LLM Integration =====

The LLM sits at the center of the pipeline, receiving transcripts and generating responses:

  * **Conversation history** -- maintain a message array with previous turns for context
  * **Function calling** -- enable the LLM to trigger actions (book appointments, query databases)
  * **RAG integration** -- retrieve knowledge base documents to ground responses
  * **Response formatting** -- instruct the LLM to respond in short, spoken-friendly sentences
  * **Sentiment awareness** -- adjust tone and content based on detected user emotion

Optimize LLM prompts for spoken output: shorter sentences, no markdown, no code blocks, conversational tone. ((Source: [[https://deepgram.com/learn/what-is-a-voice-ai-agent-2026|Deepgram - Voice AI Agents]]))

===== Latency Optimization =====

Target: under 1 second end-to-end (STT ~300ms + LLM ~300ms + TTS ~200ms).
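That budget can be sanity-checked by summing the per-stage targets. The figures below repeat the illustrative numbers from this section, plus an assumed transport overhead that is not from the source:

```python
# End-to-end latency budget for one conversational turn, in milliseconds.
# The stt/llm/tts figures are the illustrative targets from this section;
# the transport figure is an added assumption.

budget_ms = {
    "stt": 300,        # final transcript after end of user speech
    "llm": 300,        # time to first token
    "tts": 200,        # time to first audio from the first text chunk
    "transport": 100,  # assumed network and buffering overhead
}

total = sum(budget_ms.values())
print(f"end-to-end: {total} ms ({'within' if total <= 1000 else 'over'} 1 s target)")
# → end-to-end: 900 ms (within 1 s target)
```

Note that the stages overlap in a streaming pipeline, so the perceived delay is usually lower than the straight sum; the sum is a conservative worst case.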
  * **Stream everything** -- do not wait for complete outputs at any stage
  * **Use fast STT/TTS providers** -- Deepgram for STT, Cartesia for TTS minimizes pipeline latency
  * **Choose a fast LLM** -- smaller models (GPT-4o-mini, Llama 3 8B) reduce time-to-first-token
  * **Edge deployment** -- run components geographically close to users
  * **VAD (Voice Activity Detection)** -- detect speech endpoints quickly to minimize silence before processing
  * **Prefetch and cache** -- preload common responses, cache TTS for frequent phrases

Measure Time-to-First-Token (TTFT) and Time-to-First-Audio as primary latency metrics. ((Source: [[https://www.assemblyai.com/blog/the-voice-ai-stack-for-building-agents|AssemblyAI - Voice AI Stack]]))

===== Interruption Handling =====

Users frequently interrupt (barge-in). The agent must:

  - Detect user speech onset via VAD while the agent is speaking
  - Immediately stop TTS playback
  - Flush the current LLM generation
  - Process the new user input from the interruption point
  - Resume the conversation naturally (e.g., ''Sorry, go ahead'')

Platforms like LiveKit and Vapi handle interruption natively. For custom builds, implement a state machine that transitions between ''listening'', ''thinking'', ''speaking'', and ''interrupted'' states. ((Source: [[https://www.assemblyai.com/blog/the-voice-ai-stack-for-building-agents|AssemblyAI - Voice AI Stack]]))

===== Production Deployment =====

Follow a phased approach:

  - **Define KPIs** -- containment rate (40-60% target), average handle time, user satisfaction (CSAT)
  - **Prepare integrations** -- connect to telephony (Twilio), CRM, knowledge base
  - **Alpha test** -- internal red-teaming with adversarial scenarios
  - **Beta test** -- route 10% of traffic to the voice agent, A/B test against human agents
  - **Scale** -- expand traffic gradually, monitor KPIs, iterate

=== Cost Considerations ===

Voice agent costs are typically $0.05-0.20 per minute depending on the provider stack.
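As a back-of-envelope sketch, mid-range per-component rates can be summed into a per-minute figure. Every rate below is an illustrative assumption, not a quoted provider price:

```python
# Rough per-minute cost model for a voice agent. Each rate is an
# illustrative mid-range assumption, not a quoted provider price.

rates_per_minute = {
    "stt": 0.010,
    "llm": 0.030,
    "tts": 0.030,
    "telephony": 0.020,
}

per_minute = sum(rates_per_minute.values())
monthly = per_minute * 10_000  # e.g. 10,000 call-minutes per month

print(f"${per_minute:.2f}/min, ~${monthly:,.0f} per 10k minutes")
# → $0.09/min, ~$900 per 10k minutes
```

LLM verbosity is usually the most variable term, which is one more reason to prompt for short, spoken-friendly responses.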
Major cost components:

  * STT: $0.005-0.02/minute
  * LLM: $0.01-0.05/minute (depends on model and verbosity)
  * TTS: $0.01-0.05/minute
  * Telephony: $0.01-0.03/minute

((Source: [[https://masterofcode.com/blog/voice-ai-development-costs|Master of Code - Voice AI Development Costs]]))

===== See Also =====

  * [[how_to_build_an_ai_assistant|How to Build an AI Assistant]]
  * [[how_to_use_function_calling|How to Use Function Calling]]
  * [[how_to_self_host_an_llm|How to Self-Host an LLM]]

===== References =====