A voice agent enables real-time spoken conversation with an AI. It combines speech-to-text (STT), an LLM for reasoning, and text-to-speech (TTS) into a pipeline that processes audio streams with sub-second latency. This guide covers the architecture, component selection, and production deployment.
The standard voice agent pipeline has three stages:
Audio In → **STT** → Text → **LLM** → Text → **TTS** → Audio Out
Each stage runs as a streaming component, processing data incrementally rather than waiting for complete inputs:
| Pattern | Description | Latency | Complexity |
|---|---|---|---|
| Sequential | STT completes → LLM completes → TTS completes | 1-2 seconds | Low |
| Streaming | All three stages stream concurrently | 300-800ms | Medium |
| Unified Multimodal | Single model processes audio directly (e.g., GPT-4o voice) | 200-500ms | Low (but limited control) |
The streaming pattern is the production standard. An orchestration layer manages the data flow between components, handles interruptions, and maintains conversation state.
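The streaming pattern above can be sketched as concurrent tasks connected by queues, so each stage forwards results downstream as soon as they are ready. This is a minimal illustration with stub transforms standing in for real STT, LLM, and TTS engines:

```python
import asyncio

END = object()  # sentinel marking end-of-stream

async def stage(transform, inbox, outbox):
    """Consume items from inbox, transform them, and forward immediately."""
    while True:
        item = await inbox.get()
        if item is END:
            await outbox.put(END)
            return
        await outbox.put(transform(item))

async def run_pipeline(audio_chunks):
    """Push audio chunks through stubbed STT -> LLM -> TTS stages."""
    a2t, t2t, t2a, out = (asyncio.Queue() for _ in range(4))
    tasks = [
        asyncio.create_task(stage(lambda c: f"text({c})", a2t, t2t)),   # STT stub
        asyncio.create_task(stage(lambda t: f"reply({t})", t2t, t2a)),  # LLM stub
        asyncio.create_task(stage(lambda r: f"audio({r})", t2a, out)),  # TTS stub
    ]
    for chunk in audio_chunks:
        await a2t.put(chunk)  # chunks enter the pipeline as they arrive
    await a2t.put(END)
    played = []
    while (item := await out.get()) is not END:
        played.append(item)
    await asyncio.gather(*tasks)
    return played
```

Because each stage only waits on its own queue, a chunk can be in TTS while the next chunk is still in STT — which is exactly where the latency win over the sequential pattern comes from.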
**1) Transport: WebSocket vs. WebRTC**

**WebSocket.** Bidirectional streaming over a single long-lived connection established via HTTP upgrade. The client sends audio chunks and receives audio/text responses on the same connection. Ideal for web and cloud-based applications.
**WebRTC.** Peer-to-peer audio/video with built-in NAT traversal, echo cancellation, and noise suppression. Better suited to telephony integration (SIP trunking) and scenarios requiring the lowest possible latency.
Both protocols support the streaming pipeline. WebSocket is simpler to implement; WebRTC provides better audio quality and telephony compatibility.
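For the WebSocket case, audio chunks need a wire format the server can parse. The envelope below is an illustrative sketch — the field names (`type`, `seq`, `audio`) are assumptions for this example; each provider defines its own schema:

```python
import base64
import json

def encode_audio_frame(seq: int, pcm: bytes) -> str:
    """Wrap a raw PCM chunk in a JSON text frame for the WebSocket."""
    return json.dumps({
        "type": "audio_chunk",
        "seq": seq,  # sequence number so the server can detect gaps
        "audio": base64.b64encode(pcm).decode("ascii"),
    })

def decode_audio_frame(frame: str) -> tuple[int, bytes]:
    """Recover the sequence number and raw PCM bytes from a frame."""
    msg = json.loads(frame)
    if msg["type"] != "audio_chunk":
        raise ValueError(f"unexpected frame type: {msg['type']}")
    return msg["seq"], base64.b64decode(msg["audio"])
```

In production, binary WebSocket frames avoid the ~33% base64 overhead; the JSON envelope is shown here because it is the easier format to debug.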
**2) STT Engine Comparison**

| Engine | Latency | Accuracy | Strengths | Best For |
|---|---|---|---|---|
| Deepgram | <300ms | High (contextual) | Real-time streaming, noise-robust, custom vocabulary | Production phone/web agents |
| OpenAI Whisper | 500ms+ | Excellent offline | Open-source, multilingual, self-hostable | Cost-sensitive, batch processing |
| AssemblyAI | <500ms | High | Endpointing, speaker diarization, real-time | Multi-speaker conversations |
Deepgram leads for production voice agents due to its streaming latency and accuracy. Whisper is best for self-hosted or offline scenarios where latency is less critical.
**3) TTS Engine Comparison**

| Engine | Latency | Voice Quality | Strengths | Best For |
|---|---|---|---|---|
| ElevenLabs | <500ms | Ultra-realistic | Emotional prosody, voice cloning | Expressive, personality-driven agents |
| PlayHT | <300ms | Natural | Fast cloning, multilingual | Scalable streaming applications |
| Cartesia | <200ms | High-fidelity | Ultra-low latency | Real-time interruption handling |
Cartesia excels when latency is the top priority. ElevenLabs produces the most human-like voices for agents where personality and expression matter.
**4) Orchestration Platforms**

Platforms provide pre-built orchestration so you do not need to wire together STT, LLM, and TTS manually:
| Platform | Type | Key Features | Customization | Best For |
|---|---|---|---|---|
| Vapi | Framework | Orchestration layer, provider-agnostic, telephony | High | Custom voice agents with full control |
| LiveKit | Open-source | WebRTC agents, real-time streaming, scalable | Very high | Self-hosted, maximum flexibility |
| Retell AI | End-to-end | Unified API, interruption handling | Medium | Rapid prototyping |
| Bland AI | End-to-end | Conversational focus, easy deployment | Medium | Quick phone agent deployment |
Vapi and LiveKit suit teams that want control over each pipeline component; Retell and Bland suit teams that want to ship fast with less engineering effort.
**5) LLM Integration**

The LLM sits at the center of the pipeline, receiving transcripts and generating responses.
Optimize LLM prompts for spoken output: shorter sentences, no markdown, no code blocks, conversational tone.
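Even with a well-tuned prompt, models occasionally emit markup, so it helps to sanitize responses before synthesis. A minimal sketch using assumed regex heuristics (not a full markdown parser):

```python
import re

def prepare_for_tts(text: str) -> str:
    """Strip markup a TTS engine would otherwise read aloud."""
    text = re.sub(r"`{3}.*?`{3}", "", text, flags=re.DOTALL)  # drop code blocks
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)      # keep link text only
    text = re.sub(r"[*_#`]", "", text)                        # strip inline markup
    return re.sub(r"\s+", " ", text).strip()                  # collapse whitespace
```

Run this on each LLM text chunk before it reaches the TTS stage; the same hook is a natural place to expand abbreviations or numbers for better pronunciation.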
**6) Latency Budget**

Target: under 1 second end-to-end (STT ~300ms + LLM ~300ms + TTS ~200ms).
Measure Time-to-First-Token (TTFT) and Time-to-First-Audio as primary latency metrics.
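TTFT can be measured against any streaming token iterator by timestamping the first yield. A sketch, with `fake_llm_stream` as a stand-in for a real LLM stream:

```python
import time

def measure_ttft(token_stream):
    """Return (ttft_seconds, full_text) for a streaming response."""
    start = time.monotonic()
    ttft, parts = None, []
    for token in token_stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first token arrived
        parts.append(token)
    return ttft, "".join(parts)

def fake_llm_stream():
    """Stand-in generator simulating per-token latency."""
    for token in ["Hello", ",", " world"]:
        time.sleep(0.01)
        yield token
```

Time-to-First-Audio is measured the same way, but with the clock stopped at the first TTS audio chunk rather than the first text token.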
**7) Interruption Handling (Barge-In)**

Users frequently interrupt mid-response (barge-in). The agent must:
- Detect that the user has started speaking (via voice activity detection or incoming STT events)
- Stop TTS playback immediately and flush any queued audio
- Cancel or truncate the in-flight LLM generation
- Yield the turn, optionally with a brief acknowledgment ("Sorry, go ahead")
Platforms like LiveKit and Vapi handle interruption natively. For custom builds, implement a state machine that transitions between listening, thinking, speaking, and interrupted states.
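A minimal sketch of that state machine; the states match the ones named above, while the event names and transition table are illustrative assumptions (real agents add timers, VAD thresholds, and error states):

```python
# (current_state, event) -> next_state
TRANSITIONS = {
    ("listening", "user_stopped"): "thinking",
    ("thinking", "response_ready"): "speaking",
    ("speaking", "playback_done"): "listening",
    ("speaking", "user_spoke"): "interrupted",   # barge-in
    ("interrupted", "tts_flushed"): "listening",
}

class AgentState:
    """Tracks which phase of the turn the agent is in."""

    def __init__(self):
        self.state = "listening"

    def on(self, event: str) -> str:
        """Apply an event; raise on transitions the table does not allow."""
        nxt = TRANSITIONS.get((self.state, event))
        if nxt is None:
            raise ValueError(f"no transition for {event!r} in state {self.state!r}")
        self.state = nxt
        return nxt
```

The explicit table makes illegal transitions fail loudly — e.g. audio playback events arriving after a barge-in has already flushed the TTS queue.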
**8) Production Deployment**

Follow a phased approach: prototype with a single conversation flow, pilot with a limited set of real calls, then scale to production with monitoring of latency, transcription accuracy, and interruption behavior.
Voice agent costs are typically $0.05-0.20 per minute depending on the provider stack. Major cost components:

- STT: billed per minute of audio transcribed
- LLM: billed per input and output token
- TTS: billed per character or per minute of synthesized audio
- Telephony/transport: per-minute carrier or infrastructure fees
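A back-of-the-envelope model for combining those components into a per-minute figure. Every rate below is an illustrative placeholder, not actual provider pricing:

```python
def cost_per_minute(stt_per_min: float, tts_per_min: float,
                    llm_per_1k_tokens: float, tokens_per_min: float,
                    telephony_per_min: float = 0.0) -> float:
    """Sum the per-minute cost of each pipeline component."""
    llm_cost = llm_per_1k_tokens * tokens_per_min / 1000
    return stt_per_min + tts_per_min + llm_cost + telephony_per_min

# Hypothetical stack: budget STT, premium TTS, small LLM, SIP trunk
estimate = cost_per_minute(stt_per_min=0.01, tts_per_min=0.05,
                           llm_per_1k_tokens=0.002, tokens_per_min=1000,
                           telephony_per_min=0.01)
```

Running the model with real quotes from your chosen providers quickly shows that TTS usually dominates when using premium voices, which is why engine choice in section 3 is often a cost decision as much as a quality one.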