A voice agent enables real-time spoken conversation with an AI. It combines speech-to-text (STT), an LLM for reasoning, and text-to-speech (TTS) into a pipeline that processes audio streams with sub-second latency. This guide covers the architecture, component selection, and production deployment.
The standard voice agent pipeline has three stages:
Audio In → **STT** → Text → **LLM** → Text → **TTS** → Audio Out
Each stage runs as a streaming component, processing data incrementally rather than waiting for complete inputs:
| Pattern | Description | Latency | Complexity |
|---|---|---|---|
| Sequential | STT completes → LLM completes → TTS completes | 1-2 seconds | Low |
| Streaming | All three stages stream concurrently | 300-800ms | Medium |
| Unified Multimodal | Single model processes audio directly (e.g., GPT-4o voice) | 200-500ms | Low (but limited control) |
The streaming pattern is the production standard. An orchestration layer manages the data flow between components, handles interruptions, and maintains conversation state.
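The streaming pattern above can be sketched as concurrent tasks connected by queues, so each stage forwards results downstream as soon as they are ready. This is a minimal illustration with stub transforms standing in for real STT, LLM, and TTS engines:

```python
import asyncio

END = object()  # sentinel marking end-of-stream

async def stage(transform, inbox, outbox):
    """Consume items from inbox, transform them, and forward immediately."""
    while True:
        item = await inbox.get()
        if item is END:
            await outbox.put(END)
            return
        await outbox.put(transform(item))

async def run_pipeline(audio_chunks):
    """Push audio chunks through stubbed STT -> LLM -> TTS stages."""
    a2t, t2t, t2a, out = (asyncio.Queue() for _ in range(4))
    tasks = [
        asyncio.create_task(stage(lambda c: f"text({c})", a2t, t2t)),   # STT stub
        asyncio.create_task(stage(lambda t: f"reply({t})", t2t, t2a)),  # LLM stub
        asyncio.create_task(stage(lambda r: f"audio({r})", t2a, out)),  # TTS stub
    ]
    for chunk in audio_chunks:
        await a2t.put(chunk)  # chunks enter the pipeline as they arrive
    await a2t.put(END)
    played = []
    while (item := await out.get()) is not END:
        played.append(item)
    await asyncio.gather(*tasks)
    return played
```

Because each stage only waits on its own queue, a chunk can be in TTS while the next chunk is still in STT — which is exactly where the latency win over the sequential pattern comes from.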
**1) Transport: WebSocket vs. WebRTC**

**WebSocket.** Bidirectional streaming over a single long-lived connection established via HTTP upgrade. The client sends audio chunks and receives audio/text responses on the same connection. Ideal for web and cloud-based applications.
**WebRTC.** Peer-to-peer audio/video with built-in NAT traversal, echo cancellation, and noise suppression. Better suited to telephony integration (SIP trunking) and scenarios requiring the lowest possible latency.
Both protocols support the streaming pipeline. WebSocket is simpler to implement; WebRTC provides better audio quality and telephony compatibility.
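For the WebSocket case, audio chunks need a wire format the server can parse. The envelope below is an illustrative sketch — the field names (`type`, `seq`, `audio`) are assumptions for this example; each provider defines its own schema:

```python
import base64
import json

def encode_audio_frame(seq: int, pcm: bytes) -> str:
    """Wrap a raw PCM chunk in a JSON text frame for the WebSocket."""
    return json.dumps({
        "type": "audio_chunk",
        "seq": seq,  # sequence number so the server can detect gaps
        "audio": base64.b64encode(pcm).decode("ascii"),
    })

def decode_audio_frame(frame: str) -> tuple[int, bytes]:
    """Recover the sequence number and raw PCM bytes from a frame."""
    msg = json.loads(frame)
    if msg["type"] != "audio_chunk":
        raise ValueError(f"unexpected frame type: {msg['type']}")
    return msg["seq"], base64.b64decode(msg["audio"])
```

In production, binary WebSocket frames avoid the ~33% base64 overhead; the JSON envelope is shown here because it is the easier format to debug.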
**2) STT Engine Comparison**

| Engine | Latency | Accuracy | Strengths | Best For |
|---|---|---|---|---|
| Deepgram | <300ms | High (contextual) | Real-time streaming, noise-robust, custom vocabulary | Production phone/web agents |
| OpenAI Whisper | 500ms+ | Excellent offline | Open-source, multilingual, self-hostable | Cost-sensitive, batch processing |
| AssemblyAI | <500ms | High | Endpointing, speaker diarization, real-time | Multi-speaker conversations |
Deepgram leads for production voice agents due to its streaming latency and accuracy. Whisper is best for self-hosted or offline scenarios where latency is less critical.
**3) TTS Engine Comparison**

| Engine | Latency | Voice Quality | Strengths | Best For |
|---|---|---|---|---|
| ElevenLabs | <500ms | Ultra-realistic | Emotional prosody, voice cloning | Expressive, personality-driven agents |
| PlayHT | <300ms | Natural | Fast cloning, multilingual | Scalable streaming applications |
| Cartesia | <200ms | High-fidelity | Ultra-low latency | Real-time interruption handling |
Cartesia excels when latency is the top priority. ElevenLabs produces the most human-like voices for agents where personality and expression matter.
**4) Orchestration Platforms**

Platforms provide pre-built orchestration so you do not need to wire together STT, LLM, and TTS manually:
| Platform | Type | Key Features | Customization | Best For |
|---|---|---|---|---|
| Vapi | Framework | Orchestration layer, provider-agnostic, telephony | High | Custom voice agents with full control |
| LiveKit | Open-source | WebRTC agents, real-time streaming, scalable | Very high | Self-hosted, maximum flexibility |
| Retell AI | End-to-end | Unified API, interruption handling | Medium | Rapid prototyping |
| Bland AI | End-to-end | Conversational focus, easy deployment | Medium | Quick phone agent deployment |
Vapi and LiveKit suit teams that want control over each pipeline component; Retell and Bland suit teams that want to ship fast with less engineering effort.
**5) LLM Integration**

The LLM sits at the center of the pipeline, receiving transcripts and generating responses.
Optimize LLM prompts for spoken output: shorter sentences, no markdown, no code blocks, conversational tone.
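Even with a well-tuned prompt, models occasionally emit markup, so it helps to sanitize responses before synthesis. A minimal sketch using assumed regex heuristics (not a full markdown parser):

```python
import re

def prepare_for_tts(text: str) -> str:
    """Strip markup a TTS engine would otherwise read aloud."""
    text = re.sub(r"`{3}.*?`{3}", "", text, flags=re.DOTALL)  # drop code blocks
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)      # keep link text only
    text = re.sub(r"[*_#`]", "", text)                        # strip inline markup
    return re.sub(r"\s+", " ", text).strip()                  # collapse whitespace
```

Run this on each LLM text chunk before it reaches the TTS stage; the same hook is a natural place to expand abbreviations or numbers for better pronunciation.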
**6) Latency Budget**

Target: under 1 second end-to-end (STT ~300ms + LLM ~300ms + TTS ~200ms).
Measure Time-to-First-Token (TTFT) and Time-to-First-Audio as primary latency metrics.
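TTFT can be measured against any streaming token iterator by timestamping the first yield. A sketch, with `fake_llm_stream` as a stand-in for a real LLM stream:

```python
import time

def measure_ttft(token_stream):
    """Return (ttft_seconds, full_text) for a streaming response."""
    start = time.monotonic()
    ttft, parts = None, []
    for token in token_stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first token arrived
        parts.append(token)
    return ttft, "".join(parts)

def fake_llm_stream():
    """Stand-in generator simulating per-token latency."""
    for token in ["Hello", ",", " world"]:
        time.sleep(0.01)
        yield token
```

Time-to-First-Audio is measured the same way, but with the clock stopped at the first TTS audio chunk rather than the first text token.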
**7) Interruption Handling (Barge-In)**

Users frequently interrupt mid-response (barge-in). The agent must:
- Detect that the user has started speaking (via voice activity detection or incoming STT events)
- Stop TTS playback immediately and flush any queued audio
- Cancel or truncate the in-flight LLM generation
- Yield the turn, optionally with a brief acknowledgment ("Sorry, go ahead")
Platforms like LiveKit and Vapi handle interruption natively. For custom builds, implement a state machine that transitions between listening, thinking, speaking, and interrupted states.
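A minimal sketch of that state machine; the states match the ones named above, while the event names and transition table are illustrative assumptions (real agents add timers, VAD thresholds, and error states):

```python
# (current_state, event) -> next_state
TRANSITIONS = {
    ("listening", "user_stopped"): "thinking",
    ("thinking", "response_ready"): "speaking",
    ("speaking", "playback_done"): "listening",
    ("speaking", "user_spoke"): "interrupted",   # barge-in
    ("interrupted", "tts_flushed"): "listening",
}

class AgentState:
    """Tracks which phase of the turn the agent is in."""

    def __init__(self):
        self.state = "listening"

    def on(self, event: str) -> str:
        """Apply an event; raise on transitions the table does not allow."""
        nxt = TRANSITIONS.get((self.state, event))
        if nxt is None:
            raise ValueError(f"no transition for {event!r} in state {self.state!r}")
        self.state = nxt
        return nxt
```

The explicit table makes illegal transitions fail loudly — e.g. audio playback events arriving after a barge-in has already flushed the TTS queue.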
**8) Production Deployment**

Follow a phased approach: prototype with a single conversation flow, pilot with a limited set of real calls, then scale to production with monitoring of latency, transcription accuracy, and interruption behavior.
Voice agent costs are typically $0.05-0.20 per minute depending on the provider stack. Major cost components:

- STT: billed per minute of audio transcribed
- LLM: billed per input and output token
- TTS: billed per character or per minute of synthesized audio
- Telephony/transport: per-minute carrier or infrastructure fees
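A back-of-the-envelope model for combining those components into a per-minute figure. Every rate below is an illustrative placeholder, not actual provider pricing:

```python
def cost_per_minute(stt_per_min: float, tts_per_min: float,
                    llm_per_1k_tokens: float, tokens_per_min: float,
                    telephony_per_min: float = 0.0) -> float:
    """Sum the per-minute cost of each pipeline component."""
    llm_cost = llm_per_1k_tokens * tokens_per_min / 1000
    return stt_per_min + tts_per_min + llm_cost + telephony_per_min

# Hypothetical stack: budget STT, premium TTS, small LLM, SIP trunk
estimate = cost_per_minute(stt_per_min=0.01, tts_per_min=0.05,
                           llm_per_1k_tokens=0.002, tokens_per_min=1000,
                           telephony_per_min=0.01)
```

Running the model with real quotes from your chosen providers quickly shows that TTS usually dominates when using premium voices, which is why engine choice in section 3 is often a cost decision as much as a quality one.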