Voice Agents
Voice agents are AI systems that conduct real-time spoken conversations, combining automatic speech recognition (ASR), large language model reasoning, and text-to-speech (TTS) synthesis into a seamless pipeline. By 2026, voice agents handle sales calls, customer support, appointment scheduling, and complex multi-turn dialogues with sub-second latency and human-like naturalness.
Voice Agent Architecture
The standard voice agent pipeline processes audio in three stages:
ASR (Speech-to-Text) — Converts spoken audio to text in 100-300ms (Deepgram, Whisper, AssemblyAI)
LLM Reasoning — Processes the transcribed text, maintains conversation context, and generates a response in 200-500ms
TTS (Text-to-Speech) — Converts the response to natural-sounding audio in under 200ms (ElevenLabs, Deepgram Aura-2)
The total end-to-end latency target is under 800ms for natural conversational flow.
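The cascaded pipeline can be sketched as three stage functions chained per conversational turn, with a check against the end-to-end budget. The stage bodies below are hypothetical stubs standing in for real ASR, LLM, and TTS providers:

```python
import time

# Hypothetical stage stubs -- a real deployment would call an ASR, LLM,
# and TTS provider here. Each returns instantly in this sketch.
def transcribe(audio: bytes) -> str:        # ASR stage (~100-300 ms)
    return "what are your opening hours?"

def generate_reply(text: str) -> str:       # LLM stage (~200-500 ms)
    return "We are open nine to five, Monday through Friday."

def synthesize(text: str) -> bytes:         # TTS stage (<200 ms)
    return text.encode("utf-8")             # stand-in for PCM audio

def handle_turn(audio: bytes) -> bytes:
    """Run one conversational turn through the cascaded pipeline."""
    start = time.monotonic()
    transcript = transcribe(audio)
    reply_text = generate_reply(transcript)
    reply_audio = synthesize(reply_text)
    elapsed_ms = (time.monotonic() - start) * 1000
    assert elapsed_ms < 800, "exceeded the end-to-end latency budget"
    return reply_audio
```

In production each stage would stream into the next rather than run sequentially, which is how pipelines stay under the 800ms budget despite the per-stage latencies above.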
Speech-to-Speech Models
The OpenAI Realtime API bypasses the traditional pipeline by streaming audio directly to and from the LLM, enabling true speech-to-speech processing. This reduces latency by eliminating the ASR/TTS serialization steps and allows the model to use vocal cues like tone and emphasis.
The sketch below assumes `OPENAI_API_KEY` is set in the environment and that `play_audio_chunk` is an application-specific playback function (each `response.audio.delta` event carries a base64-encoded chunk of PCM audio):

```python
import asyncio
import json
import os

import websockets

api_key = os.environ["OPENAI_API_KEY"]

async def voice_agent_realtime():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "OpenAI-Beta": "realtime=v1",
    }
    async with websockets.connect(url, extra_headers=headers) as ws:
        # Configure the session: audio + text output, server-side voice
        # activity detection (VAD) for turn taking.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "voice": "alloy",
                "instructions": "You are a helpful customer service agent.",
                "turn_detection": {"type": "server_vad"},
            },
        }))
        # Stream audio input and receive audio output
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.audio.delta":
                play_audio_chunk(event["delta"])  # app-specific playback
            elif event["type"] == "response.text.delta":
                print(event["delta"], end="", flush=True)
```
Telephony Integration
Voice agents connect to phone networks through SIP trunking and WebRTC:
| Provider | Protocol | Key Feature |
|---|---|---|
| Twilio | SIP, WebRTC | Programmable voice, global reach, media streams |
| Vonage | SIP, WebSocket | AI Studio, low-latency audio streaming |
| Vapi | SIP, WebRTC | 4,200+ config points, CRM orchestration |
| Retell AI | SIP, WebSocket | Real-time adaptive workflows, built-in analytics |
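As one concrete example, Twilio Media Streams delivers call audio over a WebSocket as JSON frames; `media` events carry base64-encoded 8 kHz mu-law audio in `media.payload`. A minimal handler for extracting that audio (sketch based on Twilio's documented frame format):

```python
import base64
import json
from typing import Optional

def handle_twilio_media_message(raw: str) -> Optional[bytes]:
    """Extract the audio payload from one Twilio Media Streams
    WebSocket frame; returns None for non-audio events."""
    msg = json.loads(raw)
    if msg.get("event") == "media":
        # 8 kHz mu-law audio, base64-encoded
        return base64.b64decode(msg["media"]["payload"])
    return None  # "start", "stop", and "mark" frames carry no audio
```

The decoded mu-law bytes would then be resampled or transcoded as needed before being fed to the ASR stage.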
Latency Optimization
Latency is the critical metric for voice agents. Key optimization strategies:
Streaming responses — Begin TTS playback before the full response is generated
Endpoint detection — Voice Activity Detection (VAD) determines when the user stops speaking
Interruption handling — Agent stops speaking immediately when the user interjects
Connection pooling — Maintain persistent WebSocket connections to avoid handshake overhead
Edge deployment — Run ASR/TTS models closer to the user geographically
Response caching — Cache common responses for instant playback
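The first strategy, streaming responses, typically works by grouping LLM text deltas into sentence-sized chunks so TTS synthesis and playback can start as soon as the first sentence is complete. A minimal sketch of that chunking logic (the delta source is assumed to be any iterable of streamed text fragments):

```python
from typing import Iterable, Iterator

def sentence_chunks(deltas: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM text deltas into sentence-sized chunks so
    TTS playback can begin before the full response is generated."""
    buffer = ""
    for delta in deltas:
        buffer += delta
        while any(p in buffer for p in ".!?"):
            # Flush up to and including the first sentence terminator.
            idx = min(i for i in (buffer.find(p) for p in ".!?") if i != -1)
            yield buffer[: idx + 1].strip()
            buffer = buffer[idx + 1:]
    if buffer.strip():
        yield buffer.strip()  # flush any trailing partial sentence
```

Each yielded chunk would be sent to the TTS engine immediately, overlapping synthesis of sentence N+1 with playback of sentence N.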
| Component | Target Latency | Leading Solution |
|---|---|---|
| ASR | 100-300ms | Deepgram, multilingual-tuned |
| LLM | 200-500ms | OpenAI Realtime, streaming |
| TTS | <200ms | ElevenLabs, Deepgram Aura-2 |
| Total E2E | <800ms | Retell AI, SquadStack |
Leading Platforms
ElevenLabs — Industry-leading TTS quality with 4.23 MOS scores, hyper-realistic voices, and multilingual support
Vapi — Enterprise voice agent platform with 4,200+ configuration points, CRM integration, and global telephony
Retell AI — Real-time adaptive voice agents with built-in analytics and no external carrier dependency
PolyAI — Specialized in multilingual contact center agents with robust interruption handling
Key Challenges
Interruption handling — Natural conversations involve overlapping speech that agents must handle gracefully
Accent and noise robustness — ASR must perform well across diverse speakers and environments
Emotional intelligence — Detecting and responding appropriately to caller sentiment
Multilingual code-switching — Handling conversations that switch between languages mid-sentence
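The first challenge, interruption handling (barge-in), usually reduces to cancelling the agent's in-flight playback the moment VAD reports user speech. A minimal asyncio sketch of that control flow (class and method names are illustrative):

```python
import asyncio
from typing import Optional

class BargeInController:
    """Sketch of barge-in handling: when the user starts speaking
    while the agent is talking, cancel the playback task at once."""

    def __init__(self) -> None:
        self.playback_task: Optional[asyncio.Task] = None

    def start_playback(self, coro) -> None:
        # Schedule agent speech as a cancellable task.
        self.playback_task = asyncio.ensure_future(coro)

    def on_user_speech_start(self) -> None:
        # Fired by VAD: stop the agent mid-utterance so the user
        # is never talked over.
        if self.playback_task and not self.playback_task.done():
            self.playback_task.cancel()
```

A production agent would also flush any queued TTS audio and mark the interrupted response in the conversation context so the LLM knows its last turn was cut short.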