Core Concepts
Reasoning Techniques
Memory Systems
Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools & Products
Safety & Governance
Evaluation
Research
Development
Meta
Core Concepts
Reasoning Techniques
Memory Systems
Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools & Products
Safety & Governance
Evaluation
Research
Development
Meta
Voice agents are AI systems that conduct real-time spoken conversations, combining automatic speech recognition (ASR), large language model reasoning, and text-to-speech (TTS) synthesis into a seamless pipeline. By 2026, voice agents handle sales calls, customer support, appointment scheduling, and complex multi-turn dialogues with sub-second latency and human-like naturalness.
The standard voice agent pipeline processes audio in three stages:
The total end-to-end latency target is under 800ms for natural conversational flow.
The OpenAI Realtime API bypasses the traditional pipeline by streaming audio directly to and from the LLM, enabling true speech-to-speech processing. This reduces latency by eliminating the ASR/TTS serialization steps and allows the model to use vocal cues like tone and emphasis.
import asyncio import websockets import json async def voice_agent_realtime(): url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview" headers = { "Authorization": f"Bearer {api_key}", "OpenAI-Beta": "realtime=v1" } async with websockets.connect(url, extra_headers=headers) as ws: # Configure the session await ws.send(json.dumps({ "type": "session.update", "session": { "modalities": ["text", "audio"], "voice": "alloy", "instructions": "You are a helpful customer service agent.", "turn_detection": {"type": "server_vad"} } })) # Stream audio input and receive audio output async for message in ws: event = json.loads(message) if event["type"] == "response.audio.delta": play_audio_chunk(event["delta"]) elif event["type"] == "response.text.delta": print(event["delta"], end="")
Voice agents connect to phone networks through SIP trunking and WebRTC:
Latency is the critical metric for voice agents. Key optimization strategies:
| Component | Target Latency | Leading Solution |
| ASR | 100-300ms | Deepgram, multilingual-tuned |
| LLM | 200-500ms | OpenAI Realtime, streaming |
| TTS | <200ms | ElevenLabs, Deepgram Aura-2 |
| Total E2E | <800ms | Retell AI, SquadStack |