====== Voice Agents ======

Voice agents are AI systems that conduct real-time spoken conversations, combining automatic speech recognition (ASR), large language model (LLM) reasoning, and text-to-speech (TTS) synthesis into a seamless pipeline. By 2026, voice agents handle sales calls, customer support, appointment scheduling, and complex multi-turn dialogues with sub-second latency and human-like naturalness.

===== Voice Agent Architecture =====

The standard voice agent pipeline processes audio in three stages:

  - **ASR (Speech-to-Text)** — converts spoken audio to text in 100–300 ms (Deepgram, Whisper, AssemblyAI)
  - **LLM Reasoning** — processes the transcribed text, maintains conversation context, and generates a response in 200–500 ms
  - **TTS (Text-to-Speech)** — converts the response to natural-sounding audio in under 200 ms (ElevenLabs, Deepgram Aura-2)

The total end-to-end latency target is under 800 ms for natural conversational flow.

===== Speech-to-Speech Models =====

The [[https://platform.openai.com/docs/guides/realtime|OpenAI Realtime API]] bypasses the traditional pipeline by streaming audio directly to and from the model, enabling true speech-to-speech processing. This reduces latency by eliminating the ASR/TTS serialization steps and lets the model use vocal cues such as tone and emphasis.
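For contrast with the speech-to-speech approach, the cascaded three-stage loop can be sketched as below. This is a minimal sketch: ''asr_transcribe'', ''llm_respond'', and ''tts_synthesize'' are hypothetical stubs standing in for real provider SDK calls (e.g. Deepgram for ASR, an LLM API for reasoning, ElevenLabs for TTS), not actual APIs.

<code python>
import asyncio

# Hypothetical stage stubs; real implementations would wrap provider SDKs.
async def asr_transcribe(audio: bytes) -> str:
    return "What are your opening hours?"

async def llm_respond(text: str, history: list) -> str:
    history.append({"role": "user", "content": text})
    reply = f"You asked: {text}"
    history.append({"role": "assistant", "content": reply})
    return reply

async def tts_synthesize(text: str) -> bytes:
    return text.encode("utf-8")

async def handle_turn(audio_in: bytes, history: list) -> bytes:
    """One conversational turn, with the three stages run serially."""
    transcript = await asr_transcribe(audio_in)     # 100-300 ms in practice
    reply = await llm_respond(transcript, history)  # 200-500 ms
    return await tts_synthesize(reply)              # <200 ms

history: list = []
audio_out = asyncio.run(handle_turn(b"\x00\x01", history))
</code>

Because the stages run serially, each stage's latency adds directly to the response time, which is what motivates the streaming and overlap techniques covered under Latency Optimization below.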
A sketch of a Realtime API client session over WebSocket (assumes the third-party ''websockets'' package and an API key in the environment):

<code python>
import asyncio
import base64
import json
import os

import websockets  # pip install websockets

API_KEY = os.environ["OPENAI_API_KEY"]

def play_audio_chunk(b64_audio: str) -> None:
    # Application-supplied playback; audio deltas arrive base64-encoded
    pcm = base64.b64decode(b64_audio)
    ...  # hand pcm to your audio output device

async def voice_agent_realtime():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "OpenAI-Beta": "realtime=v1",
    }
    # websockets < 14 takes extra_headers; later releases renamed it additional_headers
    async with websockets.connect(url, extra_headers=headers) as ws:
        # Configure the session: audio in/out, server-side voice activity detection
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "voice": "alloy",
                "instructions": "You are a helpful customer service agent.",
                "turn_detection": {"type": "server_vad"},
            },
        }))
        # Stream audio input and receive audio output
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.audio.delta":
                play_audio_chunk(event["delta"])
            elif event["type"] == "response.text.delta":
                print(event["delta"], end="")

asyncio.run(voice_agent_realtime())
</code>

===== Telephony Integration =====

Voice agents connect to phone networks through SIP trunking and WebRTC:

^ Provider ^ Protocol ^ Key Feature ^
| [[https://www.twilio.com|Twilio]] | SIP, WebRTC | Programmable voice, global reach, media streams |
| [[https://www.vonage.com|Vonage]] | SIP, WebSocket | AI Studio, low-latency audio streaming |
| [[https://vapi.ai|Vapi]] | SIP, WebRTC | 4,200+ configuration points, CRM orchestration |
| [[https://www.retellai.com|Retell AI]] | SIP, WebSocket | Real-time adaptive workflows, built-in analytics |

===== Latency Optimization =====

Latency is the critical metric for voice agents.
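A back-of-envelope budget using the worst-case stage figures quoted above shows why naively serializing the stages misses the 800 ms target; the 80% overlap figure below is an illustrative assumption, not a measured value.

<code python>
# Worst-case per-stage latencies (ms) from the pipeline figures above
stages = {"ASR": 300, "LLM": 500, "TTS": 200}

serial_total = sum(stages.values())
print(f"serial worst case: {serial_total} ms")   # 1000 ms, over the 800 ms target

# With streaming, TTS playback starts on the first LLM tokens rather than
# waiting for the full response; assume (hypothetically) 80% of TTS overlaps.
streamed_total = stages["ASR"] + stages["LLM"] + int(stages["TTS"] * 0.2)
print(f"streamed estimate: {streamed_total} ms")  # 840 ms
</code>

Even the streamed estimate sits near the target, which is why the strategies below attack every stage at once rather than any single one.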
Key optimization strategies:

  * **Streaming responses** — begin TTS playback before the full response is generated
  * **Endpoint detection** — Voice Activity Detection (VAD) determines when the user stops speaking
  * **Interruption handling** — the agent stops speaking immediately when the user interjects
  * **Connection pooling** — maintain persistent WebSocket connections to avoid handshake overhead
  * **Edge deployment** — run ASR/TTS models geographically closer to the user
  * **Response caching** — cache common responses for instant playback

^ Component ^ Target Latency ^ Leading Solution ^
| ASR | 100–300 ms | Deepgram, multilingual-tuned |
| LLM | 200–500 ms | OpenAI Realtime, streaming |
| TTS | <200 ms | ElevenLabs, Deepgram Aura-2 |
| Total E2E | <800 ms | Retell AI, SquadStack |

===== Voice Agent Platforms =====

  * **[[https://elevenlabs.io|ElevenLabs]]** — industry-leading TTS quality with 4.23 MOS scores, hyper-realistic voices, and multilingual support
  * **[[https://vapi.ai|Vapi]]** — enterprise voice agent platform with 4,200+ configuration points, CRM integration, and global telephony
  * **[[https://www.retellai.com|Retell AI]]** — real-time adaptive voice agents with built-in analytics and no external carrier dependency
  * **[[https://www.polyai.com|PolyAI]]** — specialized in multilingual contact center agents with robust interruption handling

===== Key Challenges =====

  * **Interruption handling** — natural conversations involve overlapping speech that agents must handle gracefully
  * **Accent and noise robustness** — ASR must perform well across diverse speakers and environments
  * **Emotional intelligence** — detecting and responding appropriately to caller sentiment
  * **Multilingual code-switching** — handling conversations that switch between languages mid-sentence

===== References =====

  * [[https://platform.openai.com/docs/guides/realtime|OpenAI Realtime API Guide]]
  * [[https://vellum.ai/blog/ai-voice-agent-platforms-guide|AI Voice Agent Platforms Guide]]
  * [[https://deepgram.com/learn/best-voice-ai-agents-2026-buyers-guide|Deepgram Voice AI Buyers Guide 2026]]

===== See Also =====

  * [[agent_orchestration]] — orchestrating voice agent workflows
  * [[function_calling]] — tool calling during voice conversations
  * [[agent_memory_frameworks]] — maintaining conversation memory across calls
  * [[agent_debugging]] — observability for voice agent pipelines