GPT-Realtime-2 is OpenAI's flagship native speech-to-speech model designed for production voice agent applications. Released in 2026, it represents a significant advancement in real-time conversational AI by integrating GPT-5-class reasoning capabilities directly into a speech processing pipeline, enabling voice agents to understand, reason about, and respond to spoken input with minimal latency while maintaining sophisticated cognitive processing.1)
GPT-Realtime-2 functions as an end-to-end speech model that processes audio input and generates spoken responses without requiring intermediate text conversion steps. This native speech-to-speech architecture eliminates traditional pipeline inefficiencies where separate speech recognition, language understanding, and speech synthesis components introduce cumulative latency and potential information loss at conversion boundaries. Speech-to-speech models like GPT-Realtime-2 represent a category of neural models that take audio input and produce audio output directly, combining speech recognition, reasoning, and synthesis in a single pipeline.2) GPT-Realtime-2 implements speech-to-speech reasoning as a native capability that enables reasoning and tool use mid-conversation without converting to text intermediates, fundamentally distinguishing this approach from hybrid architectures that rely on intermediate representations.3)
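As an illustration of this streaming interaction model, the sketch below builds the client-side events a Realtime API session sends to stream microphone audio and request a spoken reply. The event names (`input_audio_buffer.append`, `response.create`) follow OpenAI's existing Realtime API; applying them unchanged to GPT-Realtime-2 is an assumption based on this article, and the audio bytes are placeholder data.

```python
import base64
import json

# Minimal sketch of the client events used to stream audio to a
# Realtime API session and request a spoken response. Event names
# follow OpenAI's Realtime API; their applicability to GPT-Realtime-2
# is assumed, and the audio payload here is placeholder data.

def audio_append_event(pcm_bytes: bytes) -> str:
    """Wrap a chunk of PCM16 audio in an input_audio_buffer.append event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })

def response_request_event() -> str:
    """Ask the model to generate a spoken (and optional text) response."""
    return json.dumps({
        "type": "response.create",
        "response": {"modalities": ["audio", "text"]},
    })

chunk = audio_append_event(b"\x00\x01" * 160)  # placeholder audio frame
request = response_request_event()
```

In a live session these JSON strings would be sent over the API's WebSocket connection; audio output then arrives back as a stream of server events rather than a single response.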
The model achieves a score of 96.6% on the Big Bench Audio benchmark, a comprehensive evaluation suite for audio understanding tasks. This performance level indicates significant capability across diverse audio processing scenarios including noise robustness, speaker adaptation, and complex acoustic feature recognition. It represents a 5 percentage point improvement over its predecessor, GPT-Realtime-1.5, which was released three months prior and scored 91.6% on the same benchmark.4) GPT-Realtime-2 also demonstrates substantial advancement over earlier variants, whose Big Bench Audio scores stood at 81.4%.5) That roughly 15 percentage point gain over the earliest variants suggests the model is approaching saturation on this metric.6)
The reasoning capabilities integrated into GPT-Realtime-2 derive from GPT-5, OpenAI's reasoning-capable language model that provides the foundational cognitive processing enabling voice agents to handle complex customer service scenarios requiring deliberation beyond simple pattern matching.7) The model can therefore perform multi-step reasoning during live conversations: voice agents can handle complex queries requiring logical inference, contextual understanding across extended dialogue histories, and sophisticated problem-solving without delegating reasoning tasks to separate backend systems.
GPT-Realtime-2 supports a 128K token context window, allowing voice agents to maintain awareness of extended conversation histories and reference materials. This capacity enables sophisticated multi-turn conversations in which the model can track complex dialogue threads, maintain consistency with previously established facts, and reference earlier statements made during a conversation session. The expanded window quadruples GPT-Realtime-1.5's 32K token capacity, significantly enhancing extended dialogue management.8) OpenAI CEO Sam Altman framed the launch around behavioral shifts in how users interact with AI, noting increased voice usage when users need to “dump” lots of context, a use case the expanded context window directly addresses.9)
The model implements adjustable reasoning effort through five distinct settings: minimal, low, medium, high, and xhigh. This graduated reasoning framework allows developers to balance computational cost against reasoning depth for specific use cases. Applications requiring rapid responses with basic reasoning can utilize minimal effort settings, while complex problem-solving scenarios can activate higher reasoning levels despite increased latency. The adjustable reasoning architecture employs a latency-hiding mechanism where models generate conversational preambles while background reasoning processing occurs, enabling improved responsiveness without sacrificing cognitive depth.10)
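A session could select one of these effort levels along the following lines. The five level names come from this article; the `reasoning_effort` field name and the `session.update` payload shape are assumptions modeled on the Realtime API's session configuration, not a confirmed schema.

```python
import json

# Sketch of selecting a reasoning effort level for a session. The five
# effort names come from the article; the "reasoning_effort" field and
# the session.update payload shape are assumed, not confirmed.

REASONING_EFFORTS = ("minimal", "low", "medium", "high", "xhigh")

def session_update_event(effort: str = "low") -> str:
    """Build a session.update event choosing a reasoning effort level."""
    if effort not in REASONING_EFFORTS:
        raise ValueError(f"effort must be one of {REASONING_EFFORTS}")
    return json.dumps({
        "type": "session.update",
        "session": {"reasoning_effort": effort},  # hypothetical field name
    })
```

In practice an application would pick `minimal` or `low` for latency-sensitive small talk and reserve `high` or `xhigh` for turns that require genuine deliberation.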
Time-to-first-audio, measured as the elapsed time from receiving user audio input to generating the first audio output token, demonstrates the latency-reasoning tradeoff: lower effort settings respond faster, while higher settings accept added delay in exchange for deeper reasoning. This metric is a critical performance characteristic for maintaining conversational naturalness in voice interactions.
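Time-to-first-audio can be computed from a session's event log as the gap between the committed user audio and the first audio chunk from the model. The sketch below assumes millisecond timestamps and uses Realtime API event names for illustration.

```python
# Sketch of measuring time-to-first-audio from an event log: the gap
# between the user's audio being committed and the first audio delta
# from the model. Timestamps (in ms) and the log are illustrative.

def time_to_first_audio(events):
    """events: list of (event_type, timestamp_ms) pairs, in arrival order."""
    t_input = t_audio = None
    for etype, t in events:
        if etype == "input_audio_buffer.commit" and t_input is None:
            t_input = t
        if etype == "response.audio.delta" and t_audio is None:
            t_audio = t
    if t_input is None or t_audio is None:
        return None  # incomplete turn: no latency to report
    return t_audio - t_input

log = [("input_audio_buffer.commit", 10_000),
       ("response.created", 10_050),
       ("response.audio.delta", 10_820)]
# time_to_first_audio(log) -> 820 (milliseconds)
```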
GPT-Realtime-2 includes tool use capabilities, enabling voice agents to call external functions and APIs during conversations. This allows agents to perform actions such as database queries, third-party service integration, calendar management, and transaction processing while maintaining natural conversational flow without transferring control to separate systems.
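A tool exposed to the agent might be declared as follows. The function-calling shape mirrors the Realtime API's `tools` array; `lookup_order` is a hypothetical example function, not part of any real API.

```python
import json

# Sketch of advertising a callable tool to the voice agent. The tools
# array and JSON Schema parameters follow the Realtime API's function
# calling shape; "lookup_order" is a hypothetical example function.

ORDER_LOOKUP_TOOL = {
    "type": "function",
    "name": "lookup_order",  # hypothetical backend function
    "description": "Fetch the status of a customer order by its ID.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "Order identifier"},
        },
        "required": ["order_id"],
    },
}

def tools_update_event(tools: list) -> str:
    """Advertise the available tools to the model via session.update."""
    return json.dumps({"type": "session.update", "session": {"tools": tools}})
```

When the model decides to call the tool mid-conversation, the application executes the function and returns the result, after which the agent can speak the outcome without breaking conversational flow.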
The model supports transparency mechanisms that enable agents to explain their reasoning, clarify which tools they are invoking, and provide users with visibility into decision-making processes. This transparency feature addresses critical requirements for trustworthy AI deployment in production settings where users need to understand how systems arrive at conclusions or take actions on their behalf.
Interruption recovery functionality allows voice agents to gracefully handle user interruptions during agent responses. Rather than requiring complete restart of response generation, the model can recognize mid-speech interruptions, acknowledge user input, and pivot to address new user statements, creating more natural conversational experiences similar to human dialogue patterns.
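Client-side interruption handling can be sketched as a small state machine: when the server reports that the user started speaking while agent audio is still playing, the client cancels the in-flight response and stops local playback. The event names follow the Realtime API; the state handling itself is an illustrative simplification.

```python
# Sketch of interruption handling: if the user starts speaking while
# the agent is mid-reply, cancel the in-flight response and stop local
# playback. Event names follow the Realtime API; state handling is a
# simplified illustration.

def handle_event(event: dict, playing: bool):
    """Process one server event; return (still_playing, outgoing_events)."""
    out = []
    etype = event.get("type")
    if etype == "input_audio_buffer.speech_started" and playing:
        out.append({"type": "response.cancel"})  # abort the agent's reply
        playing = False                          # stop local audio playback
    elif etype == "response.audio.delta":
        playing = True                           # agent audio is streaming
    elif etype == "response.done":
        playing = False                          # reply finished normally
    return playing, out
```

A real client would also truncate the partially played assistant turn in the conversation history so the model's record matches what the user actually heard.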
GPT-Realtime-2 is specifically engineered for production voice agent deployment across customer service, technical support, accessibility applications, and conversational interfaces. The combination of low latency, sophisticated reasoning, and native speech processing enables real-time deployment scenarios previously requiring custom optimization or hybrid architectures. GPT-Realtime-2 demonstrates state-of-the-art performance in the speech-to-speech category with its 128K token context window and GPT-5-level reasoning capabilities for handling complex customer service scenarios.11) OpenAI's family of speech-to-speech models also includes lower-cost variants, GPT-Realtime-Mini and Realtime-Nano, intended for high-volume support work.12)
Organizations can develop voice-first interfaces for applications spanning virtual assistant systems, phone-based customer support automation, voice-controlled enterprise workflows, and accessibility tools for users preferring audio-based interaction modalities. The tool integration capabilities enable voice agents to function as autonomous systems capable of executing user requests without human intermediation. Deutsche Telekom has deployed live-translated voice support across 14 European markets using GPT-Realtime-2, demonstrating multinational deployment of advanced voice agents for customer service with real-time translation capabilities.13)
OpenAI has also released GPT-Realtime-Translate, a real-time speech translation model supporting streaming translation from 70+ input languages to 13 output languages, enabling live dubbing and voice-to-voice translation without pre-loaded captions.14)
Access to GPT-Realtime-2 and related models, including GPT-Realtime-Whisper, is provided through OpenAI's Realtime API platform for developers.15) The consumer-facing ChatGPT Voice Mode has not yet received the GPT-Realtime-2 upgrade despite API availability; OpenAI has signaled that integration is pending, with public communications promising larger consumer impact to follow and Altman highlighting ongoing improvements to ChatGPT voice.16)
The adjustable reasoning effort framework requires careful configuration for specific use cases. Applications prioritizing responsiveness may sacrifice reasoning depth, while scenarios requiring complex analysis may accept higher latency. Developers must evaluate the reasoning quality-latency tradeoff for particular deployment contexts.
The 128K context window provides substantial dialogue capacity but remains finite; applications with extended multi-session histories may require context compression or retrieval-augmented approaches to maintain relevant information within the active context window.
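A simple context-compression strategy along these lines keeps the most recent turns within a token budget while pinning the system prompt. The 4-characters-per-token estimate is a rough heuristic standing in for a real tokenizer, and the whole function is an illustrative sketch rather than a prescribed approach.

```python
# Sketch of trimming a long conversation to fit a token budget: keep
# the system prompt, then as many of the most recent turns as fit.
# The 4-chars-per-token estimate is a rough heuristic, not a tokenizer.

def estimate_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token."""
    return max(1, len(text) // 4)

def trim_history(system_prompt: str, turns: list, budget_tokens: int) -> list:
    """Return the system prompt plus the newest turns that fit the budget."""
    used = estimate_tokens(system_prompt)
    kept = []
    for turn in reversed(turns):          # walk from newest to oldest
        cost = estimate_tokens(turn)
        if used + cost > budget_tokens:
            break                         # oldest turns are dropped first
        kept.append(turn)
        used += cost
    return [system_prompt] + list(reversed(kept))
```

For multi-session assistants, dropped turns could instead be summarized or indexed for retrieval so that salient facts survive beyond the active window.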