GPT-Realtime-2 and GPT-Realtime-1.5 represent consecutive iterations of OpenAI's real-time conversational AI platform, with GPT-Realtime-2 introducing substantial enhancements across multiple dimensions of performance, capability, and user experience. Released in May 2026, GPT-Realtime-2 builds upon the foundation established by GPT-Realtime-1.5 (released approximately three months earlier) with measurable improvements in audio understanding, instruction adherence, context handling, and conversational robustness 1).
As an advanced voice AI model capable of thinking, calling tools, and maintaining conversational flow during live calls, GPT-Realtime-2 represents a significant progression in real-time AI interaction capabilities 2).
The most immediately quantifiable differences between the two versions appear in standardized benchmark performance. On the Big Bench Audio evaluation suite, GPT-Realtime-2 achieves 96.6% accuracy, a 15.2-percentage-point improvement over GPT-Realtime-1.5's 81.4% 3).
Instruction retention capabilities show even more dramatic gains. When evaluated on Scale AI's instruction adherence benchmarks, GPT-Realtime-2 achieves 70.8% instruction retention compared to GPT-Realtime-1.5's 36.7%—nearly a doubling of performance. This improvement suggests enhanced ability to maintain user-specified constraints and preferences throughout extended conversations, addressing a critical limitation in earlier versions where models frequently drifted from initial instructions 4). As a baseline realtime voice interaction model, GPT-Realtime-2 serves as a benchmark for evaluating subsequent generations of realtime voice models 5).
GPT-Realtime-2 expands the maximum context window to 128K tokens, quadrupling GPT-Realtime-1.5's 32K limit. The larger window enables substantially longer conversational histories without context switching, improved document summarization and analysis, and more effective multi-turn dialogue in which historical context remains accessible to the model throughout extended sessions.
The expanded context window addresses fundamental limitations in real-time systems where extended conversations or large document processing previously required manual context management or session fragmentation. By maintaining a 128K token context, GPT-Realtime-2 enables more natural conversational flows and reduces the cognitive burden on users managing conversation length constraints 6).
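Client-side context management of the kind the larger window reduces can be sketched as follows. This is a minimal illustration, not part of any documented API: token counts are approximated with a naive whitespace split (a real client would use the model's tokenizer), and the message format is assumed.

```python
CONTEXT_LIMIT = 128_000  # GPT-Realtime-2 window (vs. 32_000 for GPT-Realtime-1.5)

def approx_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())

def trim_history(messages: list[dict], limit: int = CONTEXT_LIMIT) -> list[dict]:
    """Keep the most recent messages whose combined size fits within the limit."""
    kept, total = [], 0
    for msg in reversed(messages):  # walk backward from the newest message
        cost = approx_tokens(msg["content"])
        if total + cost > limit:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))  # restore chronological order
```

With a 4x larger limit, far fewer conversations ever reach the point where `trim_history` has to drop older turns, which is the practical benefit described above.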
GPT-Realtime-2 introduces improved interruption recovery, a critical feature for real-time voice and text interactions where users frequently interject mid-response. The system now gracefully handles user interruptions by recognizing conversational overlap, managing the transition between user input and model response, and maintaining conversational coherence despite these natural interruptions. This enhancement particularly benefits voice-based interfaces where simultaneous speech naturally occurs 7).
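The turn-taking behavior described above can be illustrated with a minimal state machine. The states, method names, and the choice to retain the truncated transcript are all illustrative assumptions about how a client might mirror this behavior, not the model's internal mechanism:

```python
class TurnManager:
    """Toy sketch of interruption recovery in a voice session."""

    def __init__(self):
        self.state = "idle"
        self.partial_response = ""

    def model_speaks(self, chunk: str):
        # Model is streaming audio/text to the user.
        self.state = "responding"
        self.partial_response += chunk

    def user_interrupts(self):
        # On overlap: stop emitting, switch to listening, and keep the
        # truncated response so the conversation history reflects what
        # the user actually heard before cutting in.
        if self.state == "responding":
            self.state = "listening"
            return self.partial_response
        return None
```

Retaining the partial transcript is what lets the model stay coherent after an interruption rather than resuming as if its full response had been delivered.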
GPT-Realtime-2 implements parallel tool calls with transparency, allowing the model to invoke multiple external tools simultaneously rather than sequentially. This architectural improvement reduces latency in multi-step workflows where independent tool invocations can execute concurrently. The transparency component ensures users can observe which tools the model is invoking, understand the parameters being passed, and verify the reasoning behind tool selection—addressing explainability concerns in earlier versions 8).
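The latency benefit of concurrent invocation can be shown with a small `asyncio` sketch. The tool names, arguments, and transparency log format here are invented for illustration; only the concurrency pattern itself (independent calls gathered in parallel rather than awaited one by one) is the point:

```python
import asyncio

async def call_tool(name: str, args: dict, log: list) -> dict:
    log.append(f"invoking {name} with {args}")  # transparency: record each invocation
    await asyncio.sleep(0.01)                   # stand-in for real tool latency
    return {"tool": name, "result": "ok"}

async def run_parallel(calls: list[tuple[str, dict]]) -> tuple[list, list]:
    log: list[str] = []
    # Independent calls execute concurrently; total wall time approximates
    # the slowest single call rather than the sum of all calls.
    results = await asyncio.gather(*(call_tool(n, a, log) for n, a in calls))
    return list(results), log

results, log = asyncio.run(run_parallel([
    ("weather", {"city": "Oslo"}),
    ("calendar", {"day": "today"}),
]))
```

The `log` list plays the role of the transparency component: an observer can see which tools were invoked and with what parameters.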
A novel feature in GPT-Realtime-2 involves adjustable reasoning effort levels, allowing users to trade computational resources against response quality for specific tasks. This mechanism enables users to specify how much computational reasoning they want the model to apply—from rapid shallow reasoning for time-sensitive queries to deeper analysis for complex problem-solving scenarios. This flexibility addresses heterogeneous use cases where optimal reasoning depth varies based on task requirements and latency constraints 9).
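A client might expose such a trade-off as a simple session setting. The field names (`reasoning_effort`, `compute_budget`) and the budget values below are hypothetical, chosen only to illustrate mapping a requested effort level to a compute allowance:

```python
# Relative compute units per effort level -- illustrative values only.
EFFORT_BUDGETS = {"low": 1, "medium": 4, "high": 16}

def session_config(effort: str) -> dict:
    """Build a session payload for a requested reasoning effort level."""
    if effort not in EFFORT_BUDGETS:
        raise ValueError(f"unknown effort level: {effort!r}")
    return {"reasoning_effort": effort, "compute_budget": EFFORT_BUDGETS[effort]}
```

A latency-sensitive voice query would request `"low"`, while a complex planning task could justify `"high"` despite the slower response.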
| Feature | GPT-Realtime-1.5 | GPT-Realtime-2 |
| --- | --- | --- |
| Big Bench Audio Accuracy | 81.4% | 96.6% |
| Instruction Retention | 36.7% | 70.8% |
| Context Window | 32K tokens | 128K tokens |
| Tool Calls | Sequential | Parallel with transparency |
| Interruption Handling | Basic | Improved recovery |
| Reasoning Configuration | Fixed | Adjustable levels |