For two years, voice AI developers faced the same miserable trade-off: pick a model that sounds human but can't think, or pick one that thinks but takes seven seconds to answer. Either the conversation broke, or the intelligence did. On May 7, OpenAI shipped three new models through its Realtime API that collapse that gap — headlined by GPT-Realtime-2, a speech-to-speech model with GPT-5-class reasoning baked directly into the voice layer. The Realtime API also officially exited beta and became generally available.
This isn't an incremental upgrade. It's the moment voice AI stops being a parlour trick and starts being infrastructure. If your business touches phone calls, customer support, bookings, or multilingual service, the economics just changed.
What OpenAI actually shipped
Three models, each targeting a different piece of the voice stack:
GPT-Realtime-2 is the headline act. It handles speech-to-speech conversation with integrated reasoning, meaning it can listen to a customer, think through a complex request, call external tools (check a calendar, look up an order, query a database), and respond naturally, all in a single conversational turn. The context window quadrupled from 32K to 128K tokens, enough to hold a full customer history during a long call without the model losing track.
The clever engineering is in how OpenAI hid the thinking time. The model now generates preambles — short conversational fillers like "let me check that for you" — that play while reasoning runs in the background. It can also call multiple tools in parallel and narrate what it's doing: "checking your calendar" or "looking that up now." The silence that used to expose AI as AI now sounds like a person doing their job.
GPT-Realtime-Translate handles live translation across 70+ input languages and 13 output languages, keeping pace with the speaker in real time. Deutsche Telekom is already testing it for multilingual voice support across 14 European markets.
GPT-Realtime-Whisper delivers streaming speech-to-text with low enough latency to power live captions, meeting notes, and real-time transcription workflows.
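The preamble-and-parallel-tools behaviour described for GPT-Realtime-2 can be illustrated with a minimal sketch. This is not OpenAI's implementation or the Realtime API: `check_calendar`, `look_up_order`, `speak`, and `handle_turn` are hypothetical stand-ins, and the delays are simulated. It only shows the general pattern of starting tool calls concurrently and filling the wait with a short spoken line.

```python
import asyncio

# Hypothetical tool stubs: stand-ins for real calendar and order lookups.
async def check_calendar(customer_id: str) -> str:
    await asyncio.sleep(0.05)  # simulated network latency
    return f"next free slot for {customer_id}: Tuesday 10:00"

async def look_up_order(order_id: str) -> str:
    await asyncio.sleep(0.08)  # simulated network latency
    return f"order {order_id}: shipped"

async def speak(text: str, transcript: list) -> None:
    # Stand-in for audio output: record what would be said.
    transcript.append(text)

async def handle_turn(customer_id: str, order_id: str) -> list:
    transcript = []
    # Start both tool calls at once so they run concurrently.
    tools = asyncio.gather(check_calendar(customer_id), look_up_order(order_id))
    # Speak the preamble while the tool calls are still in flight.
    await speak("Let me check that for you.", transcript)
    calendar, order = await tools
    await speak(f"{order}; {calendar}.", transcript)
    return transcript

if __name__ == "__main__":
    for line in asyncio.run(handle_turn("c42", "A-1")):
        print(line)
```

Because the two lookups overlap, the turn takes roughly as long as the slower tool rather than the sum of both, and the caller hears the preamble instead of silence.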
The benchmarks — and the asterisk
The performance gains are genuine. GPT-Realtime-2 scored 96.6% on Big Bench Audio, up from 81.4% for GPT-Realtime-1.5 — a 15.2-point improvement on a benchmark that tests reasoning over audio. On Audio MultiChallenge, which measures multi-turn conversational intelligence, it scored 48.5% at the highest reasoning effort versus 34.7% for its predecessor.
But there's an important asterisk, as The Neuron Daily noted: the default reasoning effort for GPT-Realtime-2 is "low," which means those headline benchmarks — run at "high" and "xhigh" settings — aren't what most applications will ship with out of the box. Developers who want the smart version need to explicitly crank up reasoning effort, which adds latency and cost. OpenAI offers five levels (minimal, low, medium, high, xhigh) so builders can tune the trade-off for each use case.
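One way to make that trade-off explicit is to pick an effort level per use case instead of relying on the default. The sketch below is illustrative only: the five level names and the "low" default come from the reporting above, but the use-case mapping, the model identifier string, and the `reasoning_effort` key are assumptions, not a confirmed API schema.

```python
# The five levels and the default as reported above.
EFFORT_LEVELS = ("minimal", "low", "medium", "high", "xhigh")
DEFAULT_EFFORT = "low"

# Hypothetical mapping from use case to effort; tune against your own latency tests.
USE_CASE_EFFORT = {
    "greeting": "minimal",
    "order_status": "low",
    "booking_change": "medium",
    "billing_dispute": "high",
}

def build_session_config(use_case: str) -> dict:
    """Return an illustrative session config for the given use case."""
    effort = USE_CASE_EFFORT.get(use_case, DEFAULT_EFFORT)
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown reasoning effort: {effort}")
    # "model" and "reasoning_effort" are assumed key names, not a documented schema.
    return {"model": "gpt-realtime-2", "reasoning_effort": effort}

if __name__ == "__main__":
    print(build_session_config("billing_dispute"))
    print(build_session_config("small_talk"))  # falls back to the default
```

Keeping the mapping in one place makes it easy to raise effort only where a wrong answer is costly, and to leave simple turns at the faster, cheaper settings.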
Why the pricing matters more than the benchmarks
The real disruption here isn't the intelligence — it's the economics. GPT-Realtime-Translate at $0.034 per minute and GPT-Realtime-Whisper at $0.017 per minute aggressively undercut the traditional approach of stacking separate transcription, language model, and text-to-speech services together.
For context, enterprise voice products such as ElevenLabs Conversational AI run $0.15 to $0.30 per minute, and that's before adding reasoning capabilities. Google's Gemini Live competes on raw audio pricing at around $0.018 per minute, but doesn't yet match the integrated reasoning and tool-calling that GPT-Realtime-2 offers.
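To see what those per-minute rates mean at volume, here is a rough sketch that multiplies each quoted price by a monthly minute count. The prices are the ones cited above; the 10,000-minute volume is a hypothetical figure chosen for illustration, and the products are not like-for-like (translation, transcription, and full conversational agents do different jobs).

```python
# Per-minute prices in USD as quoted above.
PRICE_PER_MINUTE_USD = {
    "GPT-Realtime-Translate": 0.034,
    "GPT-Realtime-Whisper": 0.017,
    "ElevenLabs Conversational AI (low end)": 0.15,
    "ElevenLabs Conversational AI (high end)": 0.30,
    "Gemini Live (approx.)": 0.018,
}

def monthly_cost(price_per_minute: float, minutes: int) -> float:
    """Cost in USD for a given number of minutes, rounded to cents."""
    return round(price_per_minute * minutes, 2)

def cost_table(minutes: int) -> dict:
    """Monthly cost per service for the same number of minutes."""
    return {name: monthly_cost(price, minutes)
            for name, price in PRICE_PER_MINUTE_USD.items()}

if __name__ == "__main__":
    # 10,000 minutes per month is an assumed volume, not a figure from the article.
    for name, cost in cost_table(10_000).items():
        print(f"{name}: ${cost:,.2f}")
```

At that assumed volume, live translation comes to $340 a month, against $1,500 to $3,000 for the quoted ElevenLabs range.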
The competitive pressure is already visible. Within a day of OpenAI's announcement, ElevenLabs slashed API pricing by 40% across all voice synthesis tiers. The race to the bottom is on.
The business case for Australian companies
This matters because voice is where most businesses still run on humans and hold music. Gartner forecasts that conversational AI will reduce contact centre agent labour costs by $80 billion globally in 2026, with customer support automation already representing 44.2% of AI voice agent revenue. Production voice agent deployments have increased 340% year-over-year, and 67% of Fortune 500 companies now run production voice systems.
For an Australian business running a 10-person support team, the calculus is straightforward. A voice agent that can reason through complex requests, call your CRM, check order status, and handle the conversation in the customer's preferred language was a custom engineering project six months ago. Now it's an API call with published pricing.
The multilingual angle is particularly relevant for Australian businesses serving diverse communities or operating across the Asia-Pacific region. Live translation at about three cents a minute removes a cost barrier that previously made multilingual voice support impractical for anyone below enterprise scale.
This follows OpenAI's broader push into business automation with workspace agents — but where those agents work through text and dashboards, GPT-Realtime-2 extends the same intelligence to the phone line.
What to watch
Three things will determine whether this lives up to the hype:
Latency in production. Benchmarks are run in controlled conditions. The real test is whether GPT-Realtime-2 at "medium" or "high" reasoning effort stays fast enough for natural conversation when your customer is frustrated and your tool calls hit a slow API. Early adopters like Zillow and Priceline, which is working toward full trip management by voice, will set the standard.
The competitive response. Google's Gemini Live, xAI's Grok Voice Agent API ($0.05/minute flat rate), and a growing field of open-source options like NVIDIA PersonaPlex are all competing for the same market. The OpenAI Realtime protocol is becoming a de facto standard — competitors are adopting its event schema — but pricing pressure will intensify through the rest of 2026.
Regulation and transparency. OpenAI's usage policies require developers to disclose when users are talking to AI. As voice agents get good enough to pass for human, expect regulators — including in Australia, where the AI governance framework is taking shape — to scrutinise whether businesses are meeting that bar.
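On the first of these points, one common defence against a slow tool call is to give it a time budget and fall back to a holding line when the budget runs out. The sketch below shows that pattern under stated assumptions: `slow_crm_lookup` and `call_tool_with_budget` are hypothetical names, the delay is simulated, and nothing here is part of the Realtime API.

```python
import asyncio

FALLBACK_LINE = "That's taking longer than expected; I'll keep checking."

# Hypothetical tool stub: the delay stands in for a slow backend.
async def slow_crm_lookup(delay_s: float) -> str:
    await asyncio.sleep(delay_s)
    return "account found"

async def call_tool_with_budget(delay_s: float, budget_s: float) -> str:
    """Return the tool result, or a holding line if the budget is exceeded."""
    try:
        return await asyncio.wait_for(slow_crm_lookup(delay_s), timeout=budget_s)
    except asyncio.TimeoutError:
        # The lookup is cancelled on timeout; a real agent might retry in the background.
        return FALLBACK_LINE

if __name__ == "__main__":
    print(asyncio.run(call_tool_with_budget(delay_s=0.01, budget_s=0.5)))
    print(asyncio.run(call_tool_with_budget(delay_s=0.5, budget_s=0.01)))
```

The right budget depends on how much silence a caller will tolerate, which is worth measuring against real calls rather than guessing.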
The voice AI market is projected to reach $35.2 billion by 2033, growing at 39% annually. GPT-Realtime-2 isn't the only horse in the race, but it's the first to combine reasoning, tool use, translation, and transcription in a single integrated stack at aggressive pricing. For businesses that have been waiting for voice AI to get smart enough to trust with real customers, the waiting is over.
Sources
- Advancing voice intelligence with new models in the API — OpenAI
- GPT-Realtime-2: OpenAI's voice agents finally got a brain — The Neuron Daily
- Gartner Predicts Conversational AI Will Reduce Contact Center Agent Labor Costs by $80 Billion in 2026 — Gartner
- AI Voice Agents Market Size, Share — Industry Report, 2033 — Grand View Research
- Best Realtime Voice AI APIs for Agents in 2026 — APIScout
- The State of Voice Agents in 2026 — AI Voice Research
