Voice AI drops the transcript

The STT→LLM→TTS pipeline is structurally lossy — it transcribes tone, hesitation, and interruption intent into silence. This brief explains what native audio reasoning replaces it with, compares GPT-Realtime-2 and Gemini 3.1 Flash Live on benchmarks that predict real PM failure modes, and closes with three concrete decisions every voice-AI product team needs to make before launch.

The dominant architecture for voice AI over the past two years follows a strict relay: microphone audio goes to a speech-to-text (STT) model, the resulting text goes to a language model, the output text goes to a text-to-speech (TTS) engine. Three specialized systems, stitched together. It's expensive to operate, it introduces roughly 600ms of compounded latency in production, and it has one deeper structural problem that no amount of tuning can fix.1
When you transcribe speech to text, you lose the signal that tells you how something was said. A hesitation before "no" means something different from a flat "no." A rising inflection on "okay" can mean the caller is frustrated rather than agreeing. An overlap where the user starts speaking mid-sentence signals an intent to interrupt, not a data error. The STT layer strips all of it, and the language model downstream never knows it was there.
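A toy sketch of that loss, for illustration only. The `Utterance` fields and the `transcribe` function are hypothetical stand-ins, not any vendor's API; the point is simply that two acoustically different utterances collapse to the same string once only the words are handed on.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """Toy model of what a caller produced (all fields are illustrative)."""
    words: str
    pause_before_s: float    # silence before the caller started speaking
    rising_inflection: bool  # pitch rises at the end of the phrase
    overlaps_agent: bool     # caller started while the agent was still talking


def transcribe(utterance: Utterance) -> str:
    """Toy STT stage: only the words survive the hand-off to the text model."""
    return utterance.words


flat_no = Utterance("no", pause_before_s=0.1,
                    rising_inflection=False, overlaps_agent=False)
hesitant_no = Utterance("no", pause_before_s=1.8,
                        rising_inflection=False, overlaps_agent=False)

# The utterances differ, but the downstream text model sees identical input.
print(flat_no == hesitant_no)                          # False
print(transcribe(flat_no) == transcribe(hesitant_no))  # True
```

Anything the text model needs from the pause or the overlap has to be re-encoded as text by the STT stage, which is exactly what a plain transcript does not do.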
Two major model releases this spring, plus two arXiv surveys posted in the past four days, formalize what the industry is calling the shift to native audio reasoning, where a single model processes raw audio end-to-end without ever transcribing to text.2,3 The pipeline isn't being optimized. It's being replaced.

What native audio reasoning actually does

The term covers a cluster of related capabilities, defined last week in a taxonomy published by academic researchers on arXiv.2 Four categories:
  • Audio-to-text reasoning: the model analyzes acoustic input and generates text reasoning — e.g., "the caller is hesitating on the address, likely uncertain"
  • Audio-to-speech reasoning: the model receives audio and responds directly in audio, without transcribing at any point
  • Audio-visual reasoning: joint reasoning across audio and video streams simultaneously
  • Agentic audio reasoning: the model coordinates tools or subagents while staying in a voice conversation
For most product use cases today, the relevant shift is the second one: audio-to-speech. Instead of STT → text model → TTS, a single model takes raw audio in and generates audio out. It has direct access to prosody, tone, and the timing of the input — and can interrupt itself, detect when the user is about to interrupt, and adjust its response mid-stream.
The practical consequence: voice interactions stop being turn-based. The caller doesn't have to wait for the agent to finish speaking before being understood. The agent doesn't have to wait for a full pause before deciding to respond. This is the design space that the old architecture structurally cannot reach, regardless of how fast you make each individual component.4

The benchmark snapshot — May 2026

Two models now define the production frontier for native audio reasoning.
OpenAI GPT-Realtime-2, released May 7, is the first speech-to-speech model built on the GPT-5 reasoning stack.5 It offers five levels of adjustable reasoning effort (minimal / low / medium / high / xhigh), with the minimal setting achieving a first-audio-chunk latency of 1.12 seconds — trading depth of reasoning against response speed. The context window expanded from 32K to 128K tokens.5
Google Gemini 3.1 Flash Live, released March 27, is Google's native audio reasoning model, offered via the Gemini Live API and Google AI Studio.6 It is available in 200+ countries and applies SynthID watermarking to all audio output, a notable differentiator for regulated environments.6
[Figure: OpenAI's three patterns for building with voice AI: Voice-to-action, Systems-to-voice, and Voice-to-voice.]
The three ways teams build with voice AI today. Native audio reasoning operates in the Voice-to-voice pattern (the rightmost column).5
The benchmarks are close but not identical:
| Benchmark | GPT-Realtime-2 | Gemini 3.1 Flash Live | What it measures |
|---|---|---|---|
| Big Bench Audio | 96.6% | 95.9% (high-thinking) | General audio language understanding |
| Conversational Dynamics | 96.1% (#1) | Not disclosed | Interruption handling, turn-taking |
| ComplexFuncBench Audio | Not disclosed | 90.8% (vs. 71.5% prior gen) | Multi-step tool calls in a voice context |
| Pricing | $32/$64 per 1M audio in/out tokens | $0.50/$3.00 per 1M text/audio in tokens (Live API) | |
Sources: 5, 6, 7, 8
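To make the pricing row concrete, here is a minimal arithmetic sketch. The rates are copied from the table; the token counts are hypothetical, since this brief does not state how many tokens a minute of audio consumes. It also assumes the Gemini figure reads as $0.50 per 1M text input tokens and $3.00 per 1M audio input tokens, so only audio input is compared across the two models.

```python
# Rates in USD per 1M tokens, copied from the pricing row above.
GPT_REALTIME_2 = {"audio_in": 32.00, "audio_out": 64.00}
# Assumption: "$0.50/$3.00 per 1M text/audio in tokens" means text-in / audio-in.
GEMINI_FLASH_LIVE = {"text_in": 0.50, "audio_in": 3.00}


def cost_usd(tokens: int, usd_per_million: float) -> float:
    """Cost of `tokens` tokens at a per-million-token rate."""
    return tokens / 1_000_000 * usd_per_million


# Hypothetical token counts for one call; measure your own traffic instead.
audio_in_tokens = 50_000
audio_out_tokens = 25_000

gpt_in = cost_usd(audio_in_tokens, GPT_REALTIME_2["audio_in"])
gpt_out = cost_usd(audio_out_tokens, GPT_REALTIME_2["audio_out"])
gemini_in = cost_usd(audio_in_tokens, GEMINI_FLASH_LIVE["audio_in"])

print(f"GPT-Realtime-2 audio in:        ${gpt_in:.2f}")     # $1.60
print(f"GPT-Realtime-2 audio out:       ${gpt_out:.2f}")    # $1.60
print(f"Gemini 3.1 Flash Live audio in: ${gemini_in:.2f}")  # $0.15
```

The two pricing cells are not quoted in the same units, so a full per-call comparison would need the Gemini audio output rate, which the table does not give.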
On the deployment side, Zillow reported that its call success rate rose from 69% to 95% after switching to GPT-Realtime-2, a 26-point lift on its hardest adversarial benchmark.5 Gemini 3.1 Flash Live's multi-step tool call accuracy rose to 90.8% on ComplexFuncBench Audio, from 71.5% for its prior generation. For agent-heavy voice flows, where the model needs to retrieve, book, or update records mid-conversation, that is the more relevant benchmark.6

The PM implementation path

Three decisions this evidence clarifies:
1. Match the benchmark to your use case, not the headline number. Big Bench Audio measures language understanding. If your voice product's primary failure mode is "agent said it did something but didn't call the tool," ComplexFuncBench Audio is the number to watch — and Gemini 3.1 Flash Live's 90.8% on that benchmark is the most recent evidence available.6 If your failure mode is "agent doesn't handle interruptions gracefully," GPT-Realtime-2's Conversational Dynamics lead matters more.5
2. Set your reasoning budget intentionally. GPT-Realtime-2's five-level reasoning effort is not a quality dial to max out. At xhigh, you get deeper reasoning and higher latency. At minimal, you get first audio in 1.12s with shallower processing. The right default is low for conversational flows and medium for tool-heavy transactions; high and xhigh are for edge cases where accuracy matters more than response speed.8
3. Build your eval harness around the four failure modes before you ship. Chier Hu, an engineer at Sierra (a customer AI company), identifies the four failure types that lab benchmarks don't capture: tool hallucination (the model says "I've updated your record" but never called the function), conversational confusion (treating "uh-huh" as a command), logical missteps (canceling the wrong flight because tool reasoning and conversation reasoning aren't integrated), and spell-out failure (mishearing a letter during authentication).9 "A voice agent does not get to pick its audio. It has to earn reliability inside the audio it receives."9 Your eval environment needs background noise, interrupted sentences, and adversarial inputs — not a quiet room with clean audio.
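One of the four failure modes, tool hallucination, lends itself to a simple automated check. The sketch below is a minimal illustration, not a description of any particular eval harness: the `Turn` record, the `CLAIM_TO_TOOL` phrase map, and the tool names are all hypothetical, and substring matching on claim phrases is a crude heuristic that a real harness would replace with something more robust.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One agent turn from a logged test call (hypothetical log format)."""
    agent_text: str                                       # what the agent said
    tool_calls: list[str] = field(default_factory=list)  # tools invoked this turn


# Hypothetical map from a spoken claim to the tool call that must back it.
CLAIM_TO_TOOL = {
    "i've updated your record": "update_record",
    "i've cancelled your flight": "cancel_flight",
}


def find_tool_hallucinations(turns: list[Turn]) -> list[int]:
    """Return indices of turns that claim an action without the matching call."""
    flagged = []
    for i, turn in enumerate(turns):
        said = turn.agent_text.lower()
        for phrase, tool in CLAIM_TO_TOOL.items():
            if phrase in said and tool not in turn.tool_calls:
                flagged.append(i)
                break  # one flag per turn is enough
    return flagged


call_log = [
    Turn("I've updated your record.", tool_calls=["update_record"]),
    Turn("I've updated your record."),  # claim with no tool call behind it
    Turn("Let me check that for you.", tool_calls=["lookup_booking"]),
]

print(find_tool_hallucinations(call_log))  # [1]
```

Running a check like this over calls recorded with background noise and interruptions, rather than clean audio, is what ties it back to the adversarial conditions described above.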

TL;DR

  • What changed: STT→LLM→TTS pipelines are being replaced by models that reason directly on raw audio, preserving tone, hesitation, and interruption signals that transcription destroys2
  • The production frontier: GPT-Realtime-2 (96.6% Big Bench Audio, 96.1% Conversational Dynamics #1, +26pp at Zillow) and Gemini 3.1 Flash Live (90.8% ComplexFuncBench Audio, strong multi-step tool calling) are within 1pp on general benchmarks but diverge on task type56
  • The move: pick your benchmark by failure mode (interruption handling → GPT-Realtime-2; multi-step tools → Gemini 3.1 Flash Live), set reasoning effort at low as your default and tune up only for transactions, and test in adversarial audio conditions before launch
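As a compact restatement of the first two decisions, the sketch below encodes them as plain lookup tables. The dictionary keys and the `default_effort` helper are illustrative names rather than parameters of either vendor's API; the values come from the recommendations in this brief.

```python
# The five reasoning effort levels listed for GPT-Realtime-2, shallowest first.
REASONING_EFFORT_LEVELS = ["minimal", "low", "medium", "high", "xhigh"]

# Primary failure mode -> (benchmark to watch, model leading on it in this brief).
BENCHMARK_FOR_FAILURE_MODE = {
    "interruption_handling": ("Conversational Dynamics", "GPT-Realtime-2"),
    "multi_step_tools": ("ComplexFuncBench Audio", "Gemini 3.1 Flash Live"),
}

# Suggested starting points; tune up only where accuracy outweighs latency.
DEFAULT_EFFORT_BY_FLOW = {
    "conversational": "low",
    "tool_heavy": "medium",
}


def default_effort(flow_type: str) -> str:
    """Look up the suggested starting effort level for a flow type."""
    effort = DEFAULT_EFFORT_BY_FLOW[flow_type]
    if effort not in REASONING_EFFORT_LEVELS:
        raise ValueError(f"unknown effort level: {effort}")
    return effort


benchmark, model = BENCHMARK_FOR_FAILURE_MODE["multi_step_tools"]
print(f"Watch {benchmark}; current leader here: {model}")
print(default_effort("tool_heavy"))  # medium
```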
