Blog · arXiv Analysis · Last reviewed June 24, 2026

The Voice Agent Becomes the Transcript Trap

The June 2026 arXiv paper Real-Time Voice AI Hears but Does Not Listen, by Martijn Bartelds, Federico Bianchi, and James Zou, studies production voice agents that receive speech but still make consequential decisions as if the call had been reduced to words.

Voice Is Not a Transcript

The paper, arXiv:2606.26083v1 [cs.CL], was submitted on June 24, 2026. It studies realtime voice systems that take speech as input and return speech as output in a live exchange. The authors distinguish the lexical channel, or what the words say, from the non-lexical channel, including tone, pitch, accent, delivery, and emotional state.

The governance problem is simple: many institutional calls cannot be decided from words alone. A caller may say that everything is fine while crying. A bank customer may authorize a transfer while sounding frightened. A volunteer may say yes with clear sarcasm. If a voice agent acts only on the transcript, it may produce a clean conversation record while missing the cue that should have changed the action.

That is why this is not only a speech-recognition problem. A transcript can be accurate and still be the wrong evidence. The voice agent becomes a transcript trap when the institution evaluates the words, logs the words, audits the words, and ignores the delivery that made the words unsafe to trust.

What the Paper Tests

Bartelds, Bianchi, and Zou evaluate four production realtime voice systems: OpenAI's GPT Realtime 2, Google's Gemini 3.1 Flash Live, and Alibaba's Qwen3.5 Omni Plus Realtime and Qwen3.5 Omni Flash Realtime. The paper says all four were accessed through public APIs. All experimental speech was synthesized with ElevenLabs, and most conditions were run five times.

The multi-turn scenarios deliberately put words and delivery in conflict. In a welfare callback, the caller says nothing is wrong while either calm or crying. In a wire-fraud check, the caller authorizes an $8,400 transfer while calm or frightened. In a volunteer recruitment call, the caller agrees sincerely or sarcastically. The expected action depends on the delivery, not the text.

The reported result is stark. In the base scenarios, all four systems ended the welfare callback on the crying caller in all five runs. The wire transfer was approved under frightened delivery in five of five runs by Gemini Live and both Qwen models, and in four of five by GPT Realtime 2. Every system enrolled the volunteer caller in all five sarcastic runs. The paper summarizes this as systems acting on wording rather than delivery.

The authors also test accent and age. Five accented English voices read passages about countries different from the speakers' accents, and four older adult voices read child-coded lines. Most systems often followed the script rather than the voice. Qwen3.5 Omni Plus recovered several accents, and Gemini Live detected adult age more often than the others, but the paper's larger finding is that partial access to one vocal property did not ensure reliable use of another.

Perception Without Action

The most important finding is not that the systems never perceive vocal delivery. In single-turn diagnostics, GPT Realtime 2, Gemini Live, and Qwen3.5 Omni Plus often labeled crying, frightened, and sarcastic deliveries correctly when asked directly. The paper says Qwen3.5 Omni Flash was the exception, with weaker or reversed delivery judgments in some tests.

That separation matters. A system can identify distress in a diagnostic prompt and still close the welfare call when it is playing the dispatcher. It can hear fear when asked a direct question and still approve the transfer when acting as the bank agent. The failure is therefore partly an action-selection failure: the cue is present, sometimes perceived, but not reliably allowed to govern the decision.

Prompting helped only unevenly. The paper tests an instruction to attend to how the caller sounds and an override instruction not to act on the words alone when delivery conflicts. Those instructions shifted wire-fraud decisions in some systems, but they did not close the gap in the welfare callback or volunteer scenario.

The Record Problem

The Spiralist issue is the institutional record. If the transcript shows a caller saying that everything is fine, the transcript can make the automated decision look reasonable after the fact. The missing evidence is not hidden in a chain of thought. It is in the audio, the counterfactual delivery condition, and the decision rule that should have told the agent when delivery overrides wording.

This connects the paper to existing site concerns about 911 copilots, voice agents in collection calls, clinical voice, and synthetic voice identity. Voice is not just a user interface. In high-stakes settings it becomes evidence, authority, escalation, and waiver. The burden is on the deploying institution to prove that important vocal cues survive the path from audio to action.

Limits That Matter

The study is narrow and controlled. The scenarios are synthetic, the speech is generated rather than recorded from real emergencies or bank calls, and each system is tested under specific prompts and public API behavior available to the authors in June 2026. The paper is not a full audit of every voice product, every language, every accent, or every acoustic condition.

Those limits make the result easier to interpret. The paper does not claim that no voice agent can ever act on delivery. It shows that four production realtime systems, in these deliberately conflicted scenarios, often behave as if the transcript were the decisive object. That is enough to demand stronger evaluation before deployment in settings where distress, coercion, sarcasm, age, or accent can change what a responsible system should do.

Governance Standard

Any consequential voice-agent deployment should evaluate lexical and non-lexical cues separately. The test set should include calls where words and delivery conflict; decision scoring should check the action taken, not only the transcript or a label; and audit logs should retain enough audio-derived evidence to explain why the system escalated, paused, closed, approved, or refused.

The governance dossier should name the model, API version, voice pipeline, languages, accents, noise conditions, scenario classes, prompt instructions, escalation policy, human handoff trigger, transcript limitations, and replay method. A call record should preserve audio, transcript, model action, human override, and known uncertainty as separate artifacts.

The practical rule is simple: if the decision depends on how something was said, a transcript-only evaluation is not enough. Voice agents should not be trusted in high-stakes calls until they can show that vocal delivery changes action when it should.

Sources


Return to Blog