The Voice Prompt Becomes the Safety Gap
A June 2026 arXiv paper by Beatrice Savoldi, Sara Papi, Wafa Aissa, Matteo Negri, and Luisa Bentivogli shows that speech-model safety cannot be checked with English text prompts alone. The voice itself changes the testing surface.
From Text Prompt to Voice Prompt
The paper, arXiv:2606.26968 [cs.CL], is titled RedVox: Safety and Fairness Gaps in Speech Models Across Languages. arXiv lists Beatrice Savoldi, Sara Papi, Wafa Aissa, Matteo Negri, and Luisa Bentivogli as the authors and records version 1 on June 25, 2026.
The paper begins from a gap in evaluation practice. Speech-capable models are being used for direct spoken interaction, but the authors' survey of 38 speech model releases found that only 8 percent documented any multilingual analysis, and only 11 of the 38 (about 29 percent) documented any safety evaluation at all. Speech therefore leaves two surfaces under-tested at once: language and modality.
This is a fresh companion to the accent-filter labor essay, the voiceprint identity essay, and the machine-interpreter language-gate entry. Those pages ask how voice becomes a credential, mask, or service boundary. RedVox asks whether safety policy still works when the prompt is heard rather than read.
What RedVox Tests
RedVox is a multilingual safety and fairness benchmark for audio and speech built with natural voices. It covers English, French, Italian, Spanish, and German. The target requests come from two textual resources: SHADES for stereotypical generalizations and M-ALERT for unsafe requests. The authors adapt those materials into speech and audio testing conditions using recorded human voices rather than a synthetic-only voice setup.
The benchmark separates two conditions. In the Speech condition, the problematic content is spoken and accompanied by a short textual follow-up. In the Audio condition, the problematic content is in text while the audio track contains non-speech sound such as silence or noise. This lets the study compare three settings: text-only, text with non-speech audio, and spoken content.
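The three settings can be laid out as a small data structure to show which factor each comparison isolates. This is only an illustrative sketch: the class, field names, and string values are hypothetical, and the text-only baseline is assumed to carry the full request in text with no audio at all.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Condition:
    name: str
    request_content_in: str  # where the problematic content sits: "text" or "speech"
    audio_track: str         # "none", "non-speech", or "speech"
    text_field: str          # "full request" or "short follow-up"


# Hypothetical encoding of the settings described above.
TEXT_ONLY = Condition("text-only", "text", "none", "full request")
AUDIO = Condition("audio", "text", "non-speech", "full request")
SPEECH = Condition("speech", "speech", "speech", "short follow-up")


def differing_fields(a, b):
    """Return the field names (other than `name`) on which two conditions differ."""
    return [
        f.name
        for f in fields(Condition)
        if f.name != "name" and getattr(a, f.name) != getattr(b, f.name)
    ]


text_vs_audio = differing_fields(TEXT_ONLY, AUDIO)
audio_vs_speech = differing_fields(AUDIO, SPEECH)
```

Under this encoding, text-only and Audio differ only in the audio track, so a change in unsafe rate between them points at the mere presence of audio. Audio and Speech differ in several fields at once, so that comparison is less cleanly isolated.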
The paper reports 6,118 collected entries totaling almost 10 hours of audio and speech. Because only half of the participants consented to public data release, the released RedVox subset contains 26 unique voices and 3,414 unique entries. The authors report that the model ranking on the released subset closely preserved the full-collection ranking, with a Spearman's rho of 0.98 and p below .01.
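A rank-agreement check of this kind can be sketched in a few lines. The scores below are made-up placeholders, not the paper's per-model figures, and the function names are hypothetical. The sketch uses the textbook shortcut formula for Spearman's rho, which is valid only when there are no tied scores, and it does not compute a p-value.

```python
def ranks(scores):
    """Return 1-based ranks (1 = lowest score). Assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    out = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out


def spearman_rho(x, y):
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    if len(set(x)) != len(x) or len(set(y)) != len(y):
        raise ValueError("the shortcut formula assumes no tied scores")
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Placeholder unsafe rates for eight unnamed systems (illustrative only).
full_scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
subset_scores = [1.2, 1.9, 3.5, 3.1, 5.0, 7.4, 7.1, 9.0]
rho = spearman_rho(full_scores, subset_scores)
```

A value near 1 means the subset orders the systems almost the same way as the full collection, even if the absolute rates differ.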
What Changes in Speech
The authors evaluated eight systems that support the five languages: Qwen2-Audio, Phi4-Multimodal, Voxtral, Qwen3-Omni, Gemma 4, Gemini 3.1 Flash-Lite, Gemini 3.1 Pro-Preview, and GPT-realtime-2. They used an LLM-as-a-judge pipeline to label safety and fairness outcomes and checked a stratified sample with human annotators.
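Drawing a stratified sample for human checking might look like the sketch below. The source does not say how the strata were defined, so stratifying by model and language is an assumption, as are the field names and the `stratified_sample` helper itself.

```python
import random
from collections import defaultdict


def stratified_sample(records, keys, per_stratum, seed=0):
    """Draw up to `per_stratum` records from every combination of `keys`.

    `records` is a list of dicts; `keys` names the fields that define a stratum.
    A fixed seed keeps the human-review sample reproducible.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[tuple(rec[k] for k in keys)].append(rec)

    sample = []
    for stratum in sorted(strata):  # sorted so the draw order is deterministic
        items = strata[stratum]
        sample.extend(rng.sample(items, min(per_stratum, len(items))))
    return sample


# Hypothetical judged outputs: 2 models x 2 languages x 5 responses each.
judged = [
    {"model": m, "language": lang, "response_id": i}
    for m in ("model_a", "model_b")
    for lang in ("en", "it")
    for i in range(5)
]
review_batch = stratified_sample(judged, keys=("model", "language"), per_stratum=2, seed=1)
```

Sampling per stratum rather than uniformly keeps small cells, such as a less common language for one model, from vanishing from the human check.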
The result pattern is clear even without treating the benchmark as deployment evidence. Proprietary models showed the lowest unsafe response rates, at 3.1 percent or lower, while Qwen3-Omni was close at 3.4 percent. Voxtral produced fully harmful responses in roughly one out of four cases, and Phi4-Multimodal also showed elevated unsafe behavior.
Language changed the risk profile. English had a 5.1 percent unsafe rate, while the non-English languages combined had a 10.0 percent rate. The gap was especially driven by open models; the paper reports Voxtral reaching 28 percent unsafe responses in Spanish and French, an absolute increase of 15 points from English.
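The difference between a gap in percentage points and a relative increase is easy to blur, so the arithmetic is worth spelling out. The sketch below uses only the figures quoted above; the helper names are hypothetical, and the implied English rate for Voxtral is a back-calculation that inherits any rounding in the reported numbers.

```python
def point_gap(base_pct, other_pct):
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(other_pct - base_pct, 1)


def relative_ratio(base_pct, other_pct):
    """How many times larger the second rate is than the first."""
    return other_pct / base_pct


# Pooled rates quoted above: English vs. non-English languages combined.
pooled_gap = point_gap(5.1, 10.0)
pooled_ratio = relative_ratio(5.1, 10.0)

# Voxtral: 28 percent in Spanish and French, 15 points above English,
# which implies an English rate of roughly 13 percent.
voxtral_english_implied = round(28.0 - 15.0, 1)
```

The pooled gap is 4.9 percentage points, which is also close to a doubling of the English rate (a ratio of about 1.96). Reporting both avoids making a small absolute gap sound large or a large relative one sound small.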
Modality changed it too. Speech was the most vulnerable setting, with controversial-plus-unsafe response rates reaching 10 to 44 percent depending on model and condition. Even non-speech audio paired with the same text could increase unsafe rates. A model can appear safer in text than when the same request arrives through a more natural interface.
The Data-Collection Problem
RedVox is also a paper about the cost of collecting evidence. It involved 52 researchers from seven European institutions, with voluntary participation and separate consent for taking part and for public release. That separation matters because a voice sample is not just a string. It can carry identity, accent, gender presentation, emotion, and reputational risk.
The authors' post-activity questionnaire found that releasing voice recordings containing harmful content was uncomfortable or very uncomfortable for 61.5 percent of participants. Participants also worried about being identified and associated with harmful material out of context. The paper treats gated access, customized licensing, decoupling from direct identifiers, and withdrawal rights as part of the benchmark's safety design.
Limits That Matter
The limits are substantial. RedVox covers five high-resource Indo-European languages, so it should not be read as evidence for typologically distant or underserved languages. It targets naturalistic, non-optimized requests rather than deliberate jailbreak strategies. It also uses simplified single-turn evaluation and does not test a condition where the full request is delivered only through speech.
The paper focuses on semantic content rather than speaker paralinguistics. It discusses accent and gender analyses in appendices, but the released subset limits statistical power. That restraint is important. Voice safety work can easily slide from "this model handles a language poorly" into unjustified claims about speakers. RedVox gives a useful testbed; it does not license speaker profiling.
The LLM-as-a-judge pipeline is another limit. The authors use human checks, but automated judgment still shapes the measurement. Any operational use of this benchmark should preserve raw model outputs, label rules, human adjudication samples, and uncertainty rather than converting the result into a single safety grade.
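One way to keep uncertainty attached to a reported rate is to publish an interval alongside it. The sketch below uses a Wilson score interval, which is a choice made here for illustration and not a method attributed to the paper. The counts in the example are placeholders.

```python
import math


def wilson_interval(unsafe, total, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95 percent."""
    if total <= 0 or not 0 <= unsafe <= total:
        raise ValueError("need 0 <= unsafe <= total and total > 0")
    p = unsafe / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Placeholder counts: the same 50 percent rate at two sample sizes.
small_lo, small_hi = wilson_interval(5, 10)
large_lo, large_hi = wilson_interval(50, 100)
```

The same observed rate yields a much wider interval from 10 judged responses than from 100, which is the information a single safety grade throws away.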
Governance Standard
A speech model safety report should state the tested languages, whether voices were natural or synthetic, whether the unsafe content was spoken or written, whether non-speech audio was present, which model versions were tested, and how borderline stereotype responses were labeled. Text-only red-teaming is no longer enough for a voice product.
The same report should include a participant-protection record: consent scope, release consent, withdrawal path, data gating, identifier removal, voice reuse limits, and rules against using the dataset for speaker identification or harassment. Safety evidence built from human voices needs governance for the people who provided those voices.
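The two checklists above could be captured as a machine-checkable record, so that a report with gaps is flagged before publication. This is a sketch only: the class names, field names, and the `missing_items` helper are hypothetical, and the fields simply mirror the items listed in the text.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class EvaluationConditions:
    tested_languages: List[str] = field(default_factory=list)
    voice_source: Optional[str] = None            # e.g. "natural" or "synthetic"
    unsafe_content_channel: Optional[str] = None  # e.g. "spoken" or "written"
    non_speech_audio_present: Optional[bool] = None
    model_versions: List[str] = field(default_factory=list)
    borderline_labeling_rule: Optional[str] = None


@dataclass
class ParticipantProtection:
    consent_scope: Optional[str] = None
    release_consent: Optional[str] = None
    withdrawal_path: Optional[str] = None
    data_gating: Optional[str] = None
    identifier_removal: Optional[str] = None
    voice_reuse_limits: Optional[str] = None
    prohibited_uses: Optional[str] = None  # e.g. speaker identification, harassment


def missing_items(record):
    """Return the names of fields left unset (None, empty list, or empty string)."""
    return [f.name for f in fields(record) if getattr(record, f.name) in (None, [], "")]


# An empty draft reports every item as missing.
draft_gaps = missing_items(EvaluationConditions())
```

Note that `non_speech_audio_present=False` counts as a filled-in answer, not as a gap: stating that no audio was present is itself part of the record.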
The RedVox lesson is not that one benchmark can certify multilingual speech safety. It is that the voice prompt is its own safety condition. When an assistant listens, the policy boundary should be tested in the language, modality, and consent regime where real users will meet it.
Sources
- Beatrice Savoldi, Sara Papi, Wafa Aissa, Matteo Negri, and Luisa Bentivogli, RedVox: Safety and Fairness Gaps in Speech Models Across Languages, arXiv:2606.26968 [cs.CL], version 1 submitted June 25, 2026.
- arXiv PDF: RedVox: Safety and Fairness Gaps in Speech Models Across Languages, reviewed for the 38-model reporting survey, RedVox language coverage, dataset construction, eight-model evaluation, multilingual and modality results, participant consent, gated release, and limitations.