Wiki · Concept · Last reviewed June 25, 2026

Web Speech API

The Web Speech API is the browser-facing JavaScript interface for speech recognition and speech synthesis in web applications.

Category: Concept
Published: June 25, 2026
Modified: June 25, 2026
Last reviewed: June 25, 2026
Tags: browser API, speech recognition, speech synthesis, voice agents, accessibility, surveillance

Definition

The Web Speech API defines browser interfaces for incorporating speech recognition and speech synthesis into web pages. The W3C-hosted specification describes a JavaScript API for generating text-to-speech output and for using speech recognition as input for forms, dictation, and control flows.

This page is about the browser API, not about voice assistants as products, large-vocabulary speech-recognition research, robocall authentication, or AI voice fraud. Those topics overlap, but the Web Speech API is narrower: it is an application interface for turning speech into input and text into audible output inside a web context.

Two Halves

MDN summarizes the API as two parts: SpeechRecognition for asynchronous speech recognition and SpeechSynthesis for text-to-speech. Recognition receives speech from an audio source, usually the microphone or, where supported, an audio track, and returns recognition results. Synthesis uses SpeechSynthesis, SpeechSynthesisVoice, and SpeechSynthesisUtterance objects to speak text through an available speech service.

The specification is intentionally agnostic about implementation. It can support server-based recognition or client-side and embedded recognition. That matters for governance because the same API call may mean very different data flows depending on browser, operating system, language pack, and settings. MDN notes that in some browsers, including Chrome, speech recognition can involve a server-based recognition engine, so audio may be sent to a web service and recognition may not work offline.

The two halves also have different deployment maturity. As reviewed on June 25, 2026, MDN marked SpeechRecognition as limited availability because it does not work in some widely used browsers, while SpeechSynthesis was marked Baseline and widely available. A serious implementation should therefore treat recognition and synthesis as separate capabilities, not one uniform "voice support" feature.
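One way to keep the two capabilities separate in code is to detect them independently. The following is a minimal sketch, not a definitive implementation. The function name detectSpeechCapabilities and the SpeechCapabilities shape are hypothetical; the global names SpeechRecognition, webkitSpeechRecognition (the prefixed form used by some browsers), and speechSynthesis are the ones browsers expose. The global object is passed in as a parameter so the check can be exercised outside a browser.

```typescript
// Capability report that keeps the two halves of the API separate.
interface SpeechCapabilities {
  recognition: boolean; // a SpeechRecognition constructor (possibly prefixed) is exposed
  synthesis: boolean;   // a speechSynthesis object is exposed
}

// `scope` is the global object to inspect (normally `globalThis` in a page).
// Taking it as a parameter is an illustrative choice for testability.
function detectSpeechCapabilities(scope: object): SpeechCapabilities {
  const g = scope as Record<string, unknown>;
  return {
    recognition:
      typeof g["SpeechRecognition"] === "function" ||
      typeof g["webkitSpeechRecognition"] === "function",
    synthesis:
      typeof g["speechSynthesis"] === "object" && g["speechSynthesis"] !== null,
  };
}

// In a page: const caps = detectSpeechCapabilities(globalThis);
```

Presence of an interface only shows that the browser exposes it. It says nothing about language coverage, accuracy, or whether recognition runs locally or on a server.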

Agent Context

Voice is a tempting interface for agents because it lowers friction. A user can dictate a command, ask for a summary, approve a task while away from the keyboard, or receive spoken status from a browser assistant. That makes speech useful for accessibility, mobility, hands-free work, language learning, and conversational support.

The same convenience increases exposure. A browser agent that can listen, transcribe, and speak sits close to identity, disability access, children, accents, home life, workplace monitoring, and medical or financial tasks. A transcript can become a prompt, a search query, a training artifact, an audit record, or a customer-service evidence file. The governance issue is not whether speech is natural. It is whether the interface makes capture, retention, and downstream action visible.

Governance Use

For agentic browsers and web applications, speech should be treated as a privileged input mode. Starting recognition should require a user-understandable action, visible state, clear stop behavior, and a path to inspect or discard the transcript before consequential actions are taken.
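The requirements above can be sketched as a small state machine that sits in front of the recognition calls. This is an illustrative model under stated assumptions, not part of the Web Speech API: the state names, event names, and helper functions are all hypothetical. It assumes that listening starts only on an explicit user event, that stopping always leads to a review step, and that nothing may act on the transcript until the user confirms it.

```typescript
// Hypothetical session states for a governed recognition flow.
type SessionState = "idle" | "listening" | "review" | "confirmed" | "discarded";

type SessionEvent =
  | { type: "userStart" } // explicit user gesture, such as pressing a button
  | { type: "stop" }      // user or application stops listening
  | { type: "confirm" }   // user accepts the transcript shown to them
  | { type: "discard" };  // user throws the transcript away

// Pure transition function: events that are not valid in a state are ignored.
function nextState(state: SessionState, event: SessionEvent): SessionState {
  switch (state) {
    case "idle":
      return event.type === "userStart" ? "listening" : state;
    case "listening":
      if (event.type === "stop") return "review";
      if (event.type === "discard") return "discarded";
      return state;
    case "review":
      if (event.type === "confirm") return "confirmed";
      if (event.type === "discard") return "discarded";
      return state;
    default:
      return state; // "confirmed" and "discarded" are terminal
  }
}

// Drives the visible "listening" indicator.
function isListening(state: SessionState): boolean {
  return state === "listening";
}

// Consequential actions are allowed only after explicit confirmation.
function mayAct(state: SessionState): boolean {
  return state === "confirmed";
}
```

In a page, the transition into "listening" would be where the application calls start() on a recognition object, and the transition out of it where it calls stop(). Keeping the transitions pure makes the visible state and the review step easy to test separately from the browser.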

Speech synthesis needs boundaries too. Spoken output can influence urgency and authority differently from text. A system voice should not obscure whether a statement came from a model, a human, a policy rule, or stored content. When an agent reads out a recommendation, denial, price, medical instruction, or workplace directive, the record should preserve the underlying source and not only the audio event.
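A minimal sketch of source-preserving output follows. It assumes a hypothetical SpokenStatement shape, a hypothetical sourceRef identifier pointing at the underlying record, and an illustrative choice to speak a short source label before the text. None of these names come from the Web Speech API.

```typescript
// Where a spoken statement originated. The four kinds mirror the text above.
type StatementSource = "model" | "human" | "policy" | "stored";

interface SpokenStatement {
  text: string;
  source: StatementSource;
  sourceRef: string; // hypothetical identifier of the underlying record
}

interface SpeechLogEntry {
  spokenText: string;
  source: StatementSource;
  sourceRef: string;
  spokenAt: string; // ISO 8601 timestamp
}

// Illustrative spoken labels; real wording is a product decision.
const SOURCE_LABEL: Record<StatementSource, string> = {
  model: "Generated suggestion",
  human: "Message from a person",
  policy: "Policy rule",
  stored: "Saved content",
};

// Returns the text to hand to the speech service plus a log entry that keeps
// the source alongside the audio event.
function prepareSpokenOutput(
  statement: SpokenStatement,
  now: Date
): { utteranceText: string; log: SpeechLogEntry } {
  const utteranceText = `${SOURCE_LABEL[statement.source]}: ${statement.text}`;
  return {
    utteranceText,
    log: {
      spokenText: utteranceText,
      source: statement.source,
      sourceRef: statement.sourceRef,
      spokenAt: now.toISOString(),
    },
  };
}
```

In a browser, utteranceText would typically be wrapped in a SpeechSynthesisUtterance and passed to speechSynthesis.speak(), while the log entry is written wherever the application keeps its records.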

Permissions Policy adds another layer for newer on-device recognition features. MDN documents an on-device-speech-recognition directive that can control access to local recognition installation and availability checks. That is a reminder that voice features are not only user-interface features; they are delegation points between top-level pages, embedded frames, browsers, and operating-system services.
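As a sketch of how a site might express that delegation, the helper below builds a Permissions-Policy header value using the standard allowlist syntax. The on-device-speech-recognition directive name is the one MDN documents; the helper names and the embedded origin https://voice.example are hypothetical, and browser support for the directive should be checked before relying on it.

```typescript
// An allowlist for one directive: the page's own origin and/or named origins.
interface Allowlist {
  self: boolean;
  origins: string[];
}

// Formats one allowlist, e.g. (self "https://voice.example"), or () for none.
function formatAllowlist(allow: Allowlist): string {
  const items = [
    ...(allow.self ? ["self"] : []),
    ...allow.origins.map((origin) => `"${origin}"`),
  ];
  return `(${items.join(" ")})`;
}

// Joins directives into a single Permissions-Policy header value.
function permissionsPolicyHeader(directives: Array<[string, Allowlist]>): string {
  return directives
    .map(([name, allow]) => `${name}=${formatAllowlist(allow)}`)
    .join(", ");
}

// Example: allow on-device recognition for the top-level page and one
// hypothetical embedded origin.
const headerValue = permissionsPolicyHeader([
  ["on-device-speech-recognition", { self: true, origins: ["https://voice.example"] }],
]);
```

An empty allowlist, written as (), disables the feature for the page and every embedded frame, which may be the safer default for pages that never use voice.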

Limits

The Web Speech API is not a full audio-capture, transcription-quality, consent, or safety framework. It does not guarantee language coverage, equal accuracy across speakers, local processing, offline operation, or browser parity. It also does not decide whether a spoken command is authorized, whether a transcript is sensitive, or whether synthesized speech is appropriate for a given audience.

It should not be used as the sole evidence that a person intended an action. Speech recognition can mishear, segment poorly, ignore context, or confuse background speech. Agent systems should separate transcript capture from authorization, especially for purchases, account changes, employment decisions, medical triage, and legal or financial workflows.
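The separation of capture from authorization can be sketched as a gate that never treats the transcript as proof of intent for sensitive actions. Everything here is a hypothetical illustration: the category names follow the list above, and the rule that a confirmation must restate the exact proposed action through a non-speech channel is an assumption, not a requirement of the API.

```typescript
type ActionCategory =
  | "purchase"
  | "accountChange"
  | "employment"
  | "medicalTriage"
  | "legalFinancial"
  | "lowRisk";

interface ProposedAction {
  category: ActionCategory;
  description: string; // what the system intends to do, shown to the user
  transcript: string;  // the recognized speech the action was derived from
}

// A confirmation collected through a channel other than the transcript.
interface Confirmation {
  method: "click" | "typed" | "passkey";
  confirmedDescription: string; // the description the user actually approved
}

interface Decision {
  allowed: boolean;
  reason: string;
}

const SENSITIVE = new Set<ActionCategory>([
  "purchase",
  "accountChange",
  "employment",
  "medicalTriage",
  "legalFinancial",
]);

function authorize(action: ProposedAction, confirmation?: Confirmation): Decision {
  if (!SENSITIVE.has(action.category)) {
    return { allowed: true, reason: "low-risk action; transcript accepted as input" };
  }
  if (confirmation === undefined) {
    return { allowed: false, reason: "sensitive action needs confirmation outside the transcript" };
  }
  if (confirmation.confirmedDescription !== action.description) {
    return { allowed: false, reason: "confirmation does not match the proposed action" };
  }
  return { allowed: true, reason: `confirmed via ${confirmation.method}` };
}
```

The point of the design is that a misheard or overheard phrase can at most propose a sensitive action. It cannot complete one.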

Minimum Evidence Record

For governed use, record the origin, frame context, permission state, recognition start and stop events, language setting, local-versus-remote processing claim if exposed, transcript version shown to the user, user edits, action derived from the transcript, synthesized-output source, and retention rule. Avoid storing raw audio unless there is a specific, disclosed need and a deletion path.
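The record described above could be typed roughly as follows. This is a sketch under stated assumptions: all field names are hypothetical, the processing claim is optional because the text notes it may not be exposed, and raw audio is deliberately left out of the shape.

```typescript
interface EvidenceRecord {
  origin: string;                      // origin of the page that started recognition
  frameContext: "top-level" | "embedded";
  permissionState: string;             // e.g. "granted", "denied", "prompt"
  recognitionStartedAt: string;        // ISO 8601 timestamp
  recognitionStoppedAt: string;        // ISO 8601 timestamp
  language: string;                    // language setting, e.g. "en-US"
  processingClaim?: "local" | "remote"; // only if the platform exposes it
  transcriptShown: string;             // transcript version shown to the user
  userEdits: string[];                 // edits the user made before confirming
  derivedAction: string;               // action derived from the transcript
  synthesisSource: string;             // source of any synthesized output
  retentionRule: string;               // how long the record is kept and why
}

// Every field except the optional processing claim is treated as required.
const REQUIRED_FIELDS: Array<keyof EvidenceRecord> = [
  "origin",
  "frameContext",
  "permissionState",
  "recognitionStartedAt",
  "recognitionStoppedAt",
  "language",
  "transcriptShown",
  "userEdits",
  "derivedAction",
  "synthesisSource",
  "retentionRule",
];

// Lists the required fields that a draft record has not filled in yet.
function missingFields(record: Partial<EvidenceRecord>): Array<keyof EvidenceRecord> {
  return REQUIRED_FIELDS.filter((field) => record[field] === undefined);
}
```

A check like missingFields can run before a derived action is executed, so an incomplete record blocks the action instead of being discovered during a later audit.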

Source Discipline

Use the Web Speech API specification for the interface model, MDN for implementation-facing browser documentation, and W3C Community Group records for standards status context. Do not infer speech-recognition accuracy, accessibility adequacy, privacy compliance, or agent safety from API availability alone.

Spiralist Reading

Spiralism reads voice as intimacy at the interface. Speech feels closer to the body than a form field. A browser that listens and speaks can make computing more humane, especially for people excluded by keyboard-first design.

But intimacy can become extraction. The ethical discipline is to keep the voice from becoming a hidden contract. The user should know when the machine is listening, what it heard, what it will do with the words, and whether silence is still possible.

Sources

W3C Community Group, Web Speech API.
MDN Web Docs, Web Speech API.
MDN Web Docs, SpeechRecognition.
MDN Web Docs, SpeechSynthesis.
W3C, Speech API Community Group.
