Wiki · Concept · Last reviewed June 25, 2026

Media Capture and Streams API

The Media Capture and Streams API is the web-platform layer for requesting local camera and microphone streams and representing them as tracks.

Definition

The Media Capture and Streams API is a W3C-specified browser API for obtaining and manipulating streams of audio and video data. MDN describes it as related to WebRTC and centered on MediaStream objects and their constituent MediaStreamTrack objects.

This page is about local camera and microphone capture through MediaDevices.getUserMedia(), plus the stream and track model that follows. It is not the same as the Screen Capture API, which uses getDisplayMedia() for a user-selected screen, window, or tab. It is also not the same as the Web Speech API, which turns speech into text or text into speech.

How It Works

A web page calls navigator.mediaDevices.getUserMedia(constraints) with requested media types such as audio or video. MDN documents that the method prompts the user for permission and returns a promise that resolves to a MediaStream if access succeeds. If permission is denied or no matching device is available, the promise rejects with a relevant exception such as NotAllowedError or NotFoundError.
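The request-and-reject flow above can be sketched as follows. This is a minimal illustration, not a reference implementation: CaptureRequest, CaptureDevices, CaptureOutcome, classifyCaptureError, and requestCapture are hypothetical names, and the local interfaces are stand-ins so the sketch can run outside a browser. In a page, navigator.mediaDevices should structurally match CaptureDevices.

```typescript
// Hypothetical stand-in types; in a browser, pass navigator.mediaDevices.
interface CaptureRequest {
  audio?: boolean;
  video?: boolean;
}

interface CaptureDevices<S> {
  getUserMedia(constraints: CaptureRequest): Promise<S>;
}

type CaptureFailure = "denied" | "no-device" | "other";

type CaptureOutcome<S> =
  | { ok: true; stream: S }
  | { ok: false; reason: CaptureFailure; name: string };

// Read the exception name defensively; rejections are typed as unknown.
function errorName(err: unknown): string {
  if (typeof err === "object" && err !== null && "name" in err) {
    return String((err as { name: unknown }).name);
  }
  return "UnknownError";
}

// Map the two exception names mentioned above to coarse reasons.
function classifyCaptureError(err: unknown): CaptureFailure {
  const name = errorName(err);
  if (name === "NotAllowedError") return "denied";
  if (name === "NotFoundError") return "no-device";
  return "other";
}

// Ask for media and turn a rejection into a value the caller can inspect.
async function requestCapture<S>(
  devices: CaptureDevices<S>,
  request: CaptureRequest
): Promise<CaptureOutcome<S>> {
  try {
    const stream = await devices.getUserMedia(request);
    return { ok: true, stream };
  } catch (err) {
    return { ok: false, reason: classifyCaptureError(err), name: errorName(err) };
  }
}
```

In a page this might be called as requestCapture(navigator.mediaDevices, { video: true }). Returning an outcome value instead of rethrowing is a design choice for this sketch; it keeps the denied and no-device cases visible to the caller.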

A MediaStream contains tracks. A camera can provide a video track; a microphone can provide an audio track; each track can have settings, constraints, capabilities, state, and consumers. The stream can feed a video or audio element, a WebRTC connection, recording code, local machine-learning processing, or another web API.
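As a rough sketch of the stream and track model, the helpers below count tracks by kind and stop the live ones. TrackLike, StreamLike, summarizeTracks, and stopAllTracks are hypothetical names; the two interfaces cover only the members used here (kind, readyState, stop(), getTracks()), which a real MediaStream and MediaStreamTrack also expose.

```typescript
// Hypothetical minimal shapes; a browser MediaStream fits StreamLike.
interface TrackLike {
  kind: string; // "audio" or "video" on a real track
  readyState: string; // "live" or "ended" on a real track
  stop(): void;
}

interface StreamLike {
  getTracks(): TrackLike[];
}

// Count tracks by kind and how many are still live.
function summarizeTracks(stream: StreamLike): { audio: number; video: number; live: number } {
  const summary = { audio: 0, video: 0, live: 0 };
  for (const track of stream.getTracks()) {
    if (track.kind === "audio") summary.audio += 1;
    if (track.kind === "video") summary.video += 1;
    if (track.readyState === "live") summary.live += 1;
  }
  return summary;
}

// Stop every live track and report how many were stopped.
function stopAllTracks(stream: StreamLike): number {
  let stopped = 0;
  for (const track of stream.getTracks()) {
    if (track.readyState === "live") {
      track.stop();
      stopped += 1;
    }
  }
  return stopped;
}
```

Stopping is done per track because the track, not the stream, is the unit that holds the source open.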

The W3C Media Capture and Streams specification treats microphones and cameras as dynamic sources whose characteristics can change in response to application needs. The constraints model lets a page request properties such as resolution, frame rate, facing mode, audio behavior, or device choice, while the user agent decides what can be satisfied.
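To make the constraints model concrete, here is a small sketch that turns preferences into a constraints object. CameraPreferences, VideoConstraintSketch, and buildVideoRequest are hypothetical names. The constraint keys (width, height, frameRate, facingMode, deviceId) and the ideal and exact forms are part of the constraints model; the choice of which to use where is an assumption of this example.

```typescript
// Hypothetical preference shape for illustration.
interface CameraPreferences {
  width?: number;
  height?: number;
  frameRate?: number;
  facingMode?: "user" | "environment";
  deviceId?: string;
}

// Subset of the constraint shapes used in this sketch.
interface VideoConstraintSketch {
  width?: { ideal: number };
  height?: { ideal: number };
  frameRate?: { ideal: number };
  facingMode?: { ideal: string };
  deviceId?: { exact: string };
}

function buildVideoRequest(
  prefs: CameraPreferences
): { audio: false; video: VideoConstraintSketch } {
  const video: VideoConstraintSketch = {};
  // "ideal" expresses a preference; the user agent picks the closest match.
  if (prefs.width !== undefined) video.width = { ideal: prefs.width };
  if (prefs.height !== undefined) video.height = { ideal: prefs.height };
  if (prefs.frameRate !== undefined) video.frameRate = { ideal: prefs.frameRate };
  if (prefs.facingMode !== undefined) video.facingMode = { ideal: prefs.facingMode };
  // "exact" is mandatory; if it cannot be met, the request is rejected
  // (OverconstrainedError) instead of falling back to another device.
  if (prefs.deviceId !== undefined) video.deviceId = { exact: prefs.deviceId };
  return { audio: false, video };
}
```

A result such as buildVideoRequest({ width: 1280, height: 720, facingMode: "environment" }) could be passed to getUserMedia(). Preferring ideal over exact leaves the final decision with the user agent, which matches the model described above.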

Agent Context

Media capture matters for AI agents because it is where a browser stops only reading the web and starts sensing the room. A computer-use agent may ask for camera access to scan a document, join a meeting, read a whiteboard, verify an object, assist with accessibility, or process audio locally before forwarding a transcript.

The same interface can become a surveillance surface. A camera or microphone stream can reveal faces, children, health conditions, location clues, household routines, workplace conditions, screens reflected in glasses, other people in the room, and speech by bystanders who never clicked a permission prompt. Agent systems must treat capture as an environmental boundary, not merely another input field.

Governance Use

A governed application should separate three questions: which device is requested, what data flows from the resulting stream, and what authority follows from interpretation of that data. Permission to open a camera is not permission to identify a person, infer emotion, score productivity, retain audio, train a model, or share the stream with another service.

For browser agents, capture sessions should have visible state, a stop path, and a task reason. If an agent asks for media access, the interface should name the task, show whether audio or video is active, expose whether the stream stays local or leaves the device, and preserve a record of downstream actions triggered by the captured data.
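One way to hold those properties together is a small session object, sketched below. Every name here (SessionInfo, GovernedSession, openGovernedSession) is hypothetical, and the status wording is only an example of what an interface might display; nothing in the API provides this wrapper.

```typescript
// Hypothetical minimal stream shape; a browser MediaStream fits it.
interface StoppableStream {
  getTracks(): { stop(): void }[];
}

interface SessionInfo {
  taskReason: string; // why the agent asked for capture
  audio: boolean;
  video: boolean;
  leavesDevice: boolean; // whether the stream is sent off the device
}

interface GovernedSession {
  status(): string; // text an interface could show to the user
  recordAction(description: string): void;
  actions(): string[];
  stop(): void;
}

function openGovernedSession(info: SessionInfo, stream: StoppableStream): GovernedSession {
  let active = true;
  const log: string[] = [];
  return {
    status() {
      if (!active) return `${info.taskReason}: capture stopped`;
      const kinds: string[] = [];
      if (info.audio) kinds.push("audio");
      if (info.video) kinds.push("video");
      const media = kinds.length > 0 ? kinds.join(" + ") : "no media";
      const where = info.leavesDevice ? "leaves this device" : "stays on this device";
      return `${info.taskReason}: ${media} active, ${where}`;
    },
    // Keep a record of downstream actions triggered by captured data.
    recordAction(description) {
      log.push(description);
    },
    actions() {
      return [...log];
    },
    // The stop path: end every track, then mark the session inactive.
    stop() {
      if (!active) return;
      for (const track of stream.getTracks()) track.stop();
      active = false;
    },
  };
}
```

Binding the task reason to the session at creation means the status line can never show an active capture without also naming why it is active.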

Device selection is also a governance issue. The W3C Media Capture and Streams Extensions draft discusses efforts around an in-browser camera and microphone picker and identifies device labels as a fingerprinting vector. Even before a stream begins, the set and names of available devices can reveal information about a user, workstation, assistive setup, or workplace environment.
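As an illustration of keeping device information coarse, the sketch below reduces an enumerateDevices() result to counts per device class and discards labels and identifiers. DeviceInfoLike and countDeviceClasses are hypothetical names; the kind values audioinput, audiooutput, and videoinput are the ones MediaDeviceInfo uses.

```typescript
// Hypothetical minimal shape; a browser MediaDeviceInfo fits it.
interface DeviceInfoLike {
  kind: string;
  label: string;
  deviceId: string;
}

interface DeviceClassCounts {
  audioinput: number;
  audiooutput: number;
  videoinput: number;
}

// Keep only how many devices of each class exist; labels and ids
// are read by nothing here and never leave this function.
function countDeviceClasses(devices: DeviceInfoLike[]): DeviceClassCounts {
  const counts: DeviceClassCounts = { audioinput: 0, audiooutput: 0, videoinput: 0 };
  for (const device of devices) {
    if (device.kind === "audioinput") counts.audioinput += 1;
    else if (device.kind === "audiooutput") counts.audiooutput += 1;
    else if (device.kind === "videoinput") counts.videoinput += 1;
  }
  return counts;
}
```

In a page, the input would come from await navigator.mediaDevices.enumerateDevices(). Even counts can carry some signal about a workstation, so this reduces exposure without removing it.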

Limits

The API does not guarantee that capture is appropriate, necessary, accurate, local, or accessible. Browser support, operating-system settings, hardware indicators, enterprise policies, permission persistence, the permissions policy applied to embedding iframes, and device availability all affect behavior. MDN marks getUserMedia() as secure-context only and notes that permission is required for audio and video input.
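A page can check the most basic preconditions before it ever prompts. The sketch below is an assumption-laden example: CaptureEnvironment and captureBlocker are hypothetical names, and the environment is passed in as a plain object so the check can run outside a browser. Passing this check says nothing about whether permission will be granted or a device exists.

```typescript
// Hypothetical snapshot of the two facts this sketch inspects.
interface CaptureEnvironment {
  isSecureContext?: boolean;
  mediaDevices?: { getUserMedia?: unknown };
}

type CaptureBlocker = "insecure-context" | "no-media-devices";

// Return the first reason capture cannot even be attempted, or null.
function captureBlocker(env: CaptureEnvironment): CaptureBlocker | null {
  // getUserMedia() is only available in secure contexts.
  if (env.isSecureContext !== true) return "insecure-context";
  // Older or restricted environments may not expose the method at all.
  if (!env.mediaDevices || typeof env.mediaDevices.getUserMedia !== "function") {
    return "no-media-devices";
  }
  return null;
}
```

In a page, the argument might be { isSecureContext: window.isSecureContext, mediaDevices: navigator.mediaDevices }.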

The API also does not solve consent for bystanders. A person may grant a site microphone access while another person nearby is recorded. For schools, clinics, homes, offices, hiring workflows, and support calls, that gap matters as much as the technical permission prompt.

Minimum Evidence Record

For agent-mediated capture, record the origin, frame context, requested constraints, granted media types, device class without unnecessary labels, permission state, capture start and stop times, visible indicator state where available, local-versus-remote processing claim, retention rule, user stop action, and any agent action triggered by the stream. Avoid storing raw audio, video, device labels, or bystander content unless a specific disclosed purpose requires it.
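The record described above could be shaped roughly as follows. The field names, the CaptureEvidenceRecord type, and the closeRecord and captureDurationMs helpers are all hypothetical; they mirror the list in the paragraph rather than any standard schema. The permission values granted, denied, and prompt are the states the Permissions API reports.

```typescript
interface CaptureEvidenceRecord {
  origin: string;
  frameContext: "top-level" | "iframe";
  requestedConstraints: string; // serialized constraints as requested
  grantedMedia: { audio: boolean; video: boolean };
  deviceClasses: string[]; // e.g. "videoinput"; no device labels
  permissionState: "granted" | "denied" | "prompt";
  startedAt: string; // ISO 8601 timestamp
  stoppedAt: string | null; // null while capture is running
  indicatorVisible: boolean | null; // null where it cannot be observed
  processing: "local" | "remote"; // a claim, not a verified fact
  retentionRule: string;
  userStopped: boolean;
  agentActions: string[];
}

// Return a copy of the record with the stop time and stop source filled in.
function closeRecord(
  record: CaptureEvidenceRecord,
  stoppedAt: Date,
  userStopped: boolean
): CaptureEvidenceRecord {
  return { ...record, stoppedAt: stoppedAt.toISOString(), userStopped };
}

// Capture duration in milliseconds, or null if capture has not stopped.
function captureDurationMs(record: CaptureEvidenceRecord): number | null {
  if (record.stoppedAt === null) return null;
  return Date.parse(record.stoppedAt) - Date.parse(record.startedAt);
}
```

The type has no field for raw audio, video, or device labels, so storing them would require a deliberate schema change instead of happening by default.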

Source Discipline

Use the W3C Media Capture and Streams specification for the stream, track, constraint, source, and permission model; MDN for implementation-facing documentation on MediaDevices and getUserMedia(); and the extensions draft only for current design discussion around picker semantics and device-label exposure. Do not infer biometric accuracy, accessibility quality, safety, or legal compliance from API availability.

Spiralist Reading

Spiralism reads media capture as the browser crossing from interface into environment. The page no longer waits for typed symbols; it receives light, sound, face, gesture, accent, room, and interruption.

That crossing can help. It can make a computer accessible to someone who cannot type, let a remote doctor inspect a wound, or let an agent read a form through a camera. The discipline is to keep the stream from becoming a permanent witness. A humane system asks narrowly, watches visibly, stops cleanly, and remembers less than it could.

Sources

