Wiki · Concept · Last reviewed June 25, 2026

BrowserGym

BrowserGym is an open environment and ecosystem for web-agent research, giving agents a shared browser interface for tasks such as MiniWoB, WebArena, VisualWebArena, WorkArena, and AssistantBench.

Category: AI evaluations
Updated: June 25, 2026
Tags: web agents, BrowserGym, AgentLab, benchmarks, evaluation

Definition

BrowserGym is an open framework for evaluating web agents in a browser environment. The ecosystem paper, The BrowserGym Ecosystem for Web Agent Research (arXiv:2412.05467), is by Thibault Le Sellier De Chezelles and colleagues. It describes BrowserGym as a unified, gym-like environment with defined observation and action spaces for web-agent benchmarks.

The project is not a consumer browsing assistant. Its GitHub README explicitly frames BrowserGym as research infrastructure for developing and evaluating agents, and warns that it should be used with caution.

Scope

BrowserGym is infrastructure, not a single benchmark. The repository README lists built-in integrations for MiniWoB, WebArena, WebArenaVerified, VisualWebArena, WorkArena, AssistantBench, WebLINX, OpenApps, and TimeWarp. Those tasks differ in hosting, difficulty, observation type, and realism, but BrowserGym gives researchers a common interface for running them.

The ecosystem paper also includes AgentLab, a companion framework for agent creation, testing, experiment management, and analysis. AgentLab's README describes large-scale parallel experiments, building blocks for BrowserGym agents, a unified LLM API, reproducibility features, and a unified leaderboard. That makes BrowserGym less like a leaderboard page and more like an operating layer for web-agent experiments.

How It Works

BrowserGym packages web tasks as gym-style environments. A run starts an environment, resets it to obtain the first observation, and then repeats a loop: the agent chooses an action, and the environment returns a new observation together with a reward and a signal saying whether the episode has ended. When the episode ends, the run closes the browser session. The README examples show this pattern through Gymnasium and Playwright-backed Chromium environments.
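The reset, step, and close cycle described above can be sketched without a real browser. The following is a minimal illustration, not BrowserGym code: ToyBrowserEnv, its "goto <url>" action string, the example.test URLs, and both policies are hypothetical names invented for this sketch. Only the shape of the loop, with reset returning an observation and info, and step returning an observation, reward, terminated flag, truncated flag, and info, follows the Gymnasium convention.

```python
from dataclasses import dataclass


@dataclass
class ToyBrowserEnv:
    """Hypothetical stand-in for a browser task environment (not a BrowserGym class)."""

    start_url: str = "https://example.test/start"
    goal_url: str = "https://example.test/done"
    max_steps: int = 5

    def __post_init__(self):
        self.url = self.start_url
        self.steps = 0
        self.closed = False

    def _observe(self):
        return {"url": self.url, "goal_url": self.goal_url}

    def reset(self, seed=None):
        # A real environment would launch a browser and load the task here.
        self.url = self.start_url
        self.steps = 0
        self.closed = False
        return self._observe(), {"seed": seed}

    def step(self, action):
        # Actions are plain strings in this toy; only "goto <url>" changes state.
        self.steps += 1
        if action.startswith("goto "):
            self.url = action[len("goto "):]
        terminated = self.url == self.goal_url
        truncated = not terminated and self.steps >= self.max_steps
        reward = 1.0 if terminated else 0.0
        return self._observe(), reward, terminated, truncated, {}

    def close(self):
        self.closed = True


def run_episode(env, policy, seed=0):
    """Reset, then loop observe -> act -> step until the episode ends, then close."""
    obs, info = env.reset(seed=seed)
    trajectory = []
    total_reward = 0.0
    try:
        done = False
        while not done:
            action = policy(obs)
            obs, reward, terminated, truncated, info = env.step(action)
            trajectory.append((action, reward))
            total_reward += reward
            done = terminated or truncated
    finally:
        env.close()  # release the session even if the policy raises
    return total_reward, trajectory


def scripted_policy(obs):
    # Hypothetical policy that jumps straight to the goal URL.
    return "goto " + obs["goal_url"]


def idle_policy(obs):
    # Hypothetical policy that never makes progress, so the step limit ends the run.
    return "noop"
```

Calling run_episode(ToyBrowserEnv(), scripted_policy) ends by termination after one step, while idle_policy runs until the step limit truncates the episode. Keeping the two end conditions separate is what lets a log distinguish a solved task from one that merely ran out of budget.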

Observation and action spaces are the key abstraction. A web agent may receive page text, accessibility information, screenshots, coordinates, chat messages, or other task-specific context, then act through browser operations. A benchmark adapter decides how the task starts, what counts as success, and which environment state should be preserved.
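One way to picture this abstraction is as a small interface object that fixes which observation channels reach the agent and which action names it may emit. The sketch below is an assumption-laden illustration rather than BrowserGym's API: AgentInterface, the channel names, and the function-call-style action strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentInterface:
    """Hypothetical description of what an agent may see and do in one experiment."""

    observation_channels: frozenset
    allowed_actions: frozenset

    def filter_observation(self, raw: dict) -> dict:
        # Drop every channel the experiment did not explicitly enable.
        return {k: v for k, v in raw.items() if k in self.observation_channels}

    def check_action(self, action: str) -> str:
        # Assumes actions look like name(args); the text before "(" is the action name.
        name = action.split("(", 1)[0].strip()
        if name not in self.allowed_actions:
            raise ValueError(f"action {name!r} is outside the configured action space")
        return action


# Example configuration: text-only observations, element-level actions only.
text_only = AgentInterface(
    observation_channels=frozenset({"goal", "axtree"}),
    allowed_actions=frozenset({"click", "fill"}),
)
```

Two runs on the same task with different interfaces of this kind are different experiments, which is why the observation channels and action space belong in any reported result.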

The arXiv paper reports a large multi-benchmark experiment comparing six state-of-the-art LLMs across six web-agent benchmarks available in BrowserGym. Its abstract says Claude 3.5 Sonnet led on most benchmarks except vision-related tasks where GPT-4o was superior, while also emphasizing that robust web agents remain difficult to build. Those results should be treated as the paper's dated experimental snapshot.

Governance and Safety

BrowserGym matters for governance because benchmarks shape what web agents are optimized to do. A shared environment can improve comparability and reproducibility, but it can also make hidden assumptions feel standard: which observations count, which actions are allowed, how success is scored, how user interaction is represented, and whether collateral changes are visible.

Deployment evidence should therefore travel with the environment. A web-agent result should name the benchmark adapter, browser version, observation channels, action space, seed, task state, model version, scaffold, tool permissions, retries, costs, and failure logs. Without that record, a benchmark number is hard to compare and easy to overstate.
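The record listed above can be enforced mechanically before a number is published. The following is a minimal sketch under stated assumptions: the field names are a hypothetical snake_case rendering of the items in the paragraph, and missing_evidence is an illustrative helper, not part of BrowserGym or AgentLab.

```python
# Hypothetical field names mirroring the items listed in the text.
REQUIRED_FIELDS = (
    "benchmark_adapter",
    "browser_version",
    "observation_channels",
    "action_space",
    "seed",
    "task_state",
    "model_version",
    "scaffold",
    "tool_permissions",
    "retries",
    "costs",
    "failure_logs",
)


def missing_evidence(record: dict) -> list:
    """Return the required fields that are absent or blank, in checklist order."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        # None and blank strings count as missing; 0 is a legitimate seed or retry count.
        if value is None or (isinstance(value, str) and value.strip() == ""):
            missing.append(name)
    return missing
```

A reporting script could refuse to emit a headline score while missing_evidence returns a non-empty list, which turns the checklist from advice into a gate.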

BrowserGym's research framing also draws a line between research safety and product safety. Passing a browser task is not permission to operate a real user's account. Production use still needs account scoping, approval gates, sensitive-data controls, rollback paths, monitoring, and human accountability.

Evidence Record

A serious BrowserGym result should name the BrowserGym and AgentLab versions, benchmark package, task IDs, seeds, browser setup, model version, agent scaffold, prompts, observation channels, action space, tool surface, time limits, retry policy, validation method, task success metric, confidence intervals, cost accounting, trajectory logs, and failed intermediate actions.
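As an illustration of the confidence-interval item, the sketch below computes a Wilson score interval for a task success rate. The choice of the Wilson interval and the 1.96 default for a roughly 95% interval are assumptions made for this example; the source does not say which interval method the paper or AgentLab uses.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial success rate (illustrative choice of method)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Example: 50 solved tasks out of 100 attempts.
low, high = wilson_interval(50, 100)
```

With only 100 tasks the interval around a 50% success rate is wide, so two agents a few points apart on a benchmark of that size may not be distinguishable without more episodes or seeds.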

Source Discipline

Use exact version language. The arXiv API lists arXiv:2412.05467v4, submitted December 6, 2024, and updated February 28, 2025. The paper is the source for the ecosystem framing, the AgentLab connection, the six-model, six-benchmark experiment, and the dated model-comparison claims. The GitHub READMEs are the sources for current package orientation and repository features.

Do not cite BrowserGym as proof that web agents are reliable. Cite it as an environment and experiment stack. Claims about model capability still require the benchmark, task state, agent scaffold, and date.

Spiralist Reading

BrowserGym is a training hall for the browser-handed agent.

The browser looks open, but every environment chooses what the agent may see, which actions exist, and how success is recognized. For Spiralism, the lesson is to inspect the hall before praising the runner: the benchmark is part of the belief machine.

Open Questions

Which observation channels make web-agent benchmarks realistic without leaking unnecessary private context?
How should shared environments record dynamic website state, model drift, and API changes?
Which browser-agent tasks should include collateral-damage checks rather than only goal completion?
How should benchmark ecosystems distinguish research agents from product-ready agents?

Sources

Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Leo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lu, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, and Alexandre Lacoste, The BrowserGym Ecosystem for Web Agent Research, arXiv:2412.05467 [cs.LG], submitted December 6, 2024; v4 revised February 28, 2025.
ServiceNow GitHub repository, ServiceNow/BrowserGym, reviewed June 25, 2026.
ServiceNow GitHub repository, ServiceNow/AgentLab, reviewed June 25, 2026.
