Wiki · Concept · Last reviewed June 25, 2026

OSWorld

OSWorld is a real-computer benchmark for testing whether multimodal agents can complete open-ended tasks across desktop applications, websites, files, and operating-system state.

Category: AI evaluations
Published: June 25, 2026
Modified: June 25, 2026
Last reviewed: June 25, 2026
Tags: OSWorld, computer use, desktop agents, multimodal agents, benchmarks

Definition

OSWorld is a benchmark and real computer environment for evaluating multimodal agents on open-ended computer-use tasks. The core paper is OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments, arXiv:2404.07972, by Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. It was submitted to arXiv on April 11, 2024 and last revised as version 2 on May 30, 2024.

The benchmark is positioned between browser-only tasks and fully unbounded personal-computer use. The paper describes OSWorld as supporting task setup, execution-based evaluation, and interactive learning across Ubuntu, Windows, and macOS. That operating-system framing is the important distinction: the agent is not merely reading a webpage or answering a static question; it must act through a computer environment whose state can change.

Benchmark Scope

The original OSWorld benchmark contains 369 computer tasks. The arXiv abstract says those tasks involve real web and desktop applications, open domains, operating-system file input and output, and workflows that span multiple applications. The project site adds that each task includes setup configuration and evaluation logic. It separately notes that some Google Drive tasks may need manual setup; excluding them leaves a 361-task evaluation, eight fewer than the full suite. A report should therefore say which of the two denominators its pass rate uses.
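The effect of that exclusion on a reported rate can be sketched in a few lines. This is a minimal illustration, not OSWorld's own tooling: the Task class, the needs_manual_setup flag, and the synthetic task IDs are hypothetical, and the count of eight flagged tasks is simply 369 minus 361.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Task:
    task_id: str
    # Hypothetical flag standing in for "may require manual setup".
    needs_manual_setup: bool = False


def evaluation_set(tasks: List[Task], include_manual: bool) -> List[Task]:
    """Return the tasks that will actually be scored."""
    return [t for t in tasks if include_manual or not t.needs_manual_setup]


def success_rate(passed: int, total: int) -> float:
    """Success rate as a percentage of the scored tasks."""
    return 100.0 * passed / total


# Synthetic stand-in for the suite: 369 tasks, 8 of them flagged.
tasks = [Task(f"task-{i:03d}", needs_manual_setup=(i < 8)) for i in range(369)]
full = evaluation_set(tasks, include_manual=True)
reduced = evaluation_set(tasks, include_manual=False)

print(len(full), len(reduced))
# The same number of passes yields a different rate on each denominator.
print(success_rate(40, len(full)), success_rate(40, len(reduced)))
```

The point of the sketch is only that the denominator belongs in the report; the pass count of 40 is an arbitrary example value.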

This scope makes OSWorld adjacent to WebArena, BrowserGym, WorkArena, and MCPWorld, but not a duplicate of them. WebArena centers on realistic self-hosted websites. WorkArena narrows the question to ServiceNow knowledge-work tasks. BrowserGym offers a shared browser-agent environment. MCPWorld compares GUI, API, and hybrid routes through instrumented desktop applications. OSWorld's main contribution is the broader real-computer setting, where files, windows, websites, apps, and operating-system state can all matter in one task.

Environment and Evaluation

OSWorld's environment exists to make computer-use evaluation repeatable. The project page describes a configuration-driven infrastructure for initializing a task, letting an agent interact, post-processing the run, retrieving files and information, and executing an evaluation function. The GitHub repository distributes the environment, benchmark code, task data, setup guidance, baseline-agent code, and result tooling.
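The lifecycle described above (initialize, interact, post-process, retrieve, evaluate) can be sketched as a toy episode loop. Everything here is an assumption made for illustration: TaskConfig, RunResult, run_episode, the "write:<path>:<content>" action format, and the dictionary standing in for machine state are hypothetical and are not the repository's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, str]  # toy stand-in for files and application state


@dataclass
class TaskConfig:
    task_id: str
    instruction: str
    setup: Callable[[State], None]     # prepares the initial state
    evaluate: Callable[[State], bool]  # execution-based check on final state
    max_steps: int = 15                # hypothetical step budget


@dataclass
class RunResult:
    task_id: str
    success: bool
    actions: List[str] = field(default_factory=list)


def apply_action(state: State, action: str) -> None:
    """Toy executor: only understands 'write:<path>:<content>'."""
    kind, _, rest = action.partition(":")
    if kind == "write":
        path, _, content = rest.partition(":")
        state[path] = content


def run_episode(config: TaskConfig, agent: Callable[[str, State], str]) -> RunResult:
    state: State = {}
    config.setup(state)                      # 1. initialize the task
    actions: List[str] = []
    for _ in range(config.max_steps):        # 2. let the agent interact
        action = agent(config.instruction, dict(state))
        if action == "DONE":
            break
        actions.append(action)
        apply_action(state, action)
    final_state = dict(state)                # 3-4. post-process and retrieve state
    success = config.evaluate(final_state)   # 5. run the evaluation function
    return RunResult(config.task_id, success, actions)


def scripted_agent(instruction: str, observation: State) -> str:
    """A fixed policy used only to exercise the loop."""
    if "report.txt" not in observation:
        return "write:report.txt:done"
    return "DONE"


config = TaskConfig(
    task_id="demo-001",
    instruction="Create report.txt containing the word done.",
    setup=lambda s: s.update({"notes.txt": "draft"}),
    evaluate=lambda s: s.get("report.txt") == "done",
)
result = run_episode(config, scripted_agent)
print(result)
```

The design choice worth noting is that success is decided by inspecting the final state, not by trusting the agent's own claim that it finished.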

The repository setup instructions show how much infrastructure is hidden behind a simple benchmark score. OSWorld can be run through VMware or VirtualBox on a desktop or laptop, and the README also documents Docker support for servers with KVM. Those details matter because computer-use agents are sensitive to screen resolution, virtualization, app versions, network access, credentials, timing, and environment cleanup.

Results as Boundary Markers

The original paper's reported performance gap is a warning against treating computer-use ability as solved. In the arXiv abstract, humans complete 72.36 percent of the tasks, while the best evaluated model reaches 12.24 percent success. The paper attributes much of the difficulty to GUI grounding and operational knowledge. Those numbers date from the paper's experiments, but they remain useful as boundary markers: in this benchmark, seeing the screen and producing actions was far from enough.
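The size of that gap is easy to work out from the two figures in the abstract. The short sketch below does only that arithmetic; the function names are illustrative, and no per-task counts are assumed.

```python
HUMAN_SUCCESS = 72.36       # percent, from the arXiv abstract
BEST_MODEL_SUCCESS = 12.24  # percent, from the arXiv abstract


def gap_in_points(human: float, model: float) -> float:
    """Absolute gap in percentage points."""
    return human - model


def fraction_of_human(human: float, model: float) -> float:
    """Model success rate as a fraction of the human success rate."""
    return model / human


gap = gap_in_points(HUMAN_SUCCESS, BEST_MODEL_SUCCESS)
share = fraction_of_human(HUMAN_SUCCESS, BEST_MODEL_SUCCESS)
print(f"gap: {gap:.2f} percentage points")
print(f"model reaches {share:.1%} of the human rate")
```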

Governance and Safety

OSWorld is valuable for governance because it turns messy desktop agency into inspectable runs. A failed task can be traced through screenshots, actions, files, application state, and evaluator output. That makes the benchmark useful for asking whether a system misunderstood the instruction, clicked the wrong target, skipped an intermediate check, lacked application knowledge, or took an unsafe shortcut.

The same structure also limits what can be claimed. An OSWorld score is evidence about one specific combination of model, agent scaffold, action space, task suite, environment version, and evaluator. It is not evidence that the agent should receive standing authority over a user's laptop, cloud drive, email account, payroll tool, source repository, or production workstation. Real deployment still needs permissions, approval gates, sandboxing, credential handling, monitoring, rollback paths, and human accountability.

Evidence Record

A serious OSWorld report should name the paper or dataset version, task IDs, operating system, virtual-machine or container provider, screen resolution, installed application versions, setup scripts, credential assumptions, model version, agent scaffold, observation channels, action space, step budget, time budget, retry policy, evaluator scripts, logs, screenshots, recordings, and failed intermediate actions. Without that record, a pass rate is difficult to interpret and nearly impossible to reproduce.
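One way to make that checklist enforceable is a completeness check over a run record. The sketch below is an assumption about how such a check might look: the field names are invented from the list above, the record is a plain dictionary, and the sample values are placeholders, not real results.

```python
from typing import Any, Dict, List

# Field names are hypothetical; they mirror the items listed in the paragraph above.
REQUIRED_FIELDS = (
    "paper_or_dataset_version", "task_ids", "operating_system",
    "vm_or_container_provider", "screen_resolution", "application_versions",
    "setup_scripts", "credential_assumptions", "model_version",
    "agent_scaffold", "observation_channels", "action_space",
    "step_budget", "time_budget", "retry_policy", "evaluator_scripts",
    "logs", "screenshots", "recordings", "failed_intermediate_actions",
)


def is_blank(value: Any) -> bool:
    """Treat None and empty strings or containers as missing; 0 is a valid value."""
    if value is None:
        return True
    return isinstance(value, (str, list, tuple, dict, set)) and len(value) == 0


def missing_fields(record: Dict[str, Any]) -> List[str]:
    """Return the required fields that are absent or blank, in checklist order."""
    return [name for name in REQUIRED_FIELDS if is_blank(record.get(name))]


record = {
    "paper_or_dataset_version": "arXiv:2404.07972v2",
    "model_version": "example-model",  # placeholder, not a real model name
    "step_budget": 15,                 # placeholder value
    "task_ids": [],                    # present but empty, so still missing
}
missing = missing_fields(record)
print(f"{len(missing)} of {len(REQUIRED_FIELDS)} fields missing")
print(missing[:3])
```

A report pipeline could refuse to publish a pass rate while missing_fields returns anything, which keeps the score attached to the conditions that produced it.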

Source Discipline

Use arXiv:2404.07972 for the title, authors, dates, task count, operating-system scope, and original performance numbers. Use the OSWorld project site for the environment overview, data statistics, and benchmark description. Use the xlang-ai/OSWorld repository for implementation, setup, provider, and run-record details. Do not fold later project updates into an article dated before those updates appeared.

Spiralist Reading

OSWorld is a chamber where the agent is given hands: a cursor, a keyboard, files, windows, settings, and the promise that completion can be measured. Spiralism reads it as an antidote to vague claims about digital labor. The question is not whether the model sounds competent. The question is what it does when the screen is real enough to be changed.

Open Questions

Which OSWorld-style tasks should include explicit collateral-damage checks, not only goal-completion checks?
How should benchmark reports preserve virtual-machine images, app versions, credentials, and network dependencies?
When should computer-use benchmarks require modeled human approval before irreversible actions?
How should OSWorld results be compared with BrowserGym, WebArena, WorkArena, MCPWorld, OSGuard, and incident evidence?
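
As one possible shape for the collateral-damage check raised in the first question, the sketch below compares file snapshots taken before and after a run and flags any change outside the paths the task was allowed to touch. It is a hypothetical illustration, not part of OSWorld: the snapshot format (path mapped to a content hash), the function names, and the sample data are all assumptions.

```python
from typing import Dict, Iterable, List

Snapshot = Dict[str, str]  # path -> content hash (format assumed for this sketch)


def collateral_changes(before: Snapshot, after: Snapshot,
                       allowed_paths: Iterable[str]) -> List[str]:
    """Paths that changed, appeared, or vanished outside allowed_paths."""
    touched = {
        path for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    }
    return sorted(touched - set(allowed_paths))


def evaluate_with_side_effects(goal_met: bool, before: Snapshot, after: Snapshot,
                               allowed_paths: Iterable[str]) -> dict:
    """Pass only if the goal is met and nothing else was disturbed."""
    damage = collateral_changes(before, after, allowed_paths)
    return {"goal_met": goal_met, "collateral": damage,
            "passed": goal_met and not damage}


before = {"report.txt": "h1", "budget.xlsx": "h2"}
after = {"report.txt": "h9", "summary.txt": "h3"}  # budget.xlsx was deleted
verdict = evaluate_with_side_effects(
    goal_met=True, before=before, after=after,
    allowed_paths={"report.txt", "summary.txt"},
)
print(verdict)
```

In this example the goal check alone would pass, while the combined check fails because an unrelated file disappeared.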

Sources

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu, OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments, arXiv:2404.07972 [cs.AI], submitted April 11, 2024; version 2 revised May 30, 2024.
OSWorld project site, OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments, reviewed June 25, 2026.
xlang-ai GitHub repository, xlang-ai/OSWorld, reviewed June 25, 2026.
