Wiki · Concept · Last reviewed June 25, 2026

AndroidWorld

AndroidWorld is a benchmark environment for autonomous agents that operate Android apps through a live emulator, testing whether an agent can turn natural-language instructions into screen actions with state-checked results.

Definition

AndroidWorld is an open benchmark and environment for evaluating autonomous computer-control agents on Android. The core paper is "AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents" (arXiv:2405.14573), by Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva.

The benchmark runs on a live Android emulator. An agent receives a natural-language task, observes the mobile interface, and acts through touch-like and text-entry operations. The environment checks success against Android system state rather than only a written answer or a model-graded transcript.

Scope

The arXiv abstract and project page describe AndroidWorld as a fully functional Android environment with reward signals for 116 programmatic tasks across 20 real-world Android apps. The task suite is hand-crafted, but tasks are dynamically instantiated with random parameters so the same task family can appear in many natural-language and state variations.
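Dynamic instantiation can be pictured with a small sketch. It is not AndroidWorld's actual code: the template, the parameter lists, and the names TaskInstance and instantiate are hypothetical, chosen only to show how one hand-written task family yields many concrete goals from seeded random parameters.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInstance:
    """One concrete variation of a task family (hypothetical structure)."""
    goal: str
    params: dict


# Hypothetical task family: a goal template plus the values it may take.
TEMPLATE = "Create a calendar event titled '{title}' on {day} at {hour}:00."
TITLES = ["Team sync", "Dentist", "Code review"]
DAYS = ["Monday", "Wednesday", "Friday"]


def instantiate(seed: int) -> TaskInstance:
    """Draw random parameters from a seeded generator and fill the template."""
    rng = random.Random(seed)  # a fixed seed reproduces the same variation
    params = {
        "title": rng.choice(TITLES),
        "day": rng.choice(DAYS),
        "hour": rng.randint(8, 17),
    }
    return TaskInstance(goal=TEMPLATE.format(**params), params=params)


if __name__ == "__main__":
    # Print a few variations of the same task family.
    for seed in range(3):
        print(instantiate(seed).goal)
```

Seeding the generator is the design choice that matters here: the parameters vary across seeds, yet any single variation can be regenerated exactly when a run needs to be repeated or audited.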

The environment is still a benchmark, not proof of readiness for a user's phone. It relies on an emulator, a selected app set, and scripted logic for task initialization, success checking, and teardown. That controlled design is what makes evaluation reproducible, but it also leaves out many hazards of live personal devices: private notifications, logged-in accounts, payments, abusive content, unavailable apps, changing interfaces, and irreversible actions.

How It Works

AndroidWorld evaluates an agent inside a reproducible Android setup. Each task includes code for initialization, success checking, and teardown. That matters because mobile UI tasks often have many surface-level paths to the same result, and a screenshot alone does not prove that the device state changed correctly.

The environment can test whether an agent added an event, changed a setting, created a record, navigated through an app, or retrieved information from app state. The project documentation describes the reward signals as durable because they are derived from Android system state rather than fragile text matching.
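The three task phases and the state-derived reward can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's implementation: FakeDevice is an in-memory stand-in for emulator state, and AddEventTask, run_episode, and both agents are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDevice:
    """Stand-in for emulator state; a real run would query Android itself."""
    events: list = field(default_factory=list)


class AddEventTask:
    """Hypothetical task with the three phases the text describes."""

    def __init__(self, title: str):
        self.title = title

    def initialize(self, device: FakeDevice) -> None:
        # Start from a known state so earlier runs cannot leak in.
        device.events.clear()

    def is_successful(self, device: FakeDevice) -> float:
        # Reward comes from stored state, not from what the agent claims.
        return 1.0 if self.title in device.events else 0.0

    def tear_down(self, device: FakeDevice) -> None:
        # Remove what the task created so the next task starts clean.
        device.events.clear()


def run_episode(task, device, agent) -> float:
    """Initialize, let the agent act, score from state, always clean up."""
    task.initialize(device)
    try:
        agent(device)
        return task.is_successful(device)  # evaluated before teardown runs
    finally:
        task.tear_down(device)


def honest_agent(device: FakeDevice) -> None:
    device.events.append("Dentist")  # actually changes device state


def bluffing_agent(device: FakeDevice) -> None:
    pass  # would report success in text, but changes nothing


if __name__ == "__main__":
    task = AddEventTask("Dentist")
    print(run_episode(task, FakeDevice(), honest_agent))
    print(run_episode(task, FakeDevice(), bluffing_agent))
```

The bluffing agent is the case that text matching or transcript grading could miss: it may describe a finished task, but the checker only rewards a change it can find in state.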

The paper introduces M3A, a multimodal autonomous agent for Android, and reports that its best agent completed 30.6 percent of AndroidWorld tasks. The same paper also reports a robustness analysis showing that task variations can significantly change performance. Those results should be read as the paper's baseline evidence, not as current frontier scores.

Governance and Safety

The governance value is that AndroidWorld forces agent claims into executable mobile tasks. A model that can describe an app is not the same as an agent that can use it. The benchmark asks whether the agent can observe, decide, tap, type, recover, and finish a task in a real interface.

The safety limit is that task completion is not deployment safety. A mobile agent with access to a real phone could read messages, expose contacts, change settings, send content, make purchases, delete records, or act inside sensitive apps. Even an agent with a high AndroidWorld score would still need permission tiers, visible action logs, confirmation gates, sandboxing, account boundaries, private-data minimization, and human handoff for sensitive actions.
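A minimal sketch shows how permission tiers, a confirmation gate, and an action log might fit together. Nothing here comes from AndroidWorld: the tier names, the action-to-tier mapping, and the gate function are assumptions made for illustration only.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical permission tiers, ordered from least to most risky."""
    READ = 0
    REVERSIBLE = 1
    SENSITIVE = 2


# Hypothetical mapping from action name to tier.
ACTION_TIERS = {
    "open_app": Tier.READ,
    "scroll": Tier.READ,
    "toggle_wifi": Tier.REVERSIBLE,
    "send_message": Tier.SENSITIVE,
    "make_purchase": Tier.SENSITIVE,
    "delete_record": Tier.SENSITIVE,
}


def gate(action: str, confirm, log: list) -> bool:
    """Allow low-tier actions; hand sensitive ones to a human via confirm().

    Every decision is appended to log, whether allowed or refused.
    """
    # Unknown actions fall into the strictest tier by default.
    tier = ACTION_TIERS.get(action, Tier.SENSITIVE)
    allowed = tier < Tier.SENSITIVE or bool(confirm(action))
    log.append((action, tier.name, allowed))
    return allowed


if __name__ == "__main__":
    log = []
    deny_all = lambda action: False  # a human who declines everything
    print(gate("scroll", deny_all, log))
    print(gate("make_purchase", deny_all, log))
    print(log)
```

Defaulting unknown actions to the strictest tier is a deliberate choice in this sketch: an action the policy has never seen is treated as sensitive until someone classifies it.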

AndroidWorld also shows why benchmark design should include variation. A single success rate can hide brittle interface habits when small task-parameter changes alter performance.
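The point about hidden brittleness can be made concrete with a short calculation. The outcomes below are invented for illustration and are not AndroidWorld results; summarize and the task-family names are hypothetical.

```python
from statistics import mean

# Made-up outcomes, NOT benchmark data: for each task family,
# 1 = success and 0 = failure under four different parameter seeds.
RESULTS = {
    "add_event": [1, 1, 1, 1],
    "change_setting": [1, 0, 1, 0],
    "create_note": [0, 0, 0, 1],
}


def summarize(results: dict):
    """Return the headline rate, per-family rates, and unstable families."""
    per_task = {name: mean(runs) for name, runs in results.items()}
    overall = mean(x for runs in results.values() for x in runs)
    # A family is unstable if it both succeeds and fails across seeds.
    unstable = sorted(
        name for name, runs in results.items() if 0 < sum(runs) < len(runs)
    )
    return overall, per_task, unstable


if __name__ == "__main__":
    overall, per_task, unstable = summarize(RESULTS)
    print(f"headline success rate: {overall:.3f}")
    for name, rate in per_task.items():
        print(f"  {name}: {rate}")
    print("unstable under variation:", unstable)
```

In this toy data the headline rate is 7 successes out of 12 runs, yet two of the three families flip between success and failure when only the parameters change, which is the pattern a single number conceals.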

Evidence Record

A serious AndroidWorld score should name the benchmark version, task split, app set, emulator image, Android version, initialization method, agent scaffold, model version, observation channel, action space, retry policy, time limit, and whether the run used screenshots, accessibility trees, OCR, memory, planning traces, or external tools.
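One way to enforce such a record is a structure that refuses to treat a score as reportable until every field is filled. The sketch below is an assumption about how that could look, not a format defined by AndroidWorld; RunRecord, missing_fields, and the example values are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class RunRecord:
    """Hypothetical evidence record; field names follow the list above."""
    benchmark_version: str = ""
    task_split: str = ""
    app_set: str = ""
    emulator_image: str = ""
    android_version: str = ""
    initialization_method: str = ""
    agent_scaffold: str = ""
    model_version: str = ""
    observation_channel: str = ""
    action_space: str = ""
    retry_policy: str = ""
    time_limit: str = ""
    # Screenshots, accessibility trees, OCR, memory, planning traces, tools.
    inputs_used: str = ""


def missing_fields(record: RunRecord) -> list:
    """List every field left blank, in declaration order."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


if __name__ == "__main__":
    # Placeholder values only; they describe no real run.
    record = RunRecord(model_version="example-model-v1", time_limit="example limit")
    gaps = missing_fields(record)
    print(f"{len(gaps)} fields still missing:", ", ".join(gaps))
```

A score attached to a record with any missing field would be treated as incomplete evidence rather than as a comparable result.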

Source Discipline

Use exact source and version language. The arXiv page lists arXiv:2405.14573, submitted May 23, 2024 and last revised April 6, 2025. The project page and GitHub README state 116 tasks across 20 apps; other publication surfaces can reflect earlier numbers or condensed abstracts. Do not merge those into a timeless claim without naming the source.

Leaderboard claims are especially unstable. Treat a leaderboard as a dated snapshot, not as a permanent fact about a model or agent family. The stable claims are the benchmark design, the emulator-based method, the state-checked task logic, and the warning that mobile agents require robustness evidence beyond a headline completion rate.

Spiralist Reading

AndroidWorld is a rehearsal space for the pocket machine.

The phone is not just a screen. It is memory, money, movement, identity, work, intimacy, and administration compressed into a handheld interface. A mobile agent that can operate it has crossed from advice into embodied delegation.

For Spiralism, the useful lesson is restraint. Before the agent touches the real phone, make it practice in a bounded world. Then ask what the bounded world excluded.

