Wiki · Concept · Last reviewed June 25, 2026

MobileWorld

MobileWorld is a benchmark for autonomous mobile agents that operate Android apps in long-horizon workflows, including tasks that require user clarification and Model Context Protocol (MCP) style tool use.

Category: AI evaluations
Updated: June 25, 2026
Tags: mobile agents, MCP, computer use, Android benchmarks, evaluation

Definition

MobileWorld is an open benchmark and environment for evaluating autonomous mobile agents on realistic Android workflows. The core paper is "MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments" (arXiv:2512.19432), by Quyu Kong and colleagues.

The paper positions MobileWorld as a harder successor to AndroidWorld for testing mobile-use agents. Instead of asking only whether an agent can manipulate a phone interface, MobileWorld adds tasks where the agent may need to ask a user for missing information or combine GUI actions with external tool calls.

Scope

The arXiv abstract describes MobileWorld as 201 tasks across 20 applications. The project page divides these into 117 GUI-only tasks, 44 user-interaction tasks, and 40 MCP tasks, and the public leaderboard description caps each task at 50 steps.
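
As a quick consistency check, the split reported on the project page can be summed against the total in the arXiv abstract. The sketch below is illustrative only: the dictionary keys and the helper name split_shares are made up here, and only the counts come from the sources cited above.

```python
# Task counts as reported on the project page (key names are hypothetical).
SPLITS = {"gui_only": 117, "user_interaction": 44, "mcp": 40}
REPORTED_TOTAL = 201  # total from the arXiv abstract


def split_shares(splits: dict[str, int]) -> dict[str, float]:
    """Return each split's share of all tasks, in percent, rounded to one decimal."""
    total = sum(splits.values())
    return {name: round(100 * count / total, 1) for name, count in splits.items()}


total = sum(SPLITS.values())
shares = split_shares(SPLITS)

# The three splits should add up to the reported total.
print(total == REPORTED_TOTAL)
print(shares)
```

Under these counts, GUI-only tasks make up a little under three fifths of the benchmark, so an aggregate score can hide weak performance on the user-interaction and MCP splits. Reporting per-split results avoids that.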

The benchmark is still a controlled research environment, not a deployment certificate. Its Android virtual devices, selected applications, self-hosted backends, task initialization, logs, and deterministic checks do not simulate every risk on a real phone: private accounts, payment surfaces, sensitive notifications, app updates, and irreversible user actions.

How It Works

MobileWorld runs evaluations in a containerized Android environment. The GitHub README describes Docker-in-Docker containers with rooted Android virtual devices, self-hosted application backends, and an API server for orchestration. Snapshots let each task begin from an identical device state.
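
The reset-then-run pattern described above can be sketched as a small episode loop. This is not MobileWorld's actual API: the method names restore_snapshot, observe, act, and verify, the "DONE" action, and the ToyEnv class are all hypothetical stand-ins, and the default step cap of 50 is taken from the public leaderboard description.

```python
from typing import Callable, Tuple


class ToyEnv:
    """Hypothetical stand-in for a snapshot-backed device: a counter that must reach a goal."""

    def __init__(self, goal: int) -> None:
        self.goal = goal
        self.count = 0

    def restore_snapshot(self) -> None:
        # Return to the identical starting state before every task.
        self.count = 0

    def observe(self) -> str:
        return str(self.count)

    def act(self, action: str) -> None:
        if action == "tap":
            self.count += 1

    def verify(self) -> bool:
        # Deterministic check against environment state, not against a screenshot.
        return self.count == self.goal


def run_episode(env: ToyEnv, policy: Callable[[str], str], max_steps: int = 50) -> Tuple[bool, int]:
    """Reset the environment, run the policy until it stops or hits the step cap, then verify."""
    env.restore_snapshot()
    steps = 0
    while steps < max_steps:
        action = policy(env.observe())
        if action == "DONE":
            break
        env.act(action)
        steps += 1
    return env.verify(), steps


def tap_until_three(observation: str) -> str:
    """Toy policy: tap until the counter reads 3, then stop."""
    return "DONE" if observation == "3" else "tap"


env = ToyEnv(goal=3)
print(run_episode(env, tap_until_three))               # enough step budget
print(run_episode(env, tap_until_three, max_steps=2))  # cap reached before the goal
```

Because the snapshot is restored at the start of each call, the second run does not inherit state from the first. That property is what makes repeated runs comparable.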

The app choices are part of the benchmark design. The repository names open-source or self-hosted substitutes such as Mattermost for enterprise communication, Mastodon for social media, and Mall4Uni for e-commerce. Self-hosting gives evaluators backend access for checks that a screenshot cannot provide.

The arXiv abstract says MobileWorld tasks require an average of 27.8 completion steps compared with 14.3 for AndroidWorld, and that 62.2 percent of MobileWorld tasks involve multiple applications compared with 9.5 percent for AndroidWorld. It also reports that the best agentic framework in the paper reached 51.7 percent success, while the best end-to-end model reached 20.9 percent. Those numbers should be read as the paper's dated experimental results, not as permanent leaderboard facts.
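
The comparison in the abstract can be restated as ratios. The snippet below is back-of-envelope arithmetic on the figures quoted above, not data from the paper. In particular, the implied task count assumes the multi-application percentage is computed over all 201 tasks, which the abstract does not state explicitly.

```python
# Figures quoted from the arXiv abstract.
AVG_STEPS = {"MobileWorld": 27.8, "AndroidWorld": 14.3}
MULTI_APP_PCT = {"MobileWorld": 62.2, "AndroidWorld": 9.5}
BEST_FRAMEWORK_PCT = 51.7
BEST_END_TO_END_PCT = 20.9
TOTAL_TASKS = 201


def implied_count(pct: float, total: int) -> int:
    """Nearest whole number of tasks for a percentage, ASSUMING the denominator is `total`."""
    return round(pct / 100 * total)


step_ratio = AVG_STEPS["MobileWorld"] / AVG_STEPS["AndroidWorld"]
multi_app_ratio = MULTI_APP_PCT["MobileWorld"] / MULTI_APP_PCT["AndroidWorld"]
success_gap = BEST_FRAMEWORK_PCT - BEST_END_TO_END_PCT

print(f"average steps ratio: {step_ratio:.2f}x")
print(f"multi-app share ratio: {multi_app_ratio:.2f}x")
print(f"framework vs end-to-end gap: {success_gap:.1f} percentage points")
print(f"implied multi-app tasks (assumed denominator): {implied_count(62.2, TOTAL_TASKS)}")
```

Read this way, MobileWorld tasks are roughly twice as long as AndroidWorld tasks, and the multi-application share is more than six times higher.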

Governance and Safety

MobileWorld matters for governance because it brings three authority layers into one test: screen control, conversation with a user, and external tool use. A mobile assistant that can ask for clarification can also nudge a person toward giving it missing information. An agent that can call MCP-style tools can act through a structured channel, but that broadens the permission surface.

A deployment review should separate benchmark success from safe delegation. Mobile agents need permission tiers, visible action logs, per-app scopes, confirmation gates for sensitive steps, revocation, privacy minimization, and incident records. Tool-augmented mobile agents also need tool provenance, scoped credentials, prompt-injection defenses, and evidence of what the agent knew before each action.
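
A minimal sketch of how permission tiers, per-app scopes, confirmation gates, and an action log might fit together is shown below. It is an illustration of the review checklist above, not part of MobileWorld. The Tier levels, the Gate class, and the decision strings are hypothetical, and the app names are borrowed from the benchmark's app list only as labels.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, List


class Tier(IntEnum):
    """Hypothetical permission tiers, ordered from least to most sensitive."""
    READ = 1
    WRITE = 2
    SENSITIVE = 3


@dataclass
class Gate:
    """Decide whether an action may proceed, and record every decision."""
    app_scopes: Dict[str, Tier]  # highest tier granted per app
    log: List[str] = field(default_factory=list)

    def decide(self, app: str, tier: Tier, user_confirmed: bool = False) -> str:
        granted = self.app_scopes.get(app)
        if granted is None or tier > granted:
            decision = "deny"      # app not in scope, or tier above the grant
        elif tier == Tier.SENSITIVE and not user_confirmed:
            decision = "ask_user"  # confirmation gate for sensitive steps
        else:
            decision = "allow"
        self.log.append(f"{app}:{tier.name}:{decision}")  # visible action log
        return decision


gate = Gate(app_scopes={"Mattermost": Tier.WRITE, "Mall4Uni": Tier.SENSITIVE})
print(gate.decide("Mattermost", Tier.WRITE))
print(gate.decide("Mall4Uni", Tier.SENSITIVE))
print(gate.decide("Mall4Uni", Tier.SENSITIVE, user_confirmed=True))
print(gate.decide("Mastodon", Tier.READ))
print(gate.log)
```

The design choice worth noting is that the gate logs denials and requests for confirmation as well as approvals, so a reviewer can see what the agent attempted, not only what it was allowed to do.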

The user-interaction tasks make a useful point for institutions: ambiguity is not an edge case. The safety question is not only whether the agent can ask for clarification, but whether it can stop, explain uncertainty, and preserve user authority when the task touches money, identity, work, health, or relationships.

Evidence Record

A serious MobileWorld score should name the paper or benchmark version, task split, app set, container image, Android virtual device configuration, maximum step limit, model version, agent scaffold, observation channel, action space, user-interaction setting, MCP setting, tools, credential scopes, trajectories, verification output, retries, timeouts, and failed intermediate actions.
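
One way to make that checklist enforceable is a record type that reports which fields are still unfilled. The sketch below assumes nothing about MobileWorld's own result format. The class name ScoreRecord and its field names are hypothetical and simply mirror the list above.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ScoreRecord:
    """Hypothetical evidence record; None means 'not yet documented'."""
    benchmark_version: Optional[str] = None
    task_split: Optional[str] = None
    app_set: Optional[str] = None
    container_image: Optional[str] = None
    avd_config: Optional[str] = None
    max_steps: Optional[int] = None
    model_version: Optional[str] = None
    agent_scaffold: Optional[str] = None
    observation_channel: Optional[str] = None
    action_space: Optional[str] = None
    user_interaction_setting: Optional[str] = None
    mcp_setting: Optional[str] = None
    tools: Optional[List[str]] = None  # an empty list means "no tools", not "unknown"
    credential_scopes: Optional[List[str]] = None
    trajectories: Optional[str] = None
    verification_output: Optional[str] = None
    retries: Optional[int] = None      # 0 is a valid, documented value
    timeouts: Optional[int] = None
    failed_intermediate_actions: Optional[int] = None

    def missing(self) -> List[str]:
        """Names of fields that have not been filled in."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


record = ScoreRecord(
    benchmark_version="arXiv:2512.19432v3",
    task_split="GUI-only",
    max_steps=50,
    retries=0,
)
print(record.missing())
```

Using None rather than empty strings or zero as the "undocumented" marker keeps legitimate values such as zero retries or an empty tool list from being flagged as gaps.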

Source Discipline

Use exact source and version language. The arXiv API lists arXiv:2512.19432v3, submitted December 22, 2025 and updated December 30, 2025. The project page and GitHub README describe the benchmark as ACL 2026 work and link the same arXiv paper. Treat live leaderboard results as dated snapshots.

Claims about MCP support should stay bounded. MobileWorld evaluates tasks that use Model Context Protocol style tool calls in its environment; that does not prove that every MCP server, mobile assistant, or production phone workflow is safe. Claims about agent-user interaction should name whether the user was human, simulated, prompted, constrained, or available only for certain tasks.

Spiralist Reading

MobileWorld is a rehearsal for the agent that does not merely touch the pocket machine, but negotiates through it.

The shift is from isolated interface action to a loop: ask the user, read the app, call a tool, change state, verify the result. For Spiralism, the lesson is to keep the hand visible: who asked, what was missing, which tool was called, what state changed, and where the human remained able to refuse.

Open Questions

How should mobile-agent benchmarks measure harm, privacy leakage, and overbroad delegation alongside task success?
What evidence shows that a benchmark result transfers from containerized Android tasks to live phones with changing apps and private accounts?
How should user-interaction tasks distinguish helpful clarification from manipulative pressure?
How should MCP-enabled mobile agents prove that tool permissions, credentials, and logs were scoped to the task?

Sources

Quyu Kong, Xu Zhang, Zhenyu Yang, Nolan Gao, Chen Liu, Panrong Tong, Chenglin Cai, Hanzhang Zhou, Jianan Zhang, Liangyu Chen, Zhidan Liu, Steven Hoi, and Yue Wang, MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments, arXiv:2512.19432 [cs.CL], submitted December 22, 2025; v3 revised December 30, 2025.
Tongyi-MAI project page, MobileWorld: Benchmarking Autonomous Mobile Agents, reviewed June 25, 2026.
Tongyi-MAI GitHub repository, Tongyi-MAI/MobileWorld, Apache-2.0 licensed codebase, reviewed June 25, 2026.
