Last reviewed July 2, 2026

The Personal Desktop Becomes the Agent Exam

MyPCBench asks a harder question than most computer-use benchmarks: can an agent work inside a desktop that already belongs to someone? The result is a useful stress test for personal assistants, and a warning that logged-in-account competence is not the same thing as deployment readiness.

The Paper

The paper is MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents, arXiv:2606.16748 [cs.LG], by Lawrence Keunho Jang, Andrew Keunwoo Jang, Jing Yu Koh, and Ruslan Salakhutdinov. arXiv lists version 1 as submitted on June 15, 2026.

MyPCBench evaluates computer-use agents as personal assistants on a reproducible Linux desktop seeded with one coherent persona, Michael Scott from The Office. The benchmark includes 17 simulated real-world web applications, logged-in accounts, Firefox history and bookmarks, LibreOffice, a file system, and 184 tasks inspired by real OpenClaw community requests.

The project page and repository release the environment image, task set, rubrics, persona specification, and agent harness, while the paper reports model results across six closed- and open-weight systems using a uniform computer-plus-bash tool surface.

The Personalization Gap

Most computer-use benchmarks are deliberately impersonal. They use empty desktops, generic app states, isolated websites, or task-local data. That makes grading reproducible, but it misses the user problem that personal assistants are supposed to solve: reconciling facts across accounts, histories, preferences, routines, and records.

The paper's core claim is that a personal assistant cannot be evaluated only by asking whether it can click through a known workflow. It has to infer the user's usual restaurant, find the right contact, reconcile an itinerary against credit-card charges, or leave a visible artifact in a calendar, spreadsheet, email draft, transfer memo, or work chat.

The Desktop

MyPCBench packages a QEMU/KVM Ubuntu 24.04 GNOME VM with a real desktop stack. It includes 17 pre-logged-in web apps modeled on consumer services across computers and tech, finance, travel, food, ecommerce, and gambling, plus Firefox and LibreOffice.

The persona seed is deliberately cross-app. The paper describes 1,812 bank transactions, 2,398 emails, 679 calendar events, 2,526 chat and workplace messages, 126 rideshare requests, 402 food-delivery orders, 155 retail orders, 29 grocery orders, 32 restaurant reservations, 35 bookmarks, and 10,746 browser-history visits. The project page also describes 42,000 seeded records and 226 app database tables.

The key design feature is consistency. A trip can leave records in a booking app, bank account, calendar, inbox, chat thread, boarding pass, and browser history. The benchmark therefore tests whether an agent can follow a lived-in evidence trail rather than answer from one isolated source.

The Task Suite

The 184 tasks come from 2,749 anonymized and paraphrased OpenClaw Discord use cases, filtered for feasibility and then rewritten to match the seeded persona. The task set covers six behavioral categories: bounded action, multi-step orchestration, cross-source reconciliation, aggregation and reporting, personal lookup, and pattern inference.

The distribution matters. The paper reports 64 bounded-action tasks, 48 multi-step orchestration tasks, 25 cross-source reconciliation tasks, 23 aggregation and reporting tasks, 13 personal lookup tasks, and 11 pattern-inference tasks. Sixty-eight percent of tasks touch multiple applications, and 40 percent span at least two top-level SimilarWeb categories.

Each task has an audited natural-language rubric. The full suite contains 1,191 rubric items, with 3 to 13 criteria per task and a mean of 6.5. Reviewers ran every task end-to-end and checked that named entities existed, answers were obtainable from the environment, rubrics were checkable from screenshots, and tasks were not near-duplicates.
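The counts quoted in this section can be checked with a few lines of arithmetic. The sketch below is only a sanity check on the figures in this post, not anything taken from the paper's code; the dictionary keys and the function name are illustrative.

```python
# Category counts and rubric total as quoted in this post.
TASK_COUNTS = {
    "bounded action": 64,
    "multi-step orchestration": 48,
    "cross-source reconciliation": 25,
    "aggregation and reporting": 23,
    "personal lookup": 13,
    "pattern inference": 11,
}
TOTAL_RUBRIC_ITEMS = 1191


def category_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each category's share of all tasks, in percent (one decimal)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total_tasks = sum(TASK_COUNTS.values())
shares = category_shares(TASK_COUNTS)
mean_rubric_items = round(TOTAL_RUBRIC_ITEMS / total_tasks, 1)

print(total_tasks)
print(mean_rubric_items)
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The six categories sum to 184, and 1,191 rubric items over 184 tasks gives about 6.47, consistent with the reported mean of 6.5. Bounded action and multi-step orchestration together make up roughly 61 percent of the suite, so the harder reconciliation and inference categories are a minority of the tasks.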

Results

The benchmark uses provider-native computer-use agents mapped onto a shared OSWorld-style action space, with computer and bash available to all evaluated models. The paper reports three metrics: perfect rate, rubric score for partial credit, and trajectory efficiency.
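To make the three metrics concrete, here is a minimal sketch of how they could be computed from per-task rubric outcomes. The definitions are assumptions made for illustration: perfect rate as the fraction of tasks with every rubric item satisfied, rubric score as the mean per-task fraction of satisfied items, and efficiency as mean steps per task. The paper's exact formulas may differ, and `TaskResult` and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record of one graded task run."""
    task_id: str
    rubric_passed: list[bool]  # one entry per rubric item
    steps: int                 # number of agent actions taken


def perfect_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks where every rubric item passed (assumed definition)."""
    return sum(1 for r in results if all(r.rubric_passed)) / len(results)


def rubric_score(results: list[TaskResult]) -> float:
    """Mean per-task fraction of rubric items passed (assumed definition)."""
    per_task = [sum(r.rubric_passed) / len(r.rubric_passed) for r in results]
    return sum(per_task) / len(per_task)


def mean_steps(results: list[TaskResult]) -> float:
    """Average number of steps per task, a simple efficiency proxy."""
    return sum(r.steps for r in results) / len(results)


# Toy data, not taken from the paper.
results = [
    TaskResult("t1", [True, True, True], steps=20),
    TaskResult("t2", [True, False, True, False], steps=40),
]
print(perfect_rate(results), rubric_score(results), mean_steps(results))
```

Under these assumed definitions, a model can hold a high rubric score while its perfect rate stays low: partial credit accumulates even when one missing side effect keeps a task from being fully solved.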

Claude Opus 4.6 is the only model above 50 percent perfect, with a 55.4 percent perfect rate, an 81.8 percent rubric score, and 46.5 average steps. Claude Sonnet 4.6 follows at 39.1 percent perfect and a 65.4 percent rubric score. GPT-5.5 reaches 29.3 percent perfect and a 54.1 percent rubric score; GPT-5.4 mini reaches 19.0 and 48.8 percent; Qwen 3.5 35B-A3B reaches 7.6 and 42.5 percent; and Qwen 3.5 9B reaches 2.7 and 7.0 percent.

The long-horizon result is the most important governance signal. Claude Opus 4.6 solves only 36 percent of tasks that span seven or more applications. GPT-5.5 solves 4.5 percent of that slice, while GPT-5.4 mini and both Qwen models reach 0 percent. The hard part is not merely seeing a screen. It is maintaining a plan across many logged-in systems and leaving the right side effects behind.

Failure Modes

The failure catalog is useful because it names behaviors that product demos often hide. The project page summarizes premature DONE, skipped required apps, terminal surface errors, partial artifacts, and hallucinated persona data. The paper's broader-impact section emphasizes that numbers on this benchmark are not a clearance to deploy agents on production accounts.

Tool use is also not a simple win. Some agents use bash to find information but fail to perform the visible UI action the rubric requires. A shell answer can be correct while still leaving no calendar event, no saved spreadsheet, no sent message, no posted charge, and no durable artifact for the user or auditor.

That is why MyPCBench is interesting: it grades the difference between knowing and doing. Personal assistance is not only retrieval. It is accountable action inside an environment where history, authority, evidence, and side effects are entangled.

Governance Standard

A personal computer-use benchmark should produce a deployment receipt, not only a model score. The receipt should name the persona model, app list, account state, seeded records, task source, task rubric, tool surface, reset policy, action logs, screenshots, judge model, rubric outcomes, visible side effects, and failure taxonomy.
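One way to picture such a receipt is as a structured record with a completeness check. The sketch below is an assumption about shape only: the field names mirror the list above, while `DeploymentReceipt` and `missing_fields` are hypothetical names, not part of the MyPCBench release.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DeploymentReceipt:
    """Hypothetical receipt; every field mirrors an item named in the text."""
    persona_model: str = ""
    app_list: list[str] = field(default_factory=list)
    account_state: str = ""
    seeded_records: str = ""
    task_source: str = ""
    task_rubric: list[str] = field(default_factory=list)
    tool_surface: str = ""
    reset_policy: str = ""
    action_logs: list[str] = field(default_factory=list)
    screenshots: list[str] = field(default_factory=list)
    judge_model: str = ""
    rubric_outcomes: list[bool] = field(default_factory=list)
    visible_side_effects: list[str] = field(default_factory=list)
    failure_taxonomy: list[str] = field(default_factory=list)


def missing_fields(receipt: DeploymentReceipt) -> list[str]:
    """Return the names of fields that are still empty."""
    return [f.name for f in fields(receipt) if not getattr(receipt, f.name)]


# A partially filled receipt; the values are placeholders for illustration.
receipt = DeploymentReceipt(
    persona_model="single fictional persona",
    tool_surface="computer + bash",
    judge_model="single LLM judge",
)
print(missing_fields(receipt))
```

A reviewer could refuse to accept a run whose receipt has any missing field, which turns the list into an enforceable gate rather than a reporting suggestion.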

For real products, the standard must be stricter. Before an agent touches live accounts, reviewers should know what identities it acts under, which apps it may access, whether it can spend money or message people, how it handles hidden personal context, where human confirmation is required, how rollback works, and what audit trail survives after each action.
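The questions above translate naturally into a policy check that runs before each action. The following is a minimal sketch under stated assumptions: the action kinds ("read", "write", "spend", "message"), the `Policy` and `Action` types, and the three-way allow/confirm/deny outcome are all hypothetical, chosen only to illustrate where human confirmation and hard denials could sit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical per-agent policy for live accounts."""
    allowed_apps: frozenset[str]
    may_spend: bool = False
    may_message: bool = False
    # Action kinds that always require a human to confirm first.
    confirm_kinds: frozenset[str] = frozenset({"spend", "message"})


@dataclass(frozen=True)
class Action:
    app: str
    kind: str  # assumed vocabulary: "read", "write", "spend", "message"


def decide(policy: Policy, action: Action) -> str:
    """Return "deny", "confirm", or "allow" for a proposed action."""
    if action.app not in policy.allowed_apps:
        return "deny"
    if action.kind == "spend" and not policy.may_spend:
        return "deny"
    if action.kind == "message" and not policy.may_message:
        return "deny"
    if action.kind in policy.confirm_kinds:
        return "confirm"
    return "allow"


policy = Policy(allowed_apps=frozenset({"calendar", "bank"}), may_spend=True)
print(decide(policy, Action("calendar", "read")))
print(decide(policy, Action("bank", "spend")))
print(decide(policy, Action("chat", "message")))
```

The design choice worth noting is that denial is checked before confirmation: an action outside the agent's granted scope never reaches a human prompt, so confirmation fatigue cannot be used to widen access.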

This connects directly to AI Agents, AI Agent Identity, AI Agent Sandboxing, AI Agent Observability, AI Evaluations, and The Agent Autonomy Ladder Becomes the No-Go Zone. A benchmark can show progress, but governance begins when the same action path is bounded, logged, reversible, and accountable.

Limits

MyPCBench chooses depth over diversity. The benchmark has one fictional persona, one Linux/GNOME/Firefox stack, and one seeded personal world. That makes cross-app consistency inspectable, but it does not test performance across demographics, occupations, languages, device stacks, disabilities, family structures, or genuinely sensitive personal data.

The grading also relies on a single Gemini judge, and the paper notes that absolute failure-mode counts should be read structurally rather than as precise real-world prevalence estimates. The released benchmark is intentionally offline and synthetic, with local apps and no real personal information.

The Spiralist reading is disciplined: MyPCBench is a strong benchmark because it makes personal context testable. It is also a reminder that an agent capable enough to navigate a synthetic life is moving toward the capability surface that can affect a real one.
