The Human-Agent Pair Becomes the Skill Rating
The CollabSkill paper matters because it moves agent evaluation out of the solo run and into the paired work session. The question is no longer only whether an agent can finish a task alone, but whether a worker-agent pair can produce inspectable work on occupational tasks.
Pair, Not Model
Most agent benchmarks ask whether the machine can complete a task after receiving an instruction. That is useful, but it is a narrow picture of workplace deployment. Agents are often used through dialogue, delegation, correction, and partial automation. The unit of performance is not just a model. It is a human, an interface, a task, a tool surface, and a record of back-and-forth work. CollabSkill treats that pairing as something to measure directly, rather than something to infer from solo leaderboards.
The Paper
arXiv lists CollabSkill: Evaluating Human-Agent Collaboration On Real-World Tasks as arXiv:2606.09833v1 [cs.HC], submitted April 20, 2026. The authors are Yijia Shao, Zora Z. Wang, Neel Ahuja, Yicheng Wang, Bowen Liu, and Diyi Yang; the paper lists Stanford University, Carnegie Mellon University, the University of Rochester, and one individual researcher.
The paper introduces CollabSkill as a framework for evaluating human-agent collaboration on real-world occupational tasks. It pairs real workers with agents on tasks matched to their occupational background. The infrastructure links prompts, reference files, and deliverables to O*NET occupational categories, then grades open-ended deliverables with rubric generation and multi-agent scoring.
The study collected 386 working sessions, over 1,500 user prompts, and contributions from 93 human workers across 10 O*NET sectors. Participants averaged 9.6 years of professional experience in their selected occupation, and 76 completed both pre-study and post-study surveys. The benchmark is also a record of how people with domain backgrounds try to use agents.
What It Measures
Raw task outcomes are messy in a collaboration setting. A score can reflect the agent's planning, the worker's task knowledge, the worker's AI fluency, the interface, the prompt history, or the grading method. CollabSkill addresses this with a Bayesian skill rating system inspired by the rating systems used for multiplayer games. The model estimates latent skill contributions for both agents and humans from team outcomes.
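One way to picture estimating latent contributions from team outcomes is an additive Gaussian model. The sketch below is an illustration under stated assumptions, not the paper's model: it treats a session score as agent skill plus human skill plus noise, and applies the marginal Gaussian update after one observed score. The `Skill` and `update_pair` names, the noise variance, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    mean: float
    var: float  # variance of the current belief about this skill


def update_pair(agent: Skill, human: Skill, score: float, noise_var: float = 1.0):
    """Update both beliefs after one team outcome.

    Assumed model (hypothetical): score = agent + human + Gaussian noise.
    Returns the marginal posteriors; the correlation that the observation
    induces between the two skills is ignored in this sketch.
    """
    total_var = agent.var + human.var + noise_var
    surprise = score - (agent.mean + human.mean)

    def posterior(s: Skill) -> Skill:
        gain = s.var / total_var  # share of the surprise this belief absorbs
        return Skill(mean=s.mean + gain * surprise, var=s.var * (1.0 - gain))

    return posterior(agent), posterior(human)


# Hypothetical priors: the agent is better known than the worker.
agent, human = update_pair(Skill(0.0, 1.0), Skill(0.0, 4.0), score=6.0)
```

In this toy setup the less certain contributor absorbs more of the surprise, which is the behavior that lets a rating system split credit across a pair once each member appears in many different pairings.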
The agent table reports a conservative CollabSkill score: the posterior mean minus three standard deviations. An agent with thin or noisy evidence should not outrank a better-supported agent merely because its point estimate is high. Collaboration claims should carry uncertainty, sample composition, and task coverage, not only a leaderboard number.
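The conservative score is simple arithmetic, and a small sketch shows why it can reorder a leaderboard. The `Posterior` type and the numbers below are hypothetical and are not taken from the paper's tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Posterior:
    mean: float
    std: float


def conservative_score(p: Posterior, k: float = 3.0) -> float:
    """Posterior mean minus k standard deviations (k = 3 as described above)."""
    return p.mean - k * p.std


# Hypothetical agents: one with a higher point estimate but weaker evidence.
well_supported = Posterior(mean=30.0, std=1.0)
uncertain = Posterior(mean=32.0, std=3.0)

ranking = sorted(
    {"well_supported": well_supported, "uncertain": uncertain}.items(),
    key=lambda item: conservative_score(item[1]),
    reverse=True,
)
```

Here the uncertain agent leads on the mean (32.0 against 30.0) but trails on the conservative score (23.0 against 27.0), so the better-supported agent ranks first.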
Rankings Shift
Among the five evaluated agents, the paper reports Claude Cowork with the highest estimated CollabSkill score, 76.738. Among terminal-based agents, Claude Code performs best, and the paper reports the probability of Claude Code ranking above Codex as greater than 0.999. This is notable because the authors say the order diverges from their fully autonomous evaluation, where Codex led.
The paper also separates model strength from interface design. Claude Cowork and Claude Code are described as sharing the same underlying LLM in the study, but Claude Cowork ranks above Claude Code with probability greater than 0.999. The governance lesson is direct: if the workplace use case is collaboration, then the interface is not decoration. It is part of the system being evaluated.
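A statement such as "ranks above with probability greater than 0.999" can be read as a comparison of two posterior distributions. The sketch below assumes independent Gaussian posteriors for the two skills, which is an assumption made here for illustration; the paper may compute this quantity differently, for example from posterior samples. The function name and the numbers are hypothetical.

```python
import math


def prob_ranks_above(mean_a: float, std_a: float, mean_b: float, std_b: float) -> float:
    """P(skill_a > skill_b) for independent Gaussian posteriors.

    The difference of two independent Gaussians is Gaussian with
    mean (mean_a - mean_b) and variance (std_a^2 + std_b^2).
    """
    diff_std = math.sqrt(std_a ** 2 + std_b ** 2)
    z = (mean_a - mean_b) / diff_std
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical posteriors: same uncertainty, different means.
p_tie = prob_ranks_above(1.0, 1.0, 1.0, 1.0)
p_clear = prob_ranks_above(5.0, 1.0, 0.0, 1.0)
```

The design point is that a ranking probability depends on the gap relative to the combined uncertainty, so a small gap with tight posteriors can be more decisive than a large gap with loose ones.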
The Worker Is Signal
The human-side result is more important than the agent table. In head-to-head comparisons against the same agent running autonomously, top-quartile workers reached a 74 percent win rate, while bottom-quartile workers reached 27 percent. CollabSkill also reports that observed AI fluency behaviors correlated with session score, while raw user turn count did not show a significant correlation. Longer interaction was not the same as better collaboration.
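The quartile comparison can be sketched as a head-to-head tally. This is a simplification under stated assumptions: it groups sessions rather than workers, counts a win whenever the pair's score exceeds the autonomous score on the same task, and uses invented data. The tuple layout and function name are hypothetical.

```python
def quartile_win_rates(sessions):
    """Win rate of the pair over the solo agent, for bottom and top skill quartiles.

    sessions: list of (worker_skill, pair_score, solo_score) tuples.
    """
    ranked = sorted(sessions, key=lambda s: s[0])
    k = max(1, len(ranked) // 4)  # size of one quartile

    def win_rate(group):
        wins = sum(1 for _, pair, solo in group if pair > solo)
        return wins / len(group)

    return {"bottom": win_rate(ranked[:k]), "top": win_rate(ranked[-k:])}


# Invented sessions: skill estimate, pair score, autonomous score.
sessions = [
    (0.1, 40, 55), (0.2, 60, 55), (0.3, 50, 55), (0.4, 58, 55),
    (0.6, 52, 55), (0.7, 61, 55), (0.8, 70, 55), (0.9, 66, 55),
]
rates = quartile_win_rates(sessions)
```

With this invented data the bottom quartile wins one of two sessions and the top quartile wins both; the figures carry no meaning beyond showing the calculation.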
Survey-linked results point in the same direction. The detailed analysis reports that self-reported LLM familiarity correlated with collaboration skill, with Spearman rho 0.297 and p = 0.010; comfort with delegating work also correlated, with rho 0.238 and p = 0.041. After hands-on collaboration, participants shifted their perception of AI capability toward greater autonomy, while their preferred autonomy level for meaningful work stayed unchanged. Experience can change beliefs about what agents can do without changing what workers want agents to decide.
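Spearman's rho is the Pearson correlation computed on ranks, which makes it suitable for ordinal survey answers such as familiarity ratings. The sketch below implements it with the standard library only and uses invented data; it does not reproduce the paper's values and does not compute a p-value.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: survey familiarity (1-5) against an estimated skill value.
familiarity = [1, 2, 2, 3, 4, 5]
skill = [0.2, 0.1, 0.4, 0.5, 0.3, 0.9]
rho = spearman_rho(familiarity, skill)
```

A rho near 0.3, as reported for LLM familiarity, indicates a modest monotone association: familiarity helps on average, but it leaves most of the variation in collaboration skill unexplained.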
Governance Reading
This page belongs beside enterprise role maps, procedural memory in workplace agents, dialogue-level collaboration measures, AI literacy, and agent evidence trails. The shared problem is evaluation mismatch: a system bought for human work should not be justified only by an autonomous run.
CollabSkill is useful because it makes the pair visible. The worker is not a temporary obstacle between the benchmark and automation. The worker is part of the capability envelope, the risk surface, and the evidence record. Organizations that ignore this will treat training and interface design as afterthoughts.
Limits
The authors are clear about boundaries. Participants were recruited through Upwork and were US-based, so findings about worker CollabSkill and AI literacy carry selection bias. The dataset spans 386 sessions from 93 workers and 10 of 20 O*NET sectors, broader than many software-only studies but not representative of the full occupational landscape. The scalar task score cannot capture every quality of human work, including creativity, reaction under pressure, and interpersonal communication. Cost-effectiveness is left to future work.
The paper says CollabSkill reflects the state of agents as of early 2026. It also cautions against using estimated human skill scores to screen or evaluate individual workers for employment or other consequential decisions. That warning should travel with the benchmark. A collaboration score can help audit systems and study AI literacy. It should not become a quiet labor-rating machine.
Collaboration Receipt
A human-agent collaboration receipt should record: task source, occupational category, worker background match, agent version, interface type, tools available, prompt and artifact history, rubric, grader configuration, human interventions, failed attempts, score, uncertainty, and survey fields. The audit-grade sentence is: this outcome came from this pair, on this task, through this interface.
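The receipt fields listed above map naturally onto a small record type. The sketch below is one possible shape, not a schema from the paper: the class name, field types, and example values are all hypothetical, and a real deployment would need to decide how much prompt history to store and how to protect worker identity.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CollaborationReceipt:
    task_source: str
    occupational_category: str
    worker_background_match: bool
    agent_version: str
    interface_type: str
    tools_available: list[str]
    rubric: str
    grader_configuration: str
    score: float
    uncertainty: float
    prompt_and_artifact_history: list[str] = field(default_factory=list)
    human_interventions: list[str] = field(default_factory=list)
    failed_attempts: list[str] = field(default_factory=list)
    survey_fields: dict[str, str] = field(default_factory=dict)

    def audit_sentence(self) -> str:
        """Render the pair, task, and interface in one line for audit logs."""
        return (
            f"This outcome came from a worker paired with {self.agent_version}, "
            f"on a task from {self.task_source}, through {self.interface_type}."
        )


# Placeholder values only; none of these come from the study.
receipt = CollaborationReceipt(
    task_source="example-task-bank",
    occupational_category="example O*NET category",
    worker_background_match=True,
    agent_version="example-agent-1.0",
    interface_type="terminal",
    tools_available=["file_editor", "shell"],
    rubric="example rubric v1",
    grader_configuration="example grader config",
    score=0.72,
    uncertainty=0.08,
)
serialized = json.dumps(asdict(receipt), indent=2)
```

Serializing the receipt to JSON keeps it portable across audit tools, and storing the score together with its uncertainty and grader configuration prevents the number from circulating without its context.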
Sources
- Yijia Shao, Zora Z. Wang, Neel Ahuja, Yicheng Wang, Bowen Liu, and Diyi Yang, CollabSkill: Evaluating Human-Agent Collaboration On Real-World Tasks, arXiv:2606.09833v1 [cs.HC], submitted April 20, 2026.
- Primary arXiv versions checked: PDF and experimental HTML, reviewed for title, authorship, date, participant count, working-session count, O*NET coverage, Bayesian rating method, agent rankings, human-side findings, and limitations.
- Related pages: The Enterprise Role Matrix Becomes the AI-Native Work Map, The Workplace Skill Becomes Procedural Memory, The Dialogue Dynamics Become the Collaboration Meter, AI Literacy, and The Agent Breadcrumb Becomes the Oversight Trail.