Published: June 25, 2026

The Coaching Agent Becomes the Grounding Gap

Automation can finish the task while leaving the person less capable. Coaching agents need receipts for what they teach, what they merely direct, and what they fail to see on the screen.

The Paper

The paper is Meng Chen, Anya Ji, Tsung-Han Wu, Tobias Maringgele, David M. Chan, Alane Suhr, and Amy Pavel's DigitalCoach: Communication and Grounding Gaps in Human and Agentic Computer Use Coaching, arXiv:2606.31980 [cs.CL, cs.AI, cs.HC]. The arXiv record lists version 1 as submitted on June 30, 2026. The PDF is 29 pages, and the title page lists University of California, Berkeley and Technical University of Munich affiliations.

The useful question in the paper is not whether an agent can complete a software task for a novice. It is whether an agent can help a novice become better at using the software. That difference matters. A model that drives the interface for the person may solve the immediate task while bypassing the skill the interface was supposed to teach.

What DigitalCoach Records

DigitalCoach is built from 72 human expert-novice computer-use coaching sessions. The authors report 22,752 dialogue turns grounded in 28.1 hours of screen and input-event recordings across five software applications: Excel, FL Studio, Blender, Figma, and Onshape. The dataset also includes 39,609 input events and 36,724 file snapshots. Participants included 20 English-speaking software experts and 20 English-speaking novices, paired across productivity, creativity, and engineering tasks.

The design is important because the unit of evidence is not only chat. A coach has to notice what the learner is looking at, what has already happened, which menu or object is visible, and when a learner needs a concept rather than another instruction. The study also uses matched pre- and post-tasks, so the evaluation can ask whether learners transfer skills rather than merely follow steps.

The Coaching Gap

The authors evaluate six multimodal models as computer-use coaches: GPT-5.4, Gemini-3-Flash, Gemini-3.1-Pro, Claude-Sonnet-4.6, Qwen-3-VL-Instruct, and Llama-4-Scout. In automated evaluation, model coaching differs from human coaching in kind, not just polish. The models produce more direct instructions, fewer explanations, fewer error diagnoses, and fewer knowledge-check questions.

That shows up in the dialogue-act numbers. In sampled human sessions, human coaches rely mostly on Action Directives (37 percent) and Inform acts (29 percent). In model-human interactive sessions, models use Action Directives at 63 percent, while Inform acts fall to 14 percent. Info Request acts, which can check the learner's state, drop to 2 percent from 7 percent for human coaches.
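To make the dialogue-act comparison concrete, here is a minimal Python sketch of how an act mix could be computed from labeled coach turns and compared against a reference mix. The label strings, the function names, and the ten-turn sample are assumptions for illustration; they are not the paper's annotation scheme or code. Only the percentages in REPORTED_HUMAN and REPORTED_MODEL come from the figures quoted above.

```python
from collections import Counter

# Percentages quoted in this post for three dialogue acts.
REPORTED_HUMAN = {"action_directive": 37.0, "inform": 29.0, "info_request": 7.0}
REPORTED_MODEL = {"action_directive": 63.0, "inform": 14.0, "info_request": 2.0}


def dialogue_act_mix(acts):
    """Return each act's share of the labeled coach turns, in percent."""
    counts = Counter(acts)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {act: 100.0 * n / total for act, n in counts.items()}


def mix_shift(model_mix, human_mix):
    """Percentage-point difference (model minus human) for each human act."""
    return {
        act: model_mix.get(act, 0.0) - human_mix.get(act, 0.0)
        for act in human_mix
    }


# Hypothetical labels for ten coach turns (illustrative, not from the dataset).
sample_turns = (
    ["action_directive"] * 6 + ["inform"] * 2 + ["info_request"] + ["other"]
)
sample_mix = dialogue_act_mix(sample_turns)

# Shift implied by the quoted figures: directives up, informs and requests down.
reported_shift = mix_shift(REPORTED_MODEL, REPORTED_HUMAN)
```

On the quoted figures, the shift is 26 points more Action Directives, 15 points fewer Inform acts, and 5 points fewer Info Requests. A deployment could track the same three numbers per session to see whether a coach is drifting toward pure instruction.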

Learning outcomes follow the same pattern. Both human and model coaching improve matched pre/post task scores, but the paper reports a larger mean gain for human coaching, from 33.49 percent to 88.24 percent (about 55 percentage points), than for model coaching, from 13.33 percent to 45.00 percent (about 32 percentage points). In the model condition, fewer sessions improved, some made no progress, and one declined.
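The two conditions start from different pre-task means, so it can help to look at the gain in two ways. The sketch below computes the absolute gain in percentage points and a normalized gain (the share of the available headroom that was closed). Normalized gain is a common education-research convention that this post adds for illustration; it is not presented here as a metric from the paper, and the function names are hypothetical. The four scores are the means quoted above.

```python
def absolute_gain(pre, post):
    """Gain in percentage points between pre- and post-task scores."""
    return post - pre


def normalized_gain(pre, post, max_score=100.0):
    """Fraction of the remaining headroom (max_score - pre) that was gained.

    Assumption: scores are percentages with a ceiling of max_score.
    """
    headroom = max_score - pre
    if headroom <= 0:
        raise ValueError("pre-task score leaves no headroom to improve")
    return (post - pre) / headroom


# Mean pre/post scores quoted in this post, in percent.
human_pre, human_post = 33.49, 88.24
model_pre, model_post = 13.33, 45.00

human_abs = round(absolute_gain(human_pre, human_post), 2)
model_abs = round(absolute_gain(model_pre, model_post), 2)
human_norm = normalized_gain(human_pre, human_post)
model_norm = normalized_gain(model_pre, model_post)
```

Under these assumptions the human condition gains 54.75 points against 31.67 for the model condition, and the normalized view widens the gap further, at roughly 0.82 versus 0.37 of the available headroom.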

Grounding Is the Lesson

The result that should worry anyone building teaching agents is the visual-context result. The paper reports that removing visual input and keeping only text causes a modest score change for most models, while keeping only visual input causes a sharp drop. That pattern suggests the models rely more on dialogue history than on the learner's evolving screen state, so their advice can sound locally plausible without being grounded in what is on screen.

For software coaching, that is not a minor interface flaw. The screen is the curriculum. A human coach can say where to look, identify a hidden button, notice a mistaken state, slow down, explain why a command matters, or ask the learner to describe what they see. A model that defaults to a generic next step can train dependence: the learner follows instructions, but the procedure does not become their own reusable knowledge.

The Coaching Receipt

A coaching-agent receipt should record more than task completion. It should include the task family, software version, model, prompt, context window, screen-sampling rate, tool access, whether the model can point or annotate, dialogue-act mix, explanation rate, knowledge-check rate, error-diagnosis rate, learner questions, pre/post task results, and cases where the model gave confident advice without enough screen evidence.
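One way to picture such a receipt is as a small structured record. The Python sketch below mirrors the fields listed above; every field name, type, unit, and example value is an assumption made for illustration, not a schema proposed by the paper or used by any existing tool.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class CoachingReceipt:
    """Hypothetical per-session record for a coaching agent."""
    task_family: str
    software_version: str
    model: str
    prompt: str
    context_window_tokens: int
    screen_sampling_hz: float          # screenshots per second shown to the model
    tool_access: List[str]
    can_point_or_annotate: bool
    dialogue_act_mix: Dict[str, float]  # act label -> percent of coach turns
    explanation_rate: float
    knowledge_check_rate: float
    error_diagnosis_rate: float
    learner_questions: int
    pre_task_score: Optional[float] = None
    post_task_score: Optional[float] = None
    # Turn indices where advice was confident but screen evidence was lacking.
    ungrounded_advice_turns: List[int] = field(default_factory=list)

    def learning_gain(self) -> Optional[float]:
        """Post minus pre score, or None if either task was not scored."""
        if self.pre_task_score is None or self.post_task_score is None:
            return None
        return self.post_task_score - self.pre_task_score

    def to_json(self) -> str:
        """Serialize the receipt so it can be stored next to session logs."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Illustrative values only; none of these numbers come from the paper.
receipt = CoachingReceipt(
    task_family="spreadsheet-formulas",
    software_version="example-app 1.0",
    model="example-model",
    prompt="You are a patient software coach.",
    context_window_tokens=32000,
    screen_sampling_hz=0.5,
    tool_access=["screenshot"],
    can_point_or_annotate=False,
    dialogue_act_mix={"action_directive": 60.0, "inform": 20.0},
    explanation_rate=0.2,
    knowledge_check_rate=0.05,
    error_diagnosis_rate=0.1,
    learner_questions=4,
    pre_task_score=20.0,
    post_task_score=50.0,
    ungrounded_advice_turns=[12, 31],
)
```

Keeping pre/post scores and ungrounded-advice turns in the same record as the configuration fields is the design point: a reviewer can see what the learner gained and under what grounding conditions, without relying on task completion alone.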

That receipt also changes the governance question. If an employer, school, or platform deploys an agent as a tutor, the evidence cannot stop at "users completed the task." The relevant claim is whether users retained skill, understood what they did, could recover from mistakes, and had a path to question or override bad guidance.

Limits

The authors name several limits. Sessions were collected on a Windows laptop, the dataset remains modest in size, spoken dialogue segmentation is ambiguous, and pre/post tasks do not fully measure long-term retention. The paper also notes that LLM-based annotation can be inaccurate. Those limits make the evidence narrower, but they do not weaken the central warning: a coaching agent that cannot ground itself in the learner's screen may teach compliance instead of competence.

Sources

Meng Chen, Anya Ji, Tsung-Han Wu, Tobias Maringgele, David M. Chan, Alane Suhr, and Amy Pavel, DigitalCoach: Communication and Grounding Gaps in Human and Agentic Computer Use Coaching, arXiv:2606.31980 [cs.CL, cs.AI, cs.HC], submitted June 30, 2026.
arXiv HTML for DigitalCoach: Communication and Grounding Gaps in Human and Agentic Computer Use Coaching, checked for abstract, dataset design, model evaluation, learning outcomes, ethical considerations, and limitations.
arXiv PDF for DigitalCoach: Communication and Grounding Gaps in Human and Agentic Computer Use Coaching, checked for title-page metadata, author affiliations, page count, tables, figures, and body text.
DigitalCoach project page, checked as the data and code link provided by the paper.
