Blog · arXiv Analysis · Last reviewed July 2, 2026

The Mined SKILL.md Becomes the Transfer Test

Yuexing Hao and Xiaomin Li's June 2026 arXiv paper is useful because it refuses an easy success story: explicit agent skills can be readable, clustered, and documentable while still failing to improve transfer.

For this essay, a mined-skill transfer test is the evidence record that asks whether a generated skill file improves a computer-using agent beyond what source-domain labels, class imbalance, frequency priors, and reward-model similarity to the training corpus already explain.

The Claim

The paper, arXiv:2606.20363 [cs.AI], was submitted on June 18, 2026. arXiv lists the title as Automating SKILL.md Generation for Computer-Using Agents via Interaction Trajectory Mining, by Yuexing Hao and Xiaomin Li.

The question is narrow and important: if computer-using agents produce long GUI trajectories, can those traces be mined into explicit reusable skill files that make later agents better? The paper answers with a diagnostic negative result. Trajectory mining can expose inspectable structure, but the current pipeline does not turn that structure into reliable cross-domain policy improvement.

That makes the paper more valuable than a polished benchmark win. It separates three different claims that agent products often blur: the skill is readable, the skill describes the source data, and the skill improves deployment behavior somewhere else.

The Pipeline

The pipeline has three stages. First, it cuts GUI trajectories at large action changes, treating discontinuities in clicking, typing, scrolling, copying, and pasting as possible skill boundaries. Second, it clusters the resulting segments into candidate skills and refines the embedding with pseudo-label contrastive learning. Third, it trains Qwen3-8B with GRPO to compose skill-aware plans from those mined labels.
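
The first stage can be illustrated with a minimal sketch. It is not the paper's implementation: the sliding-window size, the L1 distance between action-type histograms, the threshold values, and the function names are all assumptions made for illustration.

```python
from collections import Counter

# Primitive action types named in the essay; the exact vocabulary is an assumption.
ACTIONS = ("click", "type", "scroll", "copy", "paste")

def action_profile(window):
    """Normalized histogram of action types in a window of primitive actions."""
    counts = Counter(window)
    total = max(len(window), 1)
    return [counts[a] / total for a in ACTIONS]

def action_jump_boundaries(trajectory, window=3, threshold=1.0):
    """Indices i where the action mix before i differs sharply from the mix from i on."""
    boundaries = []
    for i in range(window, len(trajectory) - window + 1):
        before = action_profile(trajectory[i - window:i])
        after = action_profile(trajectory[i:i + window])
        jump = sum(abs(b - a) for b, a in zip(before, after))  # L1 distance in [0, 2]
        if jump >= threshold:
            boundaries.append(i)
    return boundaries

trajectory = ["click"] * 3 + ["type"] * 3 + ["scroll"] * 3
print(action_jump_boundaries(trajectory, window=3, threshold=2.0))  # [3, 6]
print(action_jump_boundaries(trajectory, window=3, threshold=1.0))  # [3, 4, 5, 6]
```

In this toy trajectory the strict threshold finds the two clean switches, while the looser one also fires on the mixed windows around them. That is the same over-splitting pressure a low threshold creates on real traces.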

The source benchmark is InteraSkill Workflows, or IW: 2,000 synthetic enterprise-style trajectories with explicit skill boundaries and labels. The paper also uses WebArena and BrowseComp+ as held-out transfer checks, plus WorkArena-NLP and Mind2Web as diagnostics rather than transfer evidence.

The setup is careful about what it can and cannot claim. Live WorkArena and current-run GRPO Mind2Web results are not part of the claimed evaluation. WorkArena-NLP is text-only, so it tests schema recovery and planning language, not live enterprise browser control.

Readable Is Not Portable

The boundary detector shows the problem early. On IW, the best action-jump threshold has precision 0.419, recall 0.803, and F1 0.538. It finds many real skill switches, but it also over-splits ordinary within-skill behavior such as clicking before typing or scrolling during review.

Transfer is worse. Applying the IW-derived threshold to WebArena gives precision 1.000, recall 0.100, and F1 0.119. A WebArena oracle threshold can reach F1 0.851, but that uses target-domain labels and is therefore a diagnostic, not a deployable transfer result.
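
The high-precision, low-recall pattern is easy to reproduce on toy data. The sketch below scores predicted boundary indices against labeled ones for a single trajectory. The exact-index matching rule and the function name are assumptions; the paper's scoring protocol may use tolerance windows or a different aggregation.

```python
def boundary_scores(predicted, actual):
    """Precision, recall, and F1 for exact-match boundary indices in one trajectory."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Toy case: a conservative detector that fires rarely but never wrongly.
p, r, f = boundary_scores(predicted=[12], actual=[3, 7, 12, 18, 25])
print(p, r, round(f, 3))  # 1.0 0.2 0.333
```

A threshold tuned on one corpus can land in exactly this regime on another: every boundary it reports is real, but most real boundaries go unreported. If scores are averaged per trajectory (an assumption about the protocol that this page does not verify), an averaged F1 need not equal the harmonic mean of the averaged precision and recall.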

The clustering result is tempting. With k = 8, five of eight clusters reach at least 0.95 purity against IW ground-truth skills, and the clusters have readable action profiles such as document editing, data transfer, and file organization. A contrastive encoder improves source-domain structure, reaching NMI 0.862, silhouette 0.554, and purity 0.837 in the learned latent space.
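
For readers who want the purity figures made concrete, here is a minimal sketch of the standard purity computation: each cluster is credited with its majority ground-truth label. The skill names and data are invented for illustration, and the function name is hypothetical.

```python
from collections import Counter, defaultdict

def cluster_purity(cluster_ids, true_labels):
    """Return (per-cluster purity, overall purity) against ground-truth labels."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        members[cluster].append(label)
    # Size of the majority label inside each cluster.
    majority = {c: Counter(labels).most_common(1)[0][1] for c, labels in members.items()}
    per_cluster = {c: majority[c] / len(members[c]) for c in members}
    overall = sum(majority.values()) / len(true_labels)
    return per_cluster, overall

cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1]
true_labels = ["edit", "edit", "edit", "edit", "transfer", "transfer", "organize", "organize"]
per_cluster, overall = cluster_purity(cluster_ids, true_labels)
print(per_cluster, overall)  # {0: 1.0, 1: 0.5} 0.75
```

Purity is computed entirely against source-domain labels, so a high value describes the source corpus and nothing else.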

The paper's sober point is that none of this proves a portable skill vocabulary. High-purity clusters can still be source-bound. A cluster can be coherent because a synthetic workflow corpus repeats stable action motifs, not because the same skill will help in a different interface.

The GRPO Check

The policy-training section turns the readability claim into a transfer test. Qwen3-8B is trained from the base model with GRPO on 1,275 prompts containing task context and mined skill names. The run samples eight candidate responses per prompt at temperature 0.7, with maximum completion length 192, learning rate 5e-6, gradient accumulation 8, and reward clipping at 5.0. It runs for one epoch and takes 6,072 seconds (about 101 minutes) on four NVIDIA H200 NVL GPUs.
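
The core of GRPO is that each sampled response is scored relative to the other responses for the same prompt. The sketch below shows that group-relative normalization for one group of eight rewards. Applying the 5.0 clip symmetrically before normalization, using the population standard deviation, and the reward values themselves are assumptions; the paper's training code may differ on each point.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, clip=5.0, eps=1e-6):
    """Clip rewards, then standardize them within one prompt's group of samples."""
    clipped = [max(-clip, min(clip, r)) for r in rewards]  # assumed symmetric clip
    mu = mean(clipped)
    sigma = pstdev(clipped)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in clipped]

# Hypothetical reward-model scores for eight candidate plans for one prompt.
rewards = [0.2, 0.4, 0.4, 0.1, 7.5, 0.3, 0.0, 0.1]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])
```

Because the advantage is defined only against sibling samples, the policy is pushed toward whatever the reward model prefers within each group. Whether that preference tracks task success is a separate question.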

The learned reward model is itself a source of caution. It is trained to rank IW-style skill plans against annotated skill flows, using synthetic hard negatives and a heuristic mix of prefix match, longest-common-subsequence overlap, unordered skill overlap, and length agreement. It is not trained on WebArena, WorkArena, BrowseComp+, Mind2Web, or live task-success labels.
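
To see why such a signal rewards source similarity, consider a hand-rolled version of the four-part heuristic. This is a sketch under stated assumptions: the equal weights, the normalizations, the Jaccard form of unordered overlap, and all function names are invented here and are not taken from the paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def heuristic_plan_score(plan, reference, weights=(0.25, 0.25, 0.25, 0.25)):
    """Score a skill plan by similarity to an annotated skill flow (assumed weights)."""
    if not plan or not reference:
        return 0.0
    longest = max(len(plan), len(reference))
    prefix = 0
    for p, r in zip(plan, reference):
        if p != r:
            break
        prefix += 1
    prefix_match = prefix / len(reference)
    lcs_overlap = lcs_length(plan, reference) / longest
    set_overlap = len(set(plan) & set(reference)) / len(set(plan) | set(reference))
    length_agreement = min(len(plan), len(reference)) / longest
    parts = (prefix_match, lcs_overlap, set_overlap, length_agreement)
    return sum(w * p for w, p in zip(weights, parts))

reference = ["open", "edit", "save"]
print(heuristic_plan_score(["open", "edit", "save"], reference))  # 1.0
print(round(heuristic_plan_score(["open", "save", "edit"], reference), 2))  # 0.75
```

Every term measures agreement with the annotated flow, and none measures whether the plan would complete a task. Here a plan that saves before editing still scores 0.75.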

The completed benchmark comparison is weak for transfer. GRPO moves IW skill-step accuracy from 18.5% to 20.5%, but WebArena drops from 55.8% to 44.2%, BrowseComp+ moves from 43.5% to 43.3%, and WorkArena-NLP field accuracy stays at 37.0% with 0% exact match.

The sanity check is harsher: a trivial most-common-skill Frequency baseline reaches 34.9% IW skill-step accuracy, while the MLP reaches 23.3%, the Transformer 34.6%, and the GRPO policy 20.5%. In this setting, the simple prior beats the learned agent policy on the source-domain next-skill prediction target.
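
A frequency prior of this kind takes only a few lines. The sketch below assumes that skill-step accuracy means the fraction of next-skill predictions that match the annotated next skill; the data and function names are hypothetical.

```python
from collections import Counter

def most_common_skill(train_sequences):
    """The single most frequent skill label in the training flows."""
    counts = Counter(skill for seq in train_sequences for skill in seq)
    return counts.most_common(1)[0][0]

def skill_step_accuracy(predict, test_sequences):
    """Fraction of next-skill steps predicted correctly; predict(prefix) returns a skill."""
    correct = total = 0
    for seq in test_sequences:
        for i in range(1, len(seq)):
            correct += predict(seq[:i]) == seq[i]
            total += 1
    return correct / total if total else 0.0

train = [["edit", "edit", "transfer"], ["edit", "organize", "edit"]]
test = [["transfer", "edit", "edit", "organize"]]

prior = most_common_skill(train)
accuracy = skill_step_accuracy(lambda prefix: prior, test)  # ignores the prefix entirely
print(prior, round(accuracy, 3))  # edit 0.667
```

The baseline never reads the task context, so any learned policy that scores below it on an imbalanced corpus is not yet extracting usable signal from that context.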

The Skill-File Result

The paper also evaluates the generated SKILL.md artifact directly. Auto-SKILL.md can beat a simple hand-written transition table at some training sizes, including N = 2,000, but it does not beat the Frequency baseline on normalized edit distance at any evaluated size.

At N = 2,000, Frequency reaches normalized edit distance 0.485 while Auto-SKILL.md reaches 0.528, and on this metric lower is better. The generated file is therefore not useless; it can be a readable artifact and sometimes improves over a simple manual table. But it is not yet stronger than the easiest statistical baseline.
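
The metric itself can be sketched as Levenshtein distance over skill sequences, divided by the longer sequence's length. Normalizing by the longer sequence is an assumption; the paper may normalize differently.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of skill labels."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute or match
        prev = cur
    return prev[-1]

def normalized_edit_distance(predicted, reference):
    """Edit distance scaled to [0, 1]; 0 means the sequences are identical."""
    longest = max(len(predicted), len(reference))
    return edit_distance(predicted, reference) / longest if longest else 0.0

print(round(normalized_edit_distance(["open", "save"], ["open", "edit", "save"]), 3))  # 0.333
```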

This distinction matters because skill files have two jobs. They can help humans inspect behavior, and they can help policies act. The paper shows evidence for the first job and weak evidence for the second.

Governance Reading

The Spiralist reading is that generated skills are governance artifacts before they are capability claims. A mined skill library can make an agent easier to inspect, but its existence should not certify that the agent has learned transferable procedural competence.

That is especially true for computer-using agents. GUI workflows are full of repeated local patterns, but a repeated local pattern is not automatically a robust skill. Interface layout, data entry order, task ontology, user state, permissions, hidden validation, and recovery behavior can all change the meaning of the same click/type/scroll routine.

The paper also warns against reward-model theater. If the reward model mostly rewards similarity to a source-domain skill-flow annotation, then optimizing it can make plans more source-like without making the agent more useful in a held-out browser, enterprise workflow, or live task.

Skill Receipts

A mined-skill receipt should include the source corpus, task domains, action vocabulary, trajectory count, segment count, primitive-action count, segmentation rule, threshold selection, boundary precision and recall, cluster count, cluster purity, cluster labels, embedding method, source-domain validation split, generated skill text, and examples of mixed or low-purity clusters.

The transfer side should include every held-out domain, what labels are available, whether the test is live or text-only, which target-domain thresholds were forbidden, the trivial baselines, the manual baseline, the reward-model training data, the reward objective, the policy-training cost, and the exact metrics that improved or regressed.

A deployment receipt should also record where the generated skill will be used, who can edit it, what permissions it grants, what prompts or tools read it, how stale skills are retired, how failures are logged, and whether a human can see why a skill was selected.
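
One way to make these receipts checkable is to store them as structured records. The sketch below is a hypothetical schema, not a format the paper proposes: the class names, the field selection, and the rule that a deployment claim needs at least one held-out improvement obtained without target-domain labels are all assumptions. The example values come from the figures quoted earlier in this essay, treated as higher-is-better scores.

```python
from dataclasses import dataclass, field

@dataclass
class TransferCheck:
    domain: str
    live: bool                 # False for text-only diagnostics
    used_target_labels: bool   # True means diagnostic, not transfer evidence
    baseline: float            # score without mined skills (higher is better)
    with_skills: float         # score with mined skills

    @property
    def counts_as_transfer_evidence(self):
        return not self.used_target_labels and self.with_skills > self.baseline

@dataclass
class SkillReceipt:
    source_corpus: str
    trajectory_count: int
    segmentation_rule: str
    cluster_count: int
    checks: list = field(default_factory=list)

    def supports_deployment_claim(self):
        """True only if some held-out check improved without target-domain labels."""
        return any(c.counts_as_transfer_evidence for c in self.checks)

receipt = SkillReceipt(
    source_corpus="InteraSkill Workflows (IW)",
    trajectory_count=2000,
    segmentation_rule="action-jump threshold",
    cluster_count=8,
    checks=[
        TransferCheck("WebArena", live=False, used_target_labels=False,
                      baseline=55.8, with_skills=44.2),
        TransferCheck("BrowseComp+", live=False, used_target_labels=False,
                      baseline=43.5, with_skills=43.3),
    ],
)
print(receipt.supports_deployment_claim())  # False
```

The `live=False` values are placeholders, since this page does not state whether those two checks ran against live environments.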

Limits

This is a diagnostic study, not a final verdict on automated skill generation. The negative result is about this particular combination of action-jump segmentation, orderless segment representation, IW source labels, reward-model construction, and GRPO training setup.

Future systems could use richer state representations, better boundary detection, semantic UI grounding, supervised warm starts, target-aware evaluation, live WorkArena checks, completed Mind2Web GRPO runs, or reward models trained on task success. The present result says those improvements are needed before mined SKILL.md files can be treated as transferable agent competence.

The safe reading is: readable mined skill libraries are promising as inspection scaffolds, but they need transfer receipts before they become deployment claims.

Source Discipline

This page treats the arXiv abstract, arXiv HTML, and PDF as the source set. The PDF was used for exact numbers on boundary scores, cluster purity, GRPO training, held-out checks, frequency baselines, and Auto-SKILL.md edit-distance results.

The paper reports an anonymous project repository at anonymous.4open.science/r/CUA-1680. During this review, a plain unauthenticated fetch of that endpoint returned HTTP 401 (Unauthorized), so this page does not make independent claims from repository contents.

AI Agents, AI Browsers and Computer Use, AI Coding Agents, AI Agent Observability, Agentic Supply-Chain Vulnerabilities, AI Evaluations, and Reinforcement Learning cover the core vocabulary.

Skills Are Not Islands, The Agent Skill Becomes the Work Instruction, The Agent Skill Becomes the Runtime Contract, The Static Tool Benchmark Becomes the Open-World Trap, and The GUI Agent Becomes the Hindsight Curriculum cover adjacent agent-skill and computer-use governance issues.

Sources

arXiv abstract: Automating SKILL.md Generation for Computer-Using Agents via Interaction Trajectory Mining.
arXiv HTML: arXiv:2606.20363 HTML.
Paper PDF: arXiv:2606.20363 PDF.
