Blog · arXiv Analysis · Last reviewed July 2, 2026

The Paired Trajectory Becomes the Skill Audit

SkillAudit treats a skill document as operational code. It improves that document by running the same task with and without the candidate skill, then asking which observed differences justify a localized edit, a commit, or a rollback.

The Paper

The paper is SkillAudit: Ground-Truth-Free Skill Evolution via Paired Trajectory Auditing, arXiv:2606.14239 [cs.AI], by Haowen Gao, Haoran Chen, Can Wang, Shasha Guo, Liang Pang, Zhaoyang Liu, Huawei Shen, and Xueqi Cheng. arXiv lists version 1 as submitted on June 12, 2026, with DOI 10.48550/arXiv.2606.14239.

The affiliations in the arXiv HTML are the State Key Laboratory of AI Safety, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences; and Tongyi Lab, Alibaba Group. The paper's question is practical: how do you keep a procedural agent skill useful after the deployment world changes if you do not have hidden tests, reference solutions, environment rewards, or human validation scripts?

Skill Drift

The paper defines agent skills as structured procedural instruction packages that guide frozen large language model agents through specialized workflows. They are external artifacts, not model weights. A model can stay frozen while the skill document, its helper files, or its routing instructions change.

That separation makes skills easy to deploy, but it also exposes them to drift. Edge cases appear, tools and APIs change, data formats shift, and local deployment constraints emerge only after use. A skill that once helped an agent can become incomplete, noisy, over-specific, or actively harmful.

This is the difference between a prompt and a working procedure. If a skill decides which file to read, which command to run, which schema to emit, and which path is forbidden, then its operational status should be audited the way a lightweight runtime contract is audited.

Paired Audit

SkillAudit's central mechanism is paired trajectory auditing. For each iteration, the same task is executed twice: once with the candidate skill and once without it. The contrast isolates what the skill changed in the agent's behavior without asking for a reward label.

The allowed evidence is deliberately narrow: task description, workspace data, candidate skills, execution trajectories, generated artifacts, and constraints derivable from the task specification. The evolution loop does not access hidden tests, reference solutions, held-out validation scores, task rewards, oracle pass/fail feedback, or human-authored validation scripts.

The important move is not that the system sees two transcripts. It is that the transcripts are tied back to the skill text. A difference in behavior must become a passage-level diagnosis before it can justify editing the skill. The system then gates each iteration with a three-way verdict: skill_helped, skill_hurt, or skill_inert.
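The gate can be pictured as a small function over the two runs. The sketch below is an illustration, not the paper's implementation: `RunEvidence`, `paired_verdict`, and the rule that any lost check outweighs gained ones are assumptions made for the example. Only the three verdict labels come from the paper.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class RunEvidence:
    """Observable outcome of one execution (hypothetical shape)."""
    # Names of task-derived checks that this run's artifacts satisfy.
    satisfied: FrozenSet[str]


def paired_verdict(with_skill: RunEvidence, without_skill: RunEvidence) -> str:
    """Compare the same task run with and without the candidate skill."""
    gained = with_skill.satisfied - without_skill.satisfied
    lost = without_skill.satisfied - with_skill.satisfied
    if lost:
        # Assumption: breaking something the bare agent got right counts as harm,
        # even if the skill also added something.
        return "skill_hurt"
    if gained:
        return "skill_helped"
    return "skill_inert"


with_run = RunEvidence(frozenset({"file_exists", "header_ok"}))
without_run = RunEvidence(frozenset({"file_exists"}))
verdict = paired_verdict(with_run, without_run)
```

No reward label enters the comparison. The verdict depends only on what the two runs visibly produced.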

PACE and the Anchor

Process-Aligned Contrastive Evaluation, or PACE, is the paper's diagnostic layer. PACE compares with-skill and without-skill trajectories at divergence points and emits segment-anchored action_signals and protected_hints. The point is to distinguish useful procedure from distracting or harmful prose.

The full PACE inventory contains 12 evaluator templates across four dimensions. Process Adherence includes eval-procedure-adherence, eval-coverage, eval-tool-use-rationality, and eval-error-robustness. Artifact Evidence includes eval-output-evidence-check and eval-format-compliance. Consistency includes eval-task-alignment, eval-data-consistency, eval-method-adherence, and eval-safety-compliance. Effectiveness Delta includes eval-incremental-value and eval-portfolio-quality.
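The inventory is easier to scan as a mapping. The evaluator and dimension names below are taken from the paragraph above; the variable names `PACE_EVALUATORS` and `total_templates` are only illustrative.

```python
# Dimension -> evaluator templates, as listed in the paper's PACE inventory.
PACE_EVALUATORS = {
    "Process Adherence": [
        "eval-procedure-adherence",
        "eval-coverage",
        "eval-tool-use-rationality",
        "eval-error-robustness",
    ],
    "Artifact Evidence": [
        "eval-output-evidence-check",
        "eval-format-compliance",
    ],
    "Consistency": [
        "eval-task-alignment",
        "eval-data-consistency",
        "eval-method-adherence",
        "eval-safety-compliance",
    ],
    "Effectiveness Delta": [
        "eval-incremental-value",
        "eval-portfolio-quality",
    ],
}

# Count the templates across all dimensions.
total_templates = sum(len(names) for names in PACE_EVALUATORS.values())
```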

PACE is a soft, LLM-judged signal, so the paper pairs it with a fixed structural verifier called the Anchor Verifier. The Anchor Verifier is compiled once from the task specification and then locked for the rest of evolution. It extracts only checkable constraints: required files, exact headers, schemas, enumerated values, numeric fields recomputable from workspace data, and companion files. If the Anchor Verifier detects a regression, the update rolls back even if the PACE signals are positive.
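A minimal sketch of that idea, assuming artifacts are available as a filename-to-text mapping and covering only two of the constraint types (required files and exact CSV headers). `AnchorConstraints`, `failed_checks`, and `is_regression` are hypothetical names, and the paper's verifier may define regression differently.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass(frozen=True)  # frozen: fields cannot be reassigned once compiled
class AnchorConstraints:
    required_files: Tuple[str, ...]
    # Pairs of (filename, expected comma-separated header).
    exact_headers: Tuple[Tuple[str, Tuple[str, ...]], ...]


def failed_checks(anchor: AnchorConstraints, artifacts: Dict[str, str]) -> Set[str]:
    """Return the names of the structural checks that the artifacts fail."""
    failures: Set[str] = set()
    for name in anchor.required_files:
        if name not in artifacts:
            failures.add(f"missing_file:{name}")
    for name, header in anchor.exact_headers:
        lines = artifacts.get(name, "").splitlines()
        first_line = lines[0] if lines else ""
        if tuple(first_line.split(",")) != header:
            failures.add(f"header_mismatch:{name}")
    return failures


def is_regression(anchor: AnchorConstraints,
                  before: Dict[str, str],
                  after: Dict[str, str]) -> bool:
    """Assumption: a regression is any check that passed before and fails after."""
    return bool(failed_checks(anchor, after) - failed_checks(anchor, before))


anchor = AnchorConstraints(
    required_files=("report.csv",),
    exact_headers=(("report.csv", ("id", "severity")),),
)
before = {"report.csv": "id,severity\n1,high\n"}
after = {"report.csv": "ID,Severity\n1,high\n"}
must_roll_back = is_regression(anchor, before, after)
```

In this example the candidate skill changed the header casing, so the update would be rolled back regardless of what the soft evaluators report.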

Refine and Repair

SkillAudit routes each task to one of two edit pipelines. Refine is for a skill whose core is broadly useful but noisy, redundant, or imprecise. It is subtraction-first: delete distractions, clarify local instructions, and protect passages associated with helped behavior.

Repair is for genuine conflict. If the skill prescribes outdated APIs, wrong outputs, incompatible paths, or task-contradicting workflows, Repair can replace or delete the offending passage. It is more permissive than Refine, but every edit must still be anchored to a PACE surgery target and grounded in observed trajectory divergence.
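The difference in permissiveness can be sketched as an admissibility check on a proposed edit. Everything here is illustrative: the operation names, the `Edit` type, and the decision to honor protected segments on both routes are assumptions, not the paper's interface.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Assumed operation sets: Repair allows everything Refine does, plus replacement.
ALLOWED_OPS = {
    "refine": frozenset({"delete", "clarify"}),
    "repair": frozenset({"delete", "clarify", "replace"}),
}


@dataclass(frozen=True)
class Edit:
    op: str          # "delete", "clarify", or "replace"
    segment_id: str  # passage of the skill document the edit touches


def edit_is_admissible(route: str,
                       edit: Edit,
                       surgery_targets: FrozenSet[str],
                       protected: FrozenSet[str]) -> bool:
    """An edit must fit the route, hit a diagnosed target, and spare protected text."""
    if edit.op not in ALLOWED_OPS[route]:
        return False
    if edit.segment_id not in surgery_targets:
        return False  # not anchored to a diagnosed passage
    return edit.segment_id not in protected


targets = frozenset({"s3", "s7"})
protected = frozenset({"s7"})
ok = edit_is_admissible("repair", Edit("replace", "s3"), targets, protected)
```

Under these assumptions the same `replace` edit would be rejected on the Refine route, and any edit to `s7` would be rejected on either route.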

The paper's case studies make the distinction concrete. software-dependency-audit is a Refine case where a useful skill is pruned and de-hardcoded. data-to-d3 is a Repair case where conflicting D3 guidance is removed and a small verbatim-label reminder survives. Appendix cases include lab-unit-harmonization, where the unit-mapping table is protected, and exceltable-in-ppt, where a formula-destroying workflow step is rolled back and deleted.

Results

The benchmark uses 89 runnable SkillsBench tasks across 8 professional domains, executed inside Harbor containers. Evolution and evaluation are separated: the evolution loop runs in a stub container without access to pytest verifier content, and the real verifier executes only after the loop terminates.

The reported average task reward is 73.9 percent for SkillAudit, compared with 40.9 percent for no skill and 56.7 percent for the static expert skill shipped with the benchmark. That is a +33.0 point gain over no skill and a +17.2 point gain over the static skill. The paper reports improvement over the static skill in seven of the eight domains and a match in Finance and Economics.
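The two gains follow directly from the three reported averages. A quick check, using only the figures quoted above (the dictionary keys are just labels for this example):

```python
# Average task reward in percent, as reported in the paper.
rewards = {"skillaudit": 73.9, "no_skill": 40.9, "static_skill": 56.7}

# Differences are in percentage points; round to one decimal to absorb float noise.
gain_over_no_skill = round(rewards["skillaudit"] - rewards["no_skill"], 1)
gain_over_static = round(rewards["skillaudit"] - rewards["static_skill"], 1)
```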

The stronger finding is the observability boundary. Skills evolve well when their useful knowledge leaves inspectable traces in artifacts, file structures, schemas, numbers, runtime errors, or other concrete outputs. They evolve less well when the relevant knowledge is semantic, procedural, or judgment-heavy and does not leave a structural mark the auditor can read.

The structural analysis is also useful for practitioners. Successful edits tend to strip tutorial prose, remove off-domain bundled skills, de-hardcode paths and versions, inline constraints near the step they govern, add exact I/O contracts, and create a navigation layer for multi-file skills. In the authors' reading, a good skill is not a tutorial. It is a verifier-observable execution contract.

Governance Standard

A self-evolving skill system should ship a skill evolution receipt. The receipt should include the task description; workspace hash; initial skill and candidate skill; with-skill and without-skill trajectories; generated artifacts; PACE evaluator versions, segment references, action_signals, and protected_hints; Anchor Verifier constraints and results; the decision (commit, defer, or rollback); the route (Refine or Repair); skill diff; rollback version; model version; container image; domain label; iteration count; and the final evaluator status withheld from the evolution loop.
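As a data structure, such a receipt might look like the sketch below. It carries only a subset of the fields listed above, and the class name, field names, and JSON serialization are my assumptions. The paper does not define a receipt format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List

DECISIONS = ("commit", "defer", "rollback")
ROUTES = ("refine", "repair")


@dataclass(frozen=True)
class SkillEvolutionReceipt:
    """Hypothetical record of one evolution iteration (subset of fields)."""
    task_description: str
    workspace_hash: str
    skill_diff: str
    route: str
    decision: str
    anchor_results: Dict[str, bool]  # constraint name -> passed?
    action_signals: List[str]
    protected_hints: List[str]
    model_version: str
    container_image: str
    iteration_count: int

    def __post_init__(self) -> None:
        # Reject receipts whose route or decision is outside the known vocabulary.
        if self.route not in ROUTES:
            raise ValueError(f"unknown route: {self.route!r}")
        if self.decision not in DECISIONS:
            raise ValueError(f"unknown decision: {self.decision!r}")

    def to_json(self) -> str:
        """Serialize with sorted keys so two receipts can be diffed as text."""
        return json.dumps(asdict(self), sort_keys=True)


receipt = SkillEvolutionReceipt(
    task_description="example task",
    workspace_hash="<hash>",
    skill_diff="- removed tutorial paragraph",
    route="refine",
    decision="commit",
    anchor_results={"missing_file:report.csv": True},
    action_signals=["s3: delete"],
    protected_hints=["s7"],
    model_version="<model>",
    container_image="<image>",
    iteration_count=1,
)
```

Validating the route and decision at construction time keeps malformed receipts out of the audit trail rather than leaving them to be discovered during a later review.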

That receipt matters because the system is changing a document that changes agent behavior. A changed skill can alter tool choice, retry behavior, output shape, file access, and safety constraints. The governance question is not whether the final score improved. It is whether the institution can explain why a given passage was deleted, protected, rewritten, or rolled back before the evolved skill becomes reusable infrastructure.

This connects directly to AI Agents, AI Agent Observability, AI Audits and Assurance, AI Audit Trails, The Agent Skill Becomes the Runtime Contract, The Workplace Skill Becomes Procedural Memory, The Harness Becomes the Runtime Contract, The Agent Rulebook Leaves the Prompt, The Reasoning Tree Becomes the Commit Log, and The Agent Log Becomes the Receipt. Agent behavior becomes governable when the procedure, trace, edit, verifier, and rollback rule are part of the same record.

Limits

SkillAudit is label-free during evolution, not evidence-free. Its evidence is observable behavioral contrast. That creates a real boundary: if the difference between useful and harmful skill text does not surface in a trajectory, artifact, Anchor check, or runtime outcome, the system has little basis for a good edit.

The Anchor Verifier is intentionally narrow. That reduces false rollbacks, but it also means incomplete constraints can miss harm. PACE can identify divergence and localize likely causes, but it remains LLM-generated judgment. The fixed verifier controls drift only for constraints that are derivable and encoded.

Paired execution is also expensive. Every iteration requires with-skill and without-skill runs, and uninformative without-skill runs can make contrast weak. The paper handles degenerate without-skill trajectories by nulling the incremental signal and disabling swap-style gap filling, but the edit loop then has less evidence to work with.
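One way to picture that fallback, with heavy caveats: the sketch takes the degeneracy judgment as an input because the detection criterion is not described here, and it treats the incremental signal as a difference of two evaluator-assigned numbers, which is an assumption. `ContrastSignals` and `build_contrast` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContrastSignals:
    incremental_value: Optional[float]  # None means no usable contrast
    gap_fill_enabled: bool              # whether swap-style gap filling may run


def build_contrast(with_score: float,
                   without_score: float,
                   without_is_degenerate: bool) -> ContrastSignals:
    """Null the incremental signal when the without-skill run is uninformative."""
    if without_is_degenerate:
        return ContrastSignals(incremental_value=None, gap_fill_enabled=False)
    return ContrastSignals(
        incremental_value=with_score - without_score,
        gap_fill_enabled=True,
    )


weak = build_contrast(0.8, 0.0, without_is_degenerate=True)
```

Using `None` rather than `0.0` keeps "no evidence" distinct from "evidence of no difference", which matters for anything downstream that reads the signal.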

The final benchmark rewards are used for evaluation after evolution, not as signals during evolution. That distinction is the paper's central claim and also a deployment caution: a real organization still needs external validation before trusting an evolved skill, especially where the task's quality dimensions are hidden, semantic, or safety-critical. At review time, I found arXiv, PDF, HTML, and paper-indexing pages, but no official code repository linked from the arXiv record.
