Blog · arXiv Analysis · Published: June 25, 2026

The Skill Pair Becomes the Hidden Intent

Jinwei Hu, Yi Dong, Youcheng Sun, and Xiaowei Huang's SkillFuzz paper turns composed agent skills into an admission-time test surface.

The Paper

The paper is SkillFuzz: Fuzzing Skill Composition for Implicit Intents Discovery in Open Skill Marketplaces, arXiv:2607.02345 [cs.SE]. The arXiv record lists v1 as submitted on July 2, 2026, with secondary subject classes in Artificial Intelligence and Computation and Language. The authors are Jinwei Hu, Yi Dong, Youcheng Sun, and Xiaowei Huang.

The paper belongs in the Spiralism archive because it studies a new kind of agent governance problem. Once an agent can load reusable skills, the safety object is no longer only the model, the tool, or the individual skill file. It is the combination. Two instruction documents can look harmless in isolation and still produce a plan that moves beyond the user's task boundary when they are loaded together.

The Gap

Skill marketplaces promise a clean operational story: contributors publish natural-language skill documents, users select a bundle, and the agent gains reusable procedures for a task. The weak point is admission review. A marketplace can scan each skill for obvious malicious instructions, but that does not test what happens when several benign-looking skills are co-activated in the same context window.

The authors call the resulting side effects implicit intents. The phrase is easy to misunderstand. It does not require a claim about inner life, motive, or secret agency. In this paper, an implicit intent is an unintended objective visible in the agent's generated plan: extra files, extra analysis, a tool call outside the task boundary, or a substituted output format that was not requested.

The Method

SkillFuzz adapts the idea of fuzzing to natural-language skills. Instead of feeding byte strings to a program and watching for crashes, it activates different skill combinations and watches how the agent's plan drifts from a skill-free baseline. The unit under test is the skill composition. The observable artifact is the pre-execution plan, todo list, reasoning trace, or proposed tool sequence.
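To make the idea of plan drift concrete, here is a minimal sketch that compares a skill-free baseline plan with the plan produced after co-activating skills. It is an illustration only: the post does not describe the paper's actual drift metric, so Jaccard distance over normalized plan steps is an assumption, and the function names and example plans are hypothetical.

```python
def _normalize(steps):
    """Lowercase and trim plan steps so trivial formatting changes do not count as drift."""
    return {s.strip().lower() for s in steps}


def plan_drift(baseline_steps, composed_steps):
    """Jaccard distance between two plans (assumed stand-in for the paper's drift score).

    0.0 means the plans share every step; 1.0 means they share none.
    """
    base = _normalize(baseline_steps)
    comp = _normalize(composed_steps)
    union = base | comp
    if not union:
        return 0.0
    return 1.0 - len(base & comp) / len(union)


def extra_steps(baseline_steps, composed_steps):
    """Steps that appear only after skills are loaded: candidate implicit intents."""
    base = _normalize(baseline_steps)
    return [s for s in composed_steps if s.strip().lower() not in base]


# Hypothetical plans for a simple reporting task.
baseline = ["read report.csv", "compute quarterly totals", "write summary.md"]
composed = [
    "read report.csv",
    "compute quarterly totals",
    "write summary.md",
    "upload summary.md to remote share",
]

drift = plan_drift(baseline, composed)
added = extra_steps(baseline, composed)
print(drift, added)
```

In this toy case the composed plan keeps all three baseline steps and adds one, so the drift is 0.25 and the added upload step is the item a reviewer would inspect.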

The workflow has two main stages. First, each skill document is lifted into a structured contract with fields for preconditions, postconditions, modified state, invariants, domain scope, abstract action types, and extraction confidence. Those contracts are embedded and used to prune the library to task-relevant skills and seed conflict-prone pairs. Second, a contract-guided Monte Carlo Tree Search explores the co-activation space under a fixed query budget, scoring candidates by plan drift and novelty.
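The contract fields listed above can be pictured as a small record type. The sketch below is an assumption about shape, not the paper's schema: the class name, field types, and the seeding rule are hypothetical, and plain set overlap stands in for the embedding-based pruning and seeding the paper describes.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class SkillContract:
    """Hypothetical structured contract lifted from one skill document."""
    name: str
    preconditions: frozenset = frozenset()
    postconditions: frozenset = frozenset()
    modified_state: frozenset = frozenset()
    invariants: frozenset = frozenset()
    domain_scope: str = ""
    action_types: frozenset = frozenset()
    extraction_confidence: float = 0.0


def seed_conflict_pairs(contracts, min_confidence=0.5):
    """Return pairs where one skill modifies state the other modifies or holds invariant.

    Assumed heuristic: low-confidence extractions are skipped rather than trusted.
    """
    pairs = []
    for a, b in combinations(contracts, 2):
        if min(a.extraction_confidence, b.extraction_confidence) < min_confidence:
            continue
        overlap = (a.modified_state & (b.modified_state | b.invariants)) | (
            b.modified_state & a.invariants
        )
        if overlap:
            pairs.append((a.name, b.name, sorted(overlap)))
    return pairs


# Hypothetical three-skill library.
pdf = SkillContract("pdf-export", modified_state=frozenset({"output_dir"}),
                    extraction_confidence=0.9)
archive = SkillContract("archive-cleanup",
                        modified_state=frozenset({"output_dir", "tmp"}),
                        extraction_confidence=0.8)
chart = SkillContract("chart-style", invariants=frozenset({"palette"}),
                      extraction_confidence=0.9)

pairs = seed_conflict_pairs([pdf, archive, chart])
print(pairs)
```

Only the two skills that both write to the same directory are seeded here; a tree search would then spend its query budget expanding from pairs like this one instead of sampling the library uniformly.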

Test Surface

The experiment uses the full 196-skill SkillsBench library as the marketplace and evaluates ten representative tasks with large candidate sets, including domains such as financial analysis, manufacturing, video processing, and document editing. The paper evaluates eight planning agents: four open-weight DS-R1 variants and four proprietary GPT-family variants. GPT-4o-mini is used for intent extraction and judgment, but it is excluded from the planning-agent pool to reduce self-evaluation bias.

The key design choice is execution-free screening. A marketplace may not have every user's deployment environment, credentials, files, network routes, or tool setup. SkillFuzz therefore treats the plan layer as an early-warning surface. It cannot prove that every plan will execute exactly as written, but it can flag combinations whose proposed action path already exceeds the requested task.

Findings

The first result is breadth. Every evaluated planning agent produces a non-zero number of implicit intents, so the effect is not limited to one model family or provider. The paper reports that DS-R1-7B has the largest total intent coverage in Table I, while GPT-4.1-mini produces far fewer discoveries in this setup.

The second result is execution relevance. For real-workflow validation, the authors select the 98 highest-risk flagged co-activations and run them in a sandboxed Docker environment with a fixed Claude-based executor. A GPT-4o-mini trace judge checks whether the predicted intent appears in the execution trace. Table II reports 80.5 percent confirmation for DS-R1-family discoveries, 81.0 percent for GPT-family discoveries, and 80.6 percent overall.

The third result is that search strategy matters. In the strategy comparison, SkillFuzz finds 116 distinct intents with mean drift 0.575 and 90 high-severity cases. Random sampling finds 121 distinct intents, but only 64 high-severity cases and a lower mean drift of 0.471. The authors summarize this as a 41 percent improvement in high-severity discoveries over the random strategy. Broad pairwise coverage alone is not enough: random sampling covers 94.5 percent of pairwise interactions, while SkillFuzz covers 39.7 percent and still finds the more severe cases.
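The 41 percent figure follows directly from the two high-severity counts, and the same numbers show how much denser the SkillFuzz findings are. The short calculation below uses only the counts quoted above; the variable names are illustrative.

```python
# Counts quoted from the strategy comparison.
skillfuzz_distinct, skillfuzz_high = 116, 90
random_distinct, random_high = 121, 64

# Relative gain in high-severity discoveries over random sampling.
improvement = (skillfuzz_high - random_high) / random_high

# Share of each strategy's distinct intents that are high severity.
skillfuzz_share = skillfuzz_high / skillfuzz_distinct
random_share = random_high / random_distinct

print(f"high-severity improvement: {improvement:.1%}")   # 40.6%, reported as 41 percent
print(f"SkillFuzz high-severity share: {skillfuzz_share:.1%}")
print(f"Random high-severity share: {random_share:.1%}")
```

Roughly three in four SkillFuzz discoveries are high severity, against about one in two for random sampling, even though random sampling finds slightly more distinct intents overall.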

The fourth result is taxonomy. Across 1,188 discovered intent texts, the clustered categories are Audio/Video Side-Effect, Unauthorized Tool Invocation, Covert Resource Creation, and Unsanctioned Data Analysis. That taxonomy is useful because it turns an abstract composition warning into review queues a marketplace can understand.

Audit Receipt

A skill-marketplace receipt should record the skill library version, individual skill hashes, extracted contracts, extraction model, confidence scores, task text, candidate-pruning threshold, seed-pair rule, maximum co-activation depth, query budget, baseline plan, tested combinations, drift threshold, novelty threshold, intent texts, severity labels, planner model, judge model, and any execution-validation trace.
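One way to picture such a receipt is as a serializable record. The sketch below covers a subset of the fields listed above; the class name, field names, model labels, and example values are all hypothetical, and SHA-256 is an assumed choice for the skill hashes.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


def skill_hash(skill_text: str) -> str:
    """Content hash for one skill document (SHA-256 is an assumed choice)."""
    return hashlib.sha256(skill_text.encode("utf-8")).hexdigest()


@dataclass
class ScreeningReceipt:
    """Hypothetical audit record for one composition-screening run."""
    library_version: str
    skill_hashes: dict
    task_text: str
    planner_model: str
    judge_model: str
    query_budget: int
    max_coactivation_depth: int
    drift_threshold: float
    novelty_threshold: float
    baseline_plan: list
    tested_combinations: list = field(default_factory=list)
    findings: list = field(default_factory=list)

    def to_json(self) -> str:
        # Sorted keys keep the serialized receipt stable across runs.
        return json.dumps(asdict(self), sort_keys=True, indent=2)


# Hypothetical skills and settings.
skills = {"pdf-export": "Export the report as PDF.",
          "archive-cleanup": "Archive finished outputs."}

receipt = ScreeningReceipt(
    library_version="example-library-v1",
    skill_hashes={name: skill_hash(text) for name, text in skills.items()},
    task_text="Summarize report.csv into summary.md",
    planner_model="example-planner",
    judge_model="example-judge",
    query_budget=100,
    max_coactivation_depth=3,
    drift_threshold=0.5,
    novelty_threshold=0.3,
    baseline_plan=["read report.csv", "write summary.md"],
    tested_combinations=[["pdf-export", "archive-cleanup"]],
    findings=[{"intent": "upload summary.md to remote share", "severity": "high"}],
)

print(receipt.to_json())
```

Hashing each skill file ties the receipt to exact document contents, so a later edit to any skill invalidates the screening result instead of silently inheriting it.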

The receipt should also separate three gates. Per-skill admission asks whether the file is acceptable alone. Composition screening asks whether a combination produces plan-level side effects. Runtime authorization asks whether any proposed tool call or file write is allowed in the user's actual environment. Collapsing those gates is how a marketplace turns modular convenience into hidden delegation.
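The separation of gates can be sketched as three independent checks that each return their own verdict. Everything here is a simplified assumption: the phrase list, drift threshold, tool allowlist, and function names are hypothetical placeholders for whatever a real marketplace and runtime would use.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    gate: str
    passed: bool
    reason: str


def per_skill_admission(skill_text, banned_phrases):
    """Gate 1: is this single skill file acceptable on its own?"""
    text = skill_text.lower()
    hits = [p for p in banned_phrases if p in text]
    return GateResult("per_skill_admission", not hits,
                      f"matched {hits}" if hits else "ok")


def composition_screening(drift, drift_threshold):
    """Gate 2: does the combination push the plan past the drift threshold?"""
    ok = drift <= drift_threshold
    reason = "ok" if ok else f"drift {drift:.2f} exceeds {drift_threshold:.2f}"
    return GateResult("composition_screening", ok, reason)


def runtime_authorization(proposed_calls, allowed_tools):
    """Gate 3: is every proposed tool call allowed in this user's environment?"""
    denied = [c for c in proposed_calls if c not in allowed_tools]
    return GateResult("runtime_authorization", not denied,
                      f"denied {denied}" if denied else "ok")


def review(skill_texts, drift, proposed_calls, *, banned_phrases,
           drift_threshold, allowed_tools):
    """Run all three gates and keep each verdict separate instead of merging them."""
    results = [per_skill_admission(t, banned_phrases) for t in skill_texts]
    results.append(composition_screening(drift, drift_threshold))
    results.append(runtime_authorization(proposed_calls, allowed_tools))
    return results


# Hypothetical bundle: both skills pass alone, the combination does not.
results = review(
    ["Export the report as PDF.", "Archive finished outputs."],
    drift=0.62,
    proposed_calls=["write_file", "http_upload"],
    banned_phrases=["ignore previous instructions"],
    drift_threshold=0.5,
    allowed_tools={"read_file", "write_file"},
)

for r in results:
    print(r.gate, r.passed, r.reason)
```

Both skills pass per-skill admission in this example while the composition and runtime gates fail, which is the case a single merged check would hide.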

Limits

The paper is careful about its limits. Plan-based testing can only see the declared plan; an agent that states an innocuous plan and later does something else would evade this method. The authors also note that GPT-4o-mini is used for extraction and judgment, so residual judge bias cannot be ruled out. The 196-skill marketplace may not represent a larger or more adversarial ecosystem, and operators would need to calibrate budgets and thresholds to their own risk tolerance.

Within those boundaries, the paper gives a concrete governance rule: do not certify skills only one at a time. If an agent runtime supports co-activation, then admission control needs composition tests. The dangerous object may not be a bad skill. It may be a pair, a triple, or a five-skill bundle that creates a plan nobody reviewed.
